[Final Report] Final grant report (November 2019 - January 2020)

juraj.chain · February 4, 2020, 7:51pm

This report summarizes what I (Juraj Hlista) worked on as a grant recipient between November 2019 and January 2020.

Peer management

The testnet has ~10000 dead peers and ~50 live nodes; mainnet has ~25000 dead peers and ~100 live nodes. Most of the dead peers are kept in the nodes' peer pools.

p2p network simulations

Simulations were run in the p2p network simulator with different parameters to see how the p2p network behaves and what could be improved. These parameters were also made configurable in the node (PR #3074).

These changes are slow to deploy: the simulator doesn't reproduce the real environment, where nodes run different versions, so there is a high risk of making the network behave even worse.

Peer connections

To show the number of currently connected outbound/inbound peers, a new periodic log message was added (PR #3029), and the HTTP /status endpoint now includes this information as well (PR #3034).

When the node disconnects from a peer specified in its config, it should reconnect to that peer immediately (PR #3076).

Ping message

The ping message has a maximum number of peers it can contain; however, the node did not enforce this limit (PR #3090).

The peers included in a ping were taken from both the verified and unverified pools, which resulted in the propagation of dead peers. This was changed so that only peers from the verified pool are propagated (PR #3123).
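The two ping rules above can be sketched in a few lines. This is only a Python illustration, not the node's Erlang code: the names `MAX_PING_PEERS`, `validate_ping` and `peers_for_ping` are hypothetical, and the limit value is a placeholder.

```python
import random

MAX_PING_PEERS = 32  # hypothetical value; the real limit is defined by the node


def validate_ping(peers, limit=MAX_PING_PEERS):
    """Reject an incoming ping that carries more peers than the limit allows."""
    if len(peers) > limit:
        raise ValueError(f"ping carries {len(peers)} peers, limit is {limit}")
    return peers


def peers_for_ping(verified, limit=MAX_PING_PEERS):
    """Pick peers to propagate: verified pool only, capped at the limit."""
    candidates = sorted(verified)
    return random.sample(candidates, min(limit, len(candidates)))
```

Because the unverified pool never enters `peers_for_ping`, a dead peer can only be propagated if it was verified at some point.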

Peer pool

The unverified pool includes too many dead peers. To clean them up more quickly, the filtering function now runs more frequently (PR #3117, PR #3138). It checks when each peer was last updated (received via ping) and removes the peer when that period is longer than max_update_lapse.
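A minimal sketch of such a filter, in Python for illustration only. It assumes the pool is a mapping from peer to the timestamp of its last update and that max_update_lapse is expressed in seconds; the function name `prune_unverified` is hypothetical.

```python
import time


def prune_unverified(pool, max_update_lapse, now=None):
    """Return the pool without peers not updated within max_update_lapse.

    pool maps a peer identifier to the time (in seconds) it was last
    received via ping. Both the data layout and the unit are assumptions.
    """
    now = time.time() if now is None else now
    return {peer: ts for peer, ts in pool.items() if now - ts <= max_update_lapse}
```

Running this filter more often does not change which peers qualify for removal, only how soon after expiry they disappear from the pool.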

When the node tries to establish a connection, it selects an available peer from the verified pool first. If there is no available peer, it selects a random peer from the unverified pool (most likely a dead one). To increase the likelihood that the node is able to establish a connection, there is now a mechanism that periodically selects a random peer from the unverified pool and tries to connect to it (just TCP) to find out whether it's alive. If the peer is alive, it's moved to the verified pool, so the node will have an available peer in the verified pool (PR #3135).
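The selection order and the periodic probe might look roughly like this. It is a Python sketch under simplifying assumptions: peers are `(host, port)` tuples, pools are plain sets, every pooled peer counts as "available", and the function names are hypothetical. The liveness check is injectable so the logic can be exercised without a network.

```python
import random
import socket


def tcp_alive(host, port, timeout=2.0):
    """Return True if a plain TCP connection to the peer succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_one(verified, unverified, is_alive=tcp_alive):
    """Probe one random unverified peer; move it to verified if it is alive."""
    if not unverified:
        return None
    peer = random.choice(sorted(unverified))
    if is_alive(*peer):
        unverified.discard(peer)
        verified.add(peer)
    return peer


def select_peer(verified, unverified):
    """Prefer a verified peer; fall back to a random unverified one."""
    pool = verified or unverified
    return random.choice(sorted(pool)) if pool else None
```

Calling `probe_one` on a timer keeps the verified pool stocked, so `select_peer` rarely has to fall back to the unverified pool.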

The next step would probably be a reputation system for peers: if a connected peer starts to behave maliciously, it should be disconnected and its IP banned for a while…

Sync

Chain generator

This is an internal tool used for load tests. It’s possible to specify how many and what transactions are sent to a node during a predefined time interval and retrieve some statistics at the end of the test. This tool was modified so it can also be used for chain generation and measuring how long it takes to sync the generated chain with a node started with an empty database (PR #2991). Right now, only spend transactions are supported.

Parallel block/transaction validations

There have been complaints about the sync being slow, so some measurements were taken (PR #3046, PR #3059). They show which parts of the sync could be improved (block and transaction validations).

Block and transaction validations (especially signature verification) are expensive and sometimes performed twice (PR #3060). There is one Erlang process, aec_conductor, which controls the addition of new blocks to the chain. On the other hand, there are multiple Erlang processes (one per peer connection) that receive blocks during the sync. Some validations done in aec_conductor can be moved into those peer connection processes. This would offload the aec_conductor process, letting it handle more requests to add a new block, while the validations run in parallel in the peer connection processes. This task is not finished yet; it requires changes in the most critical part of the node, which validates what is written to the database (WIP branch).
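The intended split can be illustrated with a small Python analogy. It is not the node's code: worker threads stand in for the peer connection processes, the `Conductor` class stands in for aec_conductor, and the block format and checks are invented for the example. Python threads also do not give the CPU parallelism that Erlang processes do, so the sketch shows only the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def validate_block(block):
    """Stand-in for expensive, stateless checks such as signature verification."""
    return block["height"] >= 0 and block["sig"] == "ok"


class Conductor:
    """Single writer: appends already validated blocks in order."""

    def __init__(self):
        self.chain = []

    def add_block(self, block):
        # Only the cheap, state-dependent check remains in the single writer.
        if block["height"] != len(self.chain):
            raise ValueError("block does not extend the chain")
        self.chain.append(block)


def sync(blocks, conductor, workers=4):
    """Validate blocks in worker threads, then add them serially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(validate_block, blocks))  # keeps input order
    for block, ok in zip(blocks, results):
        if not ok:
            break  # stop at the first invalid block
        conductor.add_block(block)
    return len(conductor.chain)
```

The design point is that stateless checks need no access to the chain state, so they can run wherever the block arrives, while the single writer keeps only the checks that depend on that state.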

Other

Miner signalled consensus upgrade - presentation

This was the presentation at the public meetup in Sofia.

Investigation of sync issues

There have been situations where the node just stopped syncing or got disconnected due to:

database inconsistency error;
network issues;
trying to connect to dead peers.

HTTP /status endpoint - miner activated protocol

When a node switches to a new protocol activated by miner signalling, this protocol will be visible in the HTTP /status endpoint (PR #2982).