[Completed] Aeternity node maintenance - iris hard fork release candidate

dimitar.chain · September 14, 2020, 10:04am

Our progress for the past week:

Ulf Wiger @uwiger

Worked 39 hours, mainly on issue #3283 - Expose chain events in contract calls. A Work-In-Progress PR has been pushed. Events are collected and can be subscribed to as internal events. Some work is still required to debug the HTTP endpoint for dry-run, where chain events can now optionally be reported. Some review and discussion is also needed regarding the format of events, and whether additional event types should be reported.

Dincho Todorov @dincho.chain

Worked 25 hours. He prepared the infrastructure for a release, healed the nodes and increased their disk size. He also investigated the infrastructure alerts and spent some time on GitHub issue #3301 - HTTP cache tests.

Dimitar Ivanov @dimitar.chain

Worked 44 hours, mostly spent adding more tests to the aesc_utils_tests suite. I’ve identified some improvement points and added missing function specs. I’ve also prepared the 5.5.5 release PR.

gorbak25 · September 14, 2020, 10:34am

Hi!

Regarding maintenance, the priorities of the Hyperchain project are as follows. Maintenance tasks are grouped into 3 categories:

Required for hyperchains (in order of decreasing priority):

Onchain protocol fixes:
If an issue/bug is discovered which affects the onchain protocol, then it should be fixed ASAP. This includes bugs in the FATE VM etc…
Fast synchronization:
Right now all nodes operate as “archive” nodes. Most people want to quickly get the latest state, so optional fast synchronization algorithms need to be implemented and offered to users/node operators. After fast sync is done, it would be nice to optionally sync older states (configurable policies as in geth).
Dead peer eviction:
Currently only 1% of the peers in our peer pool are alive. We need to evict dead peers more quickly and validate newly gossiped peers; a queue of unchecked peers waiting to be validated seems like a good idea.
Peer pool persistence:
Currently, after the node restarts, we start retrieving the peer pool from scratch. This is bad: we need to persist the peer pool and possibly revalidate the peers after restarting.
Client endpoint for retrieving the peer pool status
Currently there is no easy way for a node maintainer to retrieve the list of peers in the peer pool besides attaching a console to the Erlang node and writing some code… This essentially means that it is hard to analyze the status of the network and provide people with seed nodes not affiliated with the Anstalt.
Decrease the reliance on centralized seed nodes for bootstrapping the node - maintain and provide the community with the peer list (possibly posting the list in the forum, a smart contract etc…)
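
To make the dead peer eviction and peer pool persistence items above more concrete, here is a minimal sketch in Python. It is not how the node's Erlang peer pool works; the `PeerPool` class, the `probe` callback and the `max_failures` threshold are all hypothetical names chosen for illustration. It only shows the idea: gossiped peers wait in an unchecked queue until a probe confirms them, verified peers that keep failing are evicted, and peers restored after a restart are revalidated instead of being trusted.

```python
import json
from collections import deque


class PeerPool:
    """Toy peer pool: gossiped peers wait in a queue until a probe confirms them."""

    def __init__(self, probe, max_failures=3):
        self.probe = probe                # hypothetical liveness check: callable(peer) -> bool
        self.max_failures = max_failures  # assumed eviction threshold
        self.unchecked = deque()          # peers heard via gossip, not yet validated
        self.verified = {}                # peer -> number of consecutive failed probes

    def add_gossiped(self, peer):
        if peer not in self.verified and peer not in self.unchecked:
            self.unchecked.append(peer)

    def validate_next(self):
        """Probe one queued peer; only live peers enter the verified set."""
        if not self.unchecked:
            return None
        peer = self.unchecked.popleft()
        if self.probe(peer):
            self.verified[peer] = 0
        return peer

    def recheck(self):
        """Re-probe verified peers and evict those that keep failing."""
        for peer in list(self.verified):
            if self.probe(peer):
                self.verified[peer] = 0
            else:
                self.verified[peer] += 1
                if self.verified[peer] >= self.max_failures:
                    del self.verified[peer]

    def dump(self):
        """Serialize the verified peers so a restart need not start from scratch."""
        return json.dumps(sorted(self.verified))

    def load(self, data):
        """Restored peers are not trusted: they go back through the unchecked queue."""
        for peer in json.loads(data):
            self.add_gossiped(peer)


alive = {"peer-a", "peer-c"}
pool = PeerPool(probe=lambda peer: peer in alive)
for name in ["peer-a", "peer-b", "peer-c"]:
    pool.add_gossiped(name)
while pool.validate_next() is not None:
    pass
# peer-b never answered, so it never entered the verified set

alive.discard("peer-c")      # peer-c goes offline
for _ in range(3):
    pool.recheck()           # after max_failures failed probes, peer-c is evicted
snapshot = pool.dump()       # only peers that are still verified are written out

restored = PeerPool(probe=lambda peer: peer in alive)
restored.load(snapshot)      # restored peers start in the unchecked queue
```

Sending restored peers back through the queue is a deliberate choice in this sketch: a persisted list may be stale or tampered with, so nothing read from disk is treated as verified until it has been probed again.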

Nice to have

Regulating the naming system
Deprecate AEVM (maybe remove existing AEVM contracts after a public governance vote?)
Querying remote nodes for their version

Not necessary

All bugs in the client software bundled inside the node - stratum, SC FSM, etc…

dimitar.chain · September 14, 2020, 11:28am

Thank you @gorbak25 for the feedback! The idea of this proposal is to help the AE ecosystem, and since the Hyperchain project is one of the most promising ones out there, we will give your set of required issues the highest priority. Once we clear those, we will proceed with the rest of the tasks in the proposal. I’ve created issues accordingly and included them in the maintenance project scope. Please note that the dead peers one is already included.

@Lydia and @Tina please consider those tasks as part of the proposal above as well.

Tina · September 14, 2020, 2:01pm

Hi @dimitar.chain @gorbak25
Thank you for the proposal and for sharing the ideas here. The foundation can prolong the maintenance contract for the next month. Please discuss and finalize the proposal so that the Hyperchains project can benefit from the maintenance work.

Best
Tina

dimitar.chain · September 14, 2020, 2:44pm

@gorbak25 for completeness regarding your bullet points from your required section:

Yes, those must be fixed ASAP, please consult the bigger chunk of the proposal above.

Yes, this would indeed be great to have. Since this is a non-technical issue, I cannot track it in GitHub. Note that this depends heavily on the other peer tasks - the dead peers one and the endpoint one. I consider it dangerous to do before those are done, and we will not start making forum posts with seed peers until they are in place.

@Tina from the post above, the proposal is for 2 months:

As stated, I am afraid we won’t be able to solve some bigger issues, esp. regarding dead peers and the fast sync. I think the dead peers issue is currently the most important one out there, and it could hurt the network. We didn’t touch it in the previous grant period because I was concerned that it could easily spill over one month.

gorbak25 · September 14, 2020, 2:53pm

Really looking forward to getting fast sync implemented - otherwise, without hosting a trusted and centralized DB backup service, HC stakers will need to wait weeks for sync to complete
Yup, onchain bugs (with the exception of deprecated features such as AEVM) need to be fixed ASAP.
Decentralizing the seed nodes can indeed only be done after we fix all sync related issues - I wonder how the central part of our node ended up in such a bad state

One more thing I forgot to add to the above list of tasks (it is “a nice to have” but not required):

Upgrade RocksDB -> This is low-hanging fruit, and the current version slows down Q/A significantly.

uwiger · September 14, 2020, 2:53pm

This will be especially true now, since we have some other non-trivial high-priority issues to tackle as well. It’s also not just a question of man-hours: we will want to utilize the (considerable) expertise of @hanssv.chain, and he has limited availability, so this will add lead time.

dimitar.chain · September 14, 2020, 2:57pm

Well, it is working suboptimally, but it is still working, and there have been just a handful of issues so far. The key is the “so far” part.

…and yes, a RocksDB upgrade is on the list, please consult the first part of the proposal above.

gorbak25 · September 14, 2020, 3:00pm

Well, it’s hard to call something “working suboptimally” when in fact the node is incapable of syncing with alive peers other than our seed nodes, as our “ping a random node” has only a 1% chance of succeeding…
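
To put the 1% figure in perspective, here is a small back-of-the-envelope calculation in Python. It assumes that each ping picks a peer independently and uniformly at random from the pool (with replacement), which is a simplification and not a description of the node's actual peer selection. The function names are made up for illustration.

```python
import math


def chance_of_live_peer(alive_fraction, pings):
    """Probability that at least one of `pings` independent random pings hits a live peer."""
    return 1.0 - (1.0 - alive_fraction) ** pings


def pings_needed(alive_fraction, target):
    """Smallest number of pings whose success probability reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log1p(-alive_fraction))


ALIVE = 0.01  # the 1% figure quoted above

for attempts in (1, 10, 100):
    print(attempts, round(chance_of_live_peer(ALIVE, attempts), 3))

print(pings_needed(ALIVE, 0.5))
```

Under these assumptions it takes 69 random pings to have an even chance of reaching a single live peer, and 100 pings on average to hit one.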

dincho.chain · September 14, 2020, 3:00pm

This is not fully true, as there is a garbage collector that can be enabled to purge old records in the database. However, nodes cannot sync in “light” mode.

dimitar.chain · September 14, 2020, 3:36pm

Indeed, it is a bug, and there have been attempts in the past to fix it. Some issues were found in both the peer pool and sync, and fixing those possibly had some positive effects, but they didn’t solve the main issue there. It is not going to be trivial to validate a probable fix, and this would require some involvement of @dincho.chain as well. One knows an issue is hard when you need Dincho’s assistance.

hanssv.chain · September 14, 2020, 4:09pm

If you configure your node properly (or not at all), you will hit the seed nodes immediately. They are generously configured, so I’m not sure it is such a big issue as you make it out to be. Just for the sake of this discussion I restarted my own node (had 23 inbound and 24 outbound connections)… Within seconds I had 15+ inbound connections, and within a couple of minutes I had close to the numbers that I had before shutting down.

That said, it surely is high time for some improvements, it has been on the agenda several times before but other things got higher priority at the time…

Tina · September 15, 2020, 10:27am

@dimitar.chain and the team,
please send the foundation your new proposal with the task list (please make sure the requests from the Hyperchains are thoroughly discussed and prioritized).
Thank you.

uwiger · September 21, 2020, 9:58am

I have updated PR #3353 so that it seems to pass with some very simplistic additions to the aehttp_contracts_SUITE. See the comment in the PR, and please chime in on whether you would like to see some different format in the HTTP API.

uwiger · September 24, 2020, 4:46pm

I’ve pushed a first iteration of a PR for an external HTTP endpoint listing the connected peers.

github.com/aeternity/aeternity

External endpoint: /peers/connected

aeternity:master ← aeternity:gh3357-peers-endpoints

opened 04:38PM - 24 Sep 20 UTC

uwiger

+157 -21

See issue #3357

There _is_ an internal endpoint, `/debug/peers`, which basically returns the same thing (at least the pubkeys of connected peers). I added an external endpoint which returns an array of objects, currently containing `pub_key`, `host` and `port` of the relevant peers.

Example, from the new test `aehttp_integration_SUITE:get_connected_peers/1`:

```erlang
connected_peers (dev1) = [{default,
  {ok,200,
   #{<<"connected_peers">> =>
      [#{<<"host">> => <<"localhost">>,
         <<"port">> => 3025,
         <<"pub_key">> => <<"pp_23YdvfRPQ1b1AMWmkKZUGk2cQLqygQp55FzDWZSEUicPjhxtp5">>},
       #{<<"host">> => <<"localhost">>,
         <<"port">> => 3035,
         <<"pub_key">> => <<"pp_2M9oPohzsWgJrBBCFeYi3PVT4YF7F2botBtq6J1EGcVkiutx3R">>}]}}},
 {all,
  {ok,200,
   #{<<"connected_peers">> =>
      [#{<<"host">> => <<"localhost">>,
         <<"port">> => 3025,
         <<"pub_key">> => <<"pp_23YdvfRPQ1b1AMWmkKZUGk2cQLqygQp55FzDWZSEUicPjhxtp5">>},
       #{<<"host">> => <<"localhost">>,
         <<"port">> => 3035,
         <<"pub_key">> => <<"pp_2M9oPohzsWgJrBBCFeYi3PVT4YF7F2botBtq6J1EGcVkiutx3R">>}]}}},
 {inbound,
  {ok,200,
   #{<<"connected_peers">> =>
      [#{<<"host">> => <<"localhost">>,
         <<"port">> => 3025,
         <<"pub_key">> => <<"pp_23YdvfRPQ1b1AMWmkKZUGk2cQLqygQp55FzDWZSEUicPjhxtp5">>},
       #{<<"host">> => <<"localhost">>,
         <<"port">> => 3035,
         <<"pub_key">> => <<"pp_2M9oPohzsWgJrBBCFeYi3PVT4YF7F2botBtq6J1EGcVkiutx3R">>}]}}},
 {outbound,
  {ok,200,
   #{<<"connected_peers">> =>
      [#{<<"host">> => <<"localhost">>,
         <<"port">> => 3025,
         <<"pub_key">> => <<"pp_23YdvfRPQ1b1AMWmkKZUGk2cQLqygQp55FzDWZSEUicPjhxtp5">>},
       #{<<"host">> => <<"localhost">>,
         <<"port">> => 3035,
         <<"pub_key">> => <<"pp_2M9oPohzsWgJrBBCFeYi3PVT4YF7F2botBtq6J1EGcVkiutx3R">>}]}}}]
```

I added a three-node configuration for the `peer_endpoints` test group, in preparation for further tests. Also, the peer pool persistence issue (#3356) may warrant other parameters for each peer, which could be added to the result.

Comments on content and format are welcome.
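
For anyone who wants to consume this from a script, here is a minimal Python sketch of reading such a response. It assumes the HTTP body is the JSON equivalent of the Erlang map in the PR description (a `connected_peers` array of objects with `host`, `port` and `pub_key`); the format is still under review, so treat the field names as provisional. The `peer_uris` helper and the `aenode://` output format are illustrative choices, not part of the PR.

```python
import json

# Sample body, assumed to be the JSON form of the test output quoted in the PR
SAMPLE_BODY = """
{"connected_peers": [
  {"host": "localhost", "port": 3025,
   "pub_key": "pp_23YdvfRPQ1b1AMWmkKZUGk2cQLqygQp55FzDWZSEUicPjhxtp5"},
  {"host": "localhost", "port": 3035,
   "pub_key": "pp_2M9oPohzsWgJrBBCFeYi3PVT4YF7F2botBtq6J1EGcVkiutx3R"}
]}
"""


def peer_uris(body):
    """Turn a connected-peers response body into a list of peer address strings."""
    peers = json.loads(body)["connected_peers"]
    return [
        "aenode://{}@{}:{}".format(peer["pub_key"], peer["host"], peer["port"])
        for peer in peers
    ]


for uri in peer_uris(SAMPLE_BODY):
    print(uri)
```

A list in this shape is the kind of thing that could later be shared as community-maintained seed peers, once the dead peer and validation issues are addressed.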

dimitar.chain · September 28, 2020, 4:03pm

Our progress for the past week

Ulf Wiger @uwiger

The active PR on exposing chain events in contract calls now passes tests locally. Some caches probably need updating for CI to pass. Awaiting comments on which.
A first iteration of presenting connected peers in the REST API has been pushed. Some initial comments received from @gorbak25 will be addressed in further revisions.
Also started looking at fast synch. This is a complex issue, and will take time. In parallel, some initial work was done on fixing the error arising when starting from a DB backup and then synching. The problem may actually lie in the synch code - awaiting some feedback from @hanssv.chain. Finally, some work was also done on peer pool persistence. This is also a complex issue, since a bad implementation could well make nodes more - not less - vulnerable to certain kinds of attack.

Time spent: 38.1 hours

Dimitar Ivanov @dimitar.chain

I have been working on the dead peers issue. I did a thorough analysis and ran some tests. I’ve identified two distinct issues that could be causing the bug.

Time spent: 36.25 hours

@hanssv.chain’s contract is still not approved by the Foundation, so he did not work on this last week. @dincho.chain didn’t contribute either, as his time was consumed by another project.

uwiger · September 29, 2020, 7:22am

I have removed the WIP label from the PR to expose chain transactions in contract calls.

Please give feedback on format and scope. Are there other txs that you would like to see exposed?
Cc @hanssv.chain @marco.chain @philipp.chain

dimitar.chain · September 29, 2020, 7:24am

also @karol.chain and @Arthur as the MDW would index those

marco.chain · September 29, 2020, 12:37pm

can’t give feedback about the format right now. but in general I think we should be able to discover all transactions that can be performed within contract calls.

e.g. we will probably perform lots of AENS related transactions within a smart contract in future versions of aenalytics and it would be nice to be able to see what’s going on “behind the scenes” (without having to rely on contract events which can be used as workaround right now).

nice to see that this topic is finally being tackled!

gorbak25 · September 29, 2020, 12:49pm

Yup, especially since I’ve started playing with a spend-to-many contract on mainnet:
https://mainnet.aeternal.io/contracts/transactions/ct_6BmTGCnxjXn9quDgR8cAhzoW7nvQLuLfTrDNBpkK9oyCf7zwB
And I have no idea whether those transfers got properly executed…