[Completed] Aeternity node maintenance - iris hard fork release candidate

gorbak25 · September 10, 2020, 10:46am

Hi!

I’m speaking as the current lead of the Hyperchain project. I want to emphasize the importance and priority of the maintenance project. It’s not about introducing new features but about keeping the AE ecosystem alive. Currently Aeternity is not only developing new cutting edge products like Hyperchains or Superhero but is also a service provider - SDK, Middleware, Seed Nodes, DB snapshots, Monitoring etc… This proposal is in simple terms “Hey, we need to keep our Core Infrastructure Alive, have someone ready who can fix something in case of an emergency and fix existing bugs”
CC: @Lydia @Tina @YaniUnchained

If the 2 month extension is not approved(possibly THIS week, a simple “Hey, please work on this while we handle the bureaucracy” will be enough) then my team will need to do a lot of those tasks in the scope of Hyperchains in order to release a finished product, which will extend the ETA for releasing hyperchains by possibly months. What I would really like to see done(which can be labelled as General Node Maintenace) before releasing HC is:

Rocksdb upgrade -> performance will increase and the Q/A process will be speed up which will save us a lot of time
Drop windows support -> I don’t think anybody is using that, will speed up Q/A
Transient failures in the SC test suite -> those tests slow us down due to the possibility of rerunning the entire Q/A process
Sync: cleanup dead peers -> This needs to be done because curently we practically never evict dead peers from the peer pool and we only have 1% of active peers here -> this essentially makes the AE network centralized and unsafe…
Sync: fast sync -> Sync can take weeks… We can speed up things by compromising security slightly - this would allow us to drop the centralized DB backup service…
Sync: peer persistance -> If you restart the node then you need to sync the peer pool again which essentially opens you up to eclipse attacks, on the other hand because only 1% of the peers in the pool are actually active this essentially would mean that after an restart it would be inpossible to sync…
Deprecate AEVM -> it clogs up the codebase and should never be used in HC as we have the FATE VM
Make inner transaction of PayingForTx non-valid -> This needs to be fixed as this bug will propagate to all Hyprchains
FATE cannot get blockhash of current generation -> This decreases usefulness of Sophia smart contracts
Crash in aec_chain_metrics_probe
Dev mode -> Actually we started implementing more or less this because otherwise we are unable to test HC properly - currently @radrow.chain is refactoring the SC chain simulator to allow it to be used in the scope of HC

There are other issues which the HC team could tackle but they can be postponed for later(not necessary for the MVP or HC). Keep in mind that any bug in the Node will propagate to Hyperchains and it will be hard to fix them later in hyperchains as we have no control over each individual hyperchain.

Best Regards,
Grzegorz

dimitar.chain · September 10, 2020, 11:04am

Although this had been discussed many times already, those are not even tracked as issues.

radrow.chain · September 10, 2020, 11:11am

I totally support all the proposed tasks. They are all very valuable, and some of them are completely necessary to me (like dev mode (however I am working on something similar at this moment), rocksdb update, fast sync, not even mentioning bugfixes).

Healthy ecosystem is crucial for all of the development we are doing here – not only limited to Hyperchains or Superhero. Writing more serious things requires more serious testing and more flexible (and bug free) environment. While I was working on the staking contract I really felt some of these issues being a chain on my feet – especially the testing part. We get really distracted by situations when something fails in the network and requires discussing what is the maintenance team allowed to fix and what is not. In my opinion, some emergency maintenance budget should be set as well. It is very important to speed up the approval process, as it is mostly work that is required to do other tasks. The HC team has its own things to do and won’t be able to handle all the issues mentioned here keeping reasonable delivery time. And especially, we can’t just ignore them because we don’t want them to propagate into HCs (like for example AEVM support).

From my side as an iris target I would also add Create contracts from other contracts · Issue #197 · aeternity/aesophia · GitHub – this would have a huge impact on aepps development and would drastically increase reliability of the repetetive smart contract models (like bonding curve tokens or hyperchains staking contracts).

This is not an iris target (cause it doesn’t need a hard fork), but will be priceless during further smart contract development: FATE debugger · Issue #201 · aeternity/aesophia · GitHub.

marco.chain · September 10, 2020, 12:39pm

I’d love to see those tasks being approved. very nice to see increasing activity of the core team in the forum!

we (kryptokrauts) need the iris hard fork as soon as possible to be able to introduce cool features in regards to the naming system (e.g. name extender, name bazaar)

hanssv.chain · September 10, 2020, 2:41pm

Just so you don’t misunderstand the “Deprecate AEVM” task, for Hyperchains you can remove AEVM fully, but the Aeternity core node has to keep it. But there won’t be any new AEVM contracts allowed on chain.

uwiger · September 11, 2020, 8:22pm

I have pushed a WIP (Work In Progress) PR for exposing chain events from contract calls.
There are still some issues, e.g. when returning events to the HTTP client.

There may also be some event needed at contract setup (see @hanssv.chain comments).
If the maintenance project is extended, I can continue next week.

dimitar.chain · September 14, 2020, 10:04am

Our progress for the past week:

Ulf Wiger @uwiger

Had worked 39 hours. Mainly worked on Issue #3283 - Expose chain events in contract calls. A Work-In-Progress PR has been pushed. Events are collected and can be subscribed to as internal events. Some work is still required in order to debug the HTTP endpoint for dry-run, where chain events can now be optionally reported. Also, some review and discussion is needed regarding the format of events, and whether some additional event types should be reported.

Dincho Todorov @dincho.chain

Had worked 25 hours. He prepared infrastructure for a release and healed the nodes, he increased their disk size. He investigated the infrastructure alerts and he spent some time on GitHub issue 3301 - HTTP cache tests

Dimitar Ivanov @dimitar.chain

Had worked 44 hours. Those were mostly spent on adding more and more tests to the aesc_utils_tests suite. I’ve identified some improvement points and added missing function specs. I’ve also prepared the 5.5.5 release PR.

gorbak25 · September 14, 2020, 10:34am

Hi!

Regarding maintenance the priorities of the Hyperchain projects are as follows. Maintenance tasks are grouped in 3 categories:

Required for hyperchains(order of decreasing priorities):

Onchain protocol fixes:
If a issue/bug is discovered which affects the onchain protocol then this should be fixed ASAP. This includes bugs in the FATE VM etc…
Fast synchronization:
Right now all nodes operate as “archive” nodes, most people want to quickly get the latest state so optional fast synchronization algorithms need to be implemented and provided as an option for users/node operators - after fast sync is done it would be nice to optionally sync older states(configurable policies as in geth)
Dead peer eviction:
Currently only 1% of the peers in our peer pool is alive - we need to quicker evict dead peers and validate newly gossiped peers - a queue of unchecked peers to be validated seems like a good idea.
Peer pool persistance:
Currently after the node restarts we start retrieving the peer pool from scratch - this is bad, we need to persist the peer pool and possibly after restarting revalidate the peers
Client endpoint for retrieving the peer pool status
Currently there is no easy way for a node maintainer to retrieve the list of peers in the peer pool besides attaching a console to the erlang node and writing some code… This essentially means that it is hard to analyze the status of the network and provide people with seed nodes not affiliated with the Ansalt
Decrease the reliance on centralized seed nodes for bootstraping the node - maintain and provide the community with the peer list(possibly posting the list in the forum, a smart contract etc…)

Nice to have

Regulating the naming system
Deprecate AEVM (maybe remove existing aevm contracts after a public governance vote?)
Querying remote nodes for their version

Not necessary

All bugs in the client software bundled inside the node - stratum, SC FSM, etc…

dimitar.chain · September 14, 2020, 11:28am

Thank you @gorbak25 for the feedback! The idea of this proposal is to help the AE Ecosystem and since the Hyperchain project is one of the most promising ones out there, we are to put your set of required issues to be with highest priority. Once we clear those we will proceed with the rest of the tasks in the proposal. I’ve created issues accordingly and included them in the maintenance project scope. Please note that the dead peers’ one is already included.

@Lydia and @Tina please consider those tasks as part of the proposal above as well.

Tina · September 14, 2020, 2:01pm

Hi @dimitar.chain @gorbak25
Thank you for the proposal and sharing the ideas here. The foundation can prolong the maintenance contract for the next month. Please discuss and finalize the proposal which Hyperchains project can benefit from the maintenance work.

Best
Tina

dimitar.chain · September 14, 2020, 2:44pm

@gorbak25 for completeness regarding your bullet points from your required section:

Yes, those must be fixed ASAP, please consult the bigger chunk of the proposal above.

Yes, this would indeed be great to have. Since this is a non-technical issue, I can not track this in GitHub. Note that this depends heavily on the other peers’ tasks - the dead peers one and the endpoint sone. I consider this to be dangerous before that and we will not start doing those forum posts with seed peers before those are in place.

@Tina from the post above, the proposal is for 2 months:

As stated, I am afraid we won’t be able to solve some bigger issues, esp. regarding dead peers and the fast sync. I think the dead peers is currently the most important issue out there and it could hurt the network. We didn’t touch it in the previous grant period because I am concerned that it could easily spill over one month.

gorbak25 · September 14, 2020, 2:53pm

Really looking forward to get fast sync implemented - otherwise without hosting a trustful and centralized DB backup service HC stakers will need to wait weeks for sync to complete
Yup onchain bugs (with exception to deprecated features as AEVM need to be fixed ASAP).
Decentralizing the seed nodes can indeed only be done after we fix all sync related issues - I wonder how the central part of our node ended up in such bad state

One more thing I forget to add to the above list of tasks (it is “a nice to have” but not required):

upgrade Rocksdb -> This is a low hangling fruit which slows down Q/A significantly.

uwiger · September 14, 2020, 2:53pm

This will be especially true now, since we have some other non-trivial high-priority issues to tackle as well. It’s also not just just a question of man-hours: we will want to utilize the (considerable) expertise of @hanssv.chain, and he has limited availability, so this will add lead time.

dimitar.chain · September 14, 2020, 2:57pm

Well it is working suboptimal but it is still working and there had been just a handful of issues so far The key is the so far part.

…and yes, a RocksDB upgrade is on the list, please consult the first part of the proposal above.

gorbak25 · September 14, 2020, 3:00pm

Well, it’s hard to call something “working suboptimally” when in fact the node is incapable to sync with alive peers other than our seed nodes as our “ping a random node” has only a 1% chance of succeeding…

dincho.chain · September 14, 2020, 3:00pm

This is not fully true, as there is garbage collector that can be enabled to purge old records in the database. However, nodes cannot sync in “light” mode.

dimitar.chain · September 14, 2020, 3:36pm

Indeed it is a bug and there had been attempts in the past to fix it. There had been some issues found both in peer pool and sync and possibly those had some positive effects but they didn’t solve the main issue there. It is not going to be trivial to validate a probable fix and this would require some involvement of @dincho.chain as well. One knows an issue is hard when you need Dincho’s assistance.

hanssv.chain · September 14, 2020, 4:09pm

If you configure your node properly (or not at all) you will hit the seed nodes immediately, they are generously configured so I’m not sure it isn’t such a big issue as you pretend it to be. Just for the sake of this discussion I restarted my own node (had 23 inbound and 24 outbound connections)… Within seconds I had 15+ inbound connections and within a couple of minutes I had close to the numbers that I had before shutting down.

That said, it surely is high time for some improvements, it has been on the agenda several times before but other things got higher priority at the time…

Tina · September 15, 2020, 10:27am

@dimitar.chain and the team,
please send the foundation your new proposal with the task list (please make sure the requests from the Hyperchains are throughly discussed, prioritized).
Thank you.

uwiger · September 21, 2020, 9:58am

I have updated PR #3353 so that it seems to pass with some very simplistic additions to the aehttp_contracts_SUITE. See the comment in the PR, and please chime in on whether you would like to see some different format in the HTTP API.