[Active] Aeternity node maintenance - iris hard fork release candidate

@hanssv.chain and @marco.chain Thank you for the reply!
@uwigeroferlang.chain Please assign this issue.
Who can take over the important Sync: cleanup dead peers #3290 ?

Done, and Hans and Radek removed as assignees.

2 Likes

I also suggest not moving to OTP 23 now, unless a new language feature from OTP 23 is a hard requirement, which doesn’t seem to be the case. Migrating to OTP 22.3 is already a lot of work, considering that all dependencies need to be updated and potentially fixed too. It is a stable target to work against, whereas OTP 23 is still fresh and there will be bugs and incompatibilities ahead.

5 Likes

Hi guys,

I have been working on state channels for quite some time, and I have been noticing a few issues here and there (such as the socket sometimes getting disconnected), but I could not really figure out the root cause and hence did not post. All I can say is that it does not seem to be a node issue.

I am in touch with a company in India with whom I am working in parallel on using the state channel protocol for a use case. We have also added a couple of RPCs and are preparing a demo call with you guys so that I can explain the need for those RPCs; after that we will raise the PR if it makes sense.

AE state channels are currently in the spotlight, at least within my network here in India, since I have been promoting them for quite some time.

6 Likes

Our progress for the past week:

@uwigeroferlang.chain worked 36.5 hours
PR #3292 Pluggable core functionality
This feature is a cornerstone of the Hyperchains work, and has now been merged into master (cooperation between the Hyperchains team and the maintenance project)
PR #3294 Use parse_transform w -pluggable() attrs
This PR is a prerequisite for #3292 above, and has now been merged into master (cooperation between the Hyperchains team and the maintenance project)
PR #3341 Update deps and CircleCI for OTP 22
CI is now up and running for Aeternity on OTP 22. We disabled a job for OTP 23, since there are still some build issues there.
Issue #3283 Expose chain “transactions” in contract calls
This is progressing, but not yet ready.

@dimitar.chain worked 38.75 hours:
ForceProgress transaction has no info on-chain #3229
Finalised it
Missing test SUITE: aesc_utils #3285
I’ve added a few dozen tests, but a few dozen more are yet to be added. I’ve found some small improvement points in the code and addressed them accordingly.

@dincho.chain spent 15 hours on modifying the CI to make it run OTP 22 by default. Docker builds were adjusted as well.

4 Likes

Since this is our last week on this proposal, on Monday we will share our last progress.

In the past month we’ve accomplished a lot and we are happy with our progress, especially given it was Ulf and me doing the coding and Dincho the DevOps. On this basis we would like to share with you our proposal for the next 2 months. We propose a longer timeframe so we can tackle some bigger tasks. What is more, Hans can help us out as well. At the moment he cannot dedicate more than 16h a week; hopefully this will change for the better.

Below you can find our horizon of tasks for the next 2 months. Please note that we do not commit to completing all of them in the timeframe; rather, this is the order in which we would tackle them.

So this is our proposal :slight_smile: It is up to the foundation to decide if they would like to support it or not. cc @Lydia and @Tina

Ulf Wiger @uwigeroferlang.chain

Update rocksdb to 6.4.6

The latest version of erlang-rocksdb supports Rocksdb 6.5.2. Our system currently uses erlang-rocksdb 0.24.0, which uses Rocksdb 5.15.10. A new release should be forthcoming, also adapting the Erlang part to OTP 23. We want to move to a newer Rocksdb not least because Rocksdb takes up a large part of the Aeternity build time. Also, lots of bugfixes and performance improvements have been introduced in later Rocksdb versions.

When syncing from backup, accept previous states in DB if they don’t differ

This would improve things for the Middleware, avoiding unnecessary problems during database import.

Rest API endpoints version prefix

This is regular technical debt, and should be fixed.

Dev mode

Supporting “dev mode” (fake) mining instead of running light cuckoo cycle mining. A prototype for this can be said to exist in the test suites, where this is achieved through mocking.

Data and log locations should be configurable from other location

This would be helpful for plugin applications, and should not be too hard to implement.

Unhandled error in aec_chain_metrics_probe

Probably a rare error, and the crash itself should be easy to fix. However, the origin of the error is unknown, so testing may be a bit tricky, and addressing the root cause even more so. What we can begin to do is to make the metric probe more robust.
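One way to read "more robust" here is a probe that never takes its caller down when a single sample fails. The following is only a minimal sketch of that idea in Python, not the code of aec_chain_metrics_probe; the function name `safe_sample` and its `last_good` parameter are hypothetical.

```python
import logging

logger = logging.getLogger("metrics_probe")


def safe_sample(read_metric, last_good=None):
    """Call `read_metric`; if it raises, log the error and return `last_good`.

    `read_metric` is any zero-argument callable that returns a metric value.
    The names here are hypothetical and only illustrate the pattern.
    """
    try:
        return read_metric()
    except Exception:
        # Record the full traceback so the unknown root cause can be studied later.
        logger.exception("metric probe failed; keeping previous value")
        return last_good
```

The trade-off is that a swallowed error is easier to overlook, so the logged traceback is what keeps the root cause investigable.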

More flexible/file-less configuration

This would simplify testing and deployment of closed systems, and should be easy to implement (testing may take a little bit more time).

Allow configuration by OS environment variables

This would simplify test setup and development environments. The best way to address it may be to refactor some of the legacy code which checks configuration data. The methods of handling config data evolved over time, and the code reflects this.
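As a rough illustration of what environment-variable configuration could look like (a sketch in Python under stated assumptions, not the node's actual mechanism): variables with a given prefix are mapped onto nested configuration keys. The `AE__` prefix, the `__` separator and the key names used below are hypothetical.

```python
import json


def apply_env_overrides(config, environ, prefix="AE__"):
    """Return a copy of `config` with overrides taken from `environ`.

    With the hypothetical naming scheme used here, a variable such as
    AE__HTTP__EXTERNAL__PORT=8080 sets config["http"]["external"]["port"].
    """
    result = json.loads(json.dumps(config))  # deep copy of JSON-like data
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        path = [part.lower() for part in name[len(prefix):].split("__") if part]
        if not path:
            continue
        try:
            value = json.loads(raw)  # numbers, booleans, lists, objects
        except ValueError:
            value = raw  # anything else is kept as a plain string
        node = result
        for key in path[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                child = {}
                node[key] = child
            node = child
        node[path[-1]] = value
    return result
```

In a test setup this would be called as `apply_env_overrides(file_config, os.environ)`, so that a CI job can change a single setting without writing a config file.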

aehttp_sc_SUITE failure: timeout waiting for channel open messages

This bug was detected during the maintenance project, and causes intermittent failures in the CI. It should be fixed, and should not take more than 1-2 man-days.

The following issues are broken-down tasks from the already approved issue #3194 (https://github.com/aeternity/aeternity/issues/3194)

State Channels: Inactivity timer in chain watcher

State Channels: Client can ask FSM to quit waiting for minimum depth

State Channels: modifiable minimum_depth default

Hans Svenson @hanssv.chain

FATE cannot get blockhash of current generation

This is an outright bug that should be fixed.

AENS: Review and simplify pointers

Currently name pointers allow too much freedom for the user to be creative. This should be revisited.

Make inner transaction of PayingForTx non-valid

This is a bug in the PayingForTx that would render it useless. The attack vector is described in the GitHub issue. This must be done before Iris release.

AENS: Increase the name expiry time

This is something that came up a few times in the forum already: name expiration was never decided by the public. The idea here is to allow the community to vote on when names should expire.

AENS: Fix bug in AENS.update signature check

This is a bug, it must be fixed.

Deprecate AEVM properly for Iris

This one is a technical debt, it should be resolved ASAP.

Dincho Todorov @dincho.chain

Dincho would be providing us with his DevOps skills so he is needed all over the tasks, really. When he is not overloaded with work, he will be cleaning the issues assigned to him:

Dimitar Ivanov @dimitar.chain

Sync: cleanup dead peers

This bug has been around for a long time now. This would be my priority task. There have been a few attempts to expose the bug; so far all of those exposed some issues but didn’t solve it. It is a black box issue and we do not know how much time and effort it will require to fix. It might take 2 weeks or over a month, and how long it takes will determine my availability for the rest of the tasks. A few more issues might be created from this one. I will need Dincho’s help here as well.

HTTP Websockets upgrade regression

This bug is breaking some of the tools used by SRE and should be low-hanging fruit.

Out of sync /status endpoint data

This is a curious bug that points to a race condition in the code. The result is a confusing API that is hard to reason about.

aec_chain_state infinity restarts and crashes

The error recovery mechanism seems to be broken. The issue is not marked as a bug, but it clearly is one. This could result in filling one’s HDD with garbage logs.

meta_tx’s TTL

This bug could result in unexpected results when using generalised accounts: the TTL being used is the one authenticating the inner transaction but it must be the other way around.

Test suite bugs

aest_channels_SUITE ==> test_simple_different_nodes_channel: FAILED badmatch

aehttp_sc_SUITE ==> plain.with_open_channel.sc_ws_update_abort: FAILED timeout

Those are bugs in the test setup.

Drop “native” windows support

Bring the discussion to the forum on whether the community needs the Windows build and, if not, deprecate it.

8 Likes

Hi!

I’m speaking as the current lead of the Hyperchain project. I want to emphasize the importance and priority of the maintenance project. It’s not about introducing new features but about keeping the AE ecosystem alive. Currently Aeternity is not only developing new cutting edge products like Hyperchains or Superhero but is also a service provider - SDK, Middleware, Seed Nodes, DB snapshots, Monitoring etc… This proposal is in simple terms “Hey, we need to keep our Core Infrastructure Alive, have someone ready who can fix something in case of an emergency and fix existing bugs”
CC: @Lydia @Tina @yani.chain

If the 2 month extension is not approved (possibly THIS week; a simple “Hey, please work on this while we handle the bureaucracy” will be enough), then my team will need to do a lot of those tasks in the scope of Hyperchains in order to release a finished product, which will extend the ETA for releasing Hyperchains by possibly months. What I would really like to see done (which can be labelled as General Node Maintenance) before releasing HC is:

  • Rocksdb upgrade -> performance will increase and the Q/A process will be sped up, which will save us a lot of time
  • Drop windows support -> I don’t think anybody is using that, will speed up Q/A
  • Transient failures in the SC test suite -> those tests slow us down due to the possibility of rerunning the entire Q/A process
  • Sync: cleanup dead peers -> This needs to be done because currently we practically never evict dead peers from the peer pool and we only have 1% of active peers here -> this essentially makes the AE network centralized and unsafe…
  • Sync: fast sync -> Sync can take weeks… We can speed up things by compromising security slightly - this would allow us to drop the centralized DB backup service…
  • Sync: peer persistence -> If you restart the node then you need to sync the peer pool again, which essentially opens you up to eclipse attacks. On the other hand, because only 1% of the peers in the pool are actually active, this essentially would mean that after a restart it would be impossible to sync…
  • Deprecate AEVM -> it clogs up the codebase and should never be used in HC as we have the FATE VM
  • Make inner transaction of PayingForTx non-valid -> This needs to be fixed as this bug will propagate to all Hyperchains
  • FATE cannot get blockhash of current generation -> This decreases usefulness of Sophia smart contracts
  • Crash in aec_chain_metrics_probe
  • Dev mode -> Actually we started implementing more or less this because otherwise we are unable to test HC properly - currently @radrow is refactoring the SC chain simulator to allow it to be used in the scope of HC

There are other issues which the HC team could tackle, but they can be postponed for later (not necessary for the MVP or HC). Keep in mind that any bug in the Node will propagate to Hyperchains, and it will be hard to fix such bugs later in Hyperchains as we have no control over each individual hyperchain.

Best Regards,
Grzegorz

7 Likes

Although this had been discussed many times already, those are not even tracked as issues.

5 Likes

I totally support all the proposed tasks. They are all very valuable, and some of them are completely necessary to me (like dev mode (however I am working on something similar at this moment), rocksdb update, fast sync, not even mentioning bugfixes).

A healthy ecosystem is crucial for all of the development we are doing here – not only limited to Hyperchains or Superhero. Writing more serious things requires more serious testing and a more flexible (and bug-free) environment. While I was working on the staking contract I really felt some of these issues being a chain on my feet – especially the testing part. We get really distracted by situations when something fails in the network and we have to discuss what the maintenance team is allowed to fix and what it is not. In my opinion, some emergency maintenance budget should be set as well. It is very important to speed up the approval process, as it is mostly work that is required to do other tasks. The HC team has its own things to do and won’t be able to handle all the issues mentioned here while keeping a reasonable delivery time. And especially, we can’t just ignore them because we don’t want them to propagate into HCs (like for example AEVM support).

From my side as an iris target I would also add https://github.com/aeternity/aesophia/issues/197 – this would have a huge impact on aepps development and would drastically increase the reliability of repetitive smart contract models (like bonding curve tokens or hyperchains staking contracts).

This is not an iris target (because it doesn’t need a hard fork), but will be priceless during further smart contract development: https://github.com/aeternity/aesophia/issues/201.

7 Likes

I’d love to see those tasks being approved. very nice to see increasing activity of the core team in the forum! :slight_smile:

we (kryptokrauts) need the iris hard fork as soon as possible to be able to introduce cool features in regards to the naming system (e.g. name extender, name bazaar)

6 Likes

Just so you don’t misunderstand the “Deprecate AEVM” task, for Hyperchains you can remove AEVM fully, but the Aeternity core node has to keep it. But there won’t be any new AEVM contracts allowed on chain.

5 Likes

I have pushed a WIP (Work In Progress) PR for exposing chain events from contract calls.
There are still some issues, e.g. when returning events to the HTTP client.

There may also be some event needed at contract setup (see @hanssv.chain comments).
If the maintenance project is extended, I can continue next week.

3 Likes

Our progress for the past week:

Ulf Wiger @uwigeroferlang.chain

Worked 39 hours, mainly on Issue #3283 - Expose chain events in contract calls. A Work-In-Progress PR has been pushed. Events are collected and can be subscribed to as internal events. Some work is still required in order to debug the HTTP endpoint for dry-run, where chain events can now be optionally reported. Also, some review and discussion is needed regarding the format of events, and whether some additional event types should be reported.

Dincho Todorov @dincho.chain

Worked 25 hours. He prepared infrastructure for a release and healed the nodes by increasing their disk size. He also investigated the infrastructure alerts and spent some time on GitHub issue 3301 - HTTP cache tests.

Dimitar Ivanov @dimitar.chain

Worked 44 hours. Those were mostly spent on adding more and more tests to the aesc_utils_tests suite. I’ve identified some improvement points and added missing function specs. I’ve also prepared the 5.5.5 release PR.

1 Like

Hi!

Regarding maintenance, the priorities of the Hyperchain project are as follows. Maintenance tasks are grouped in 3 categories:

Required for hyperchains (in order of decreasing priority):

  • Onchain protocol fixes:
    If an issue/bug is discovered which affects the onchain protocol then it should be fixed ASAP. This includes bugs in the FATE VM etc…
  • Fast synchronization:
    Right now all nodes operate as “archive” nodes. Most people want to get the latest state quickly, so fast synchronization algorithms need to be implemented and offered as an option for users/node operators. After fast sync is done, it would be nice to optionally sync older states (configurable policies, as in geth)
  • Dead peer eviction:
    Currently only 1% of the peers in our peer pool are alive. We need to evict dead peers more quickly and validate newly gossiped peers; a queue of unchecked peers to be validated seems like a good idea.
  • Peer pool persistence:
    Currently, after the node restarts, we start retrieving the peer pool from scratch. This is bad: we need to persist the peer pool and possibly revalidate the peers after restarting
  • Client endpoint for retrieving the peer pool status
    Currently there is no easy way for a node maintainer to retrieve the list of peers in the peer pool besides attaching a console to the Erlang node and writing some code… This essentially means that it is hard to analyze the status of the network and provide people with seed nodes not affiliated with the Anstalt
  • Decrease the reliance on centralized seed nodes for bootstrapping the node: maintain and provide the community with the peer list (possibly by posting the list in the forum, in a smart contract, etc…)
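The dead peer eviction and peer pool persistence items above can be sketched roughly as follows. This is a toy model in Python, not the node's sync code: the class `PeerPool`, the `probe` callback and the `max_failures` threshold are all hypothetical names chosen for illustration. Gossiped peers wait in an unchecked queue until a probe succeeds, verified peers are evicted after repeated failed probes, and persisted peers are revalidated after a restart instead of being trusted blindly.

```python
import json
import time
from collections import deque


class PeerPool:
    """Toy peer pool: gossiped peers wait in a queue until a probe succeeds."""

    def __init__(self, probe, max_failures=3, clock=time.time):
        self.probe = probe              # callable(address) -> bool, e.g. a ping
        self.max_failures = max_failures
        self.clock = clock
        self.unchecked = deque()        # gossiped, not yet validated
        self.verified = {}              # address -> {"last_seen", "failures"}

    def add_gossiped(self, address):
        if address not in self.verified and address not in self.unchecked:
            self.unchecked.append(address)

    def validate_next(self):
        """Probe one queued peer; only responsive peers enter the pool."""
        if not self.unchecked:
            return None
        address = self.unchecked.popleft()
        if self.probe(address):
            self.verified[address] = {"last_seen": self.clock(), "failures": 0}
        return address

    def recheck(self):
        """Re-probe verified peers and evict those that keep failing."""
        for address in list(self.verified):
            entry = self.verified[address]
            if self.probe(address):
                entry["last_seen"] = self.clock()
                entry["failures"] = 0
            else:
                entry["failures"] += 1
                if entry["failures"] >= self.max_failures:
                    del self.verified[address]

    def dump(self):
        """Serialise the verified peers so they survive a restart."""
        return json.dumps(sorted(self.verified))

    def load(self, blob):
        """After a restart, treat persisted peers as unchecked again."""
        for address in json.loads(blob):
            self.add_gossiped(address)
```

Requiring several consecutive failures before eviction is a design choice in this sketch: it avoids dropping a peer over a single transient network error, at the cost of keeping a dead peer slightly longer.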

Nice to have

  • Regulating the naming system
  • Deprecate AEVM (maybe remove existing aevm contracts after a public governance vote?)
  • Querying remote nodes for their version

Not necessary

  • All bugs in the client software bundled inside the node - stratum, SC FSM, etc…

6 Likes

Thank you @gorbak25 for the feedback! The idea of this proposal is to help the AE Ecosystem, and since the Hyperchain project is one of the most promising ones out there, we will give your set of required issues the highest priority. Once we clear those we will proceed with the rest of the tasks in the proposal. I’ve created issues accordingly and included them in the maintenance project scope. Please note that the dead peers one is already included.

@Lydia and @Tina please consider those tasks as part of the proposal above as well.

2 Likes

Hi @dimitar.chain @gorbak25
Thank you for the proposal and for sharing the ideas here. The foundation can prolong the maintenance contract for the next month. Please discuss and finalize the proposal so that the Hyperchains project can benefit from the maintenance work.

Best
Tina

@gorbak25 for completeness regarding your bullet points from your required section:

Yes, those must be fixed ASAP, please consult the bigger chunk of the proposal above.

Yes, this would indeed be great to have. Since this is a non-technical issue, I cannot track it in GitHub. Note that this depends heavily on the other peer-related tasks: the dead peers one and the endpoint one. I consider this to be dangerous before those are done, and we will not start doing those forum posts with seed peers before they are in place.

@Tina from the post above, the proposal is for 2 months:

As stated, I am afraid we won’t be able to solve some bigger issues, esp. regarding dead peers and the fast sync. I think the dead peers is currently the most important issue out there and it could hurt the network. We didn’t touch it in the previous grant period because I am concerned that it could easily spill over one month.

6 Likes

Really looking forward to getting fast sync implemented - otherwise, without hosting a trusted and centralized DB backup service, HC stakers will need to wait weeks for sync to complete :frowning:
Yup, onchain bugs (with the exception of deprecated features such as AEVM) need to be fixed ASAP.
Decentralizing the seed nodes can indeed only be done after we fix all sync related issues - I wonder how the central part of our node ended up in such a bad state :frowning:

One more thing I forgot to add to the above list of tasks (it is “a nice to have” but not required):

  • upgrade Rocksdb -> This is low-hanging fruit which slows down Q/A significantly.

3 Likes

This will be especially true now, since we have some other non-trivial high-priority issues to tackle as well. It’s also not just a question of man-hours: we will want to utilize the (considerable) expertise of @hanssv.chain, and he has limited availability, so this will add lead time.

1 Like

Well, it is working suboptimally, but it is still working and there have been just a handful of issues so far :slight_smile: The key is the “so far” part.

…and yes, a RocksDB upgrade is on the list, please consult the first part of the proposal above.