[Completed] Aeternity node maintenance - iris hard fork release candidate

With CircleCI back up and running, we have identified an intermittent failure in the aehttp_sc_SUITE (State Channels websocket API test suite).

An issue has been created: Issue #3336. I think it would make sense to include that in the maintenance project, so that we can fix it right away. I estimate ca 1-2 man-day(s) of work.

As far as we can tell, it is a bug in the test suite, and not an underlying problem with the FSM implementation. But intermittently failing test runs become a drag on overall productivity.

2 Likes

Our progress for the past week:

@uwiger worked 32.5 hours:
Merged PRs:
Break out chain simulator from chain watcher suite. Fixes: #3322 (Pull Request #3329, aeternity/aeternity)

Reviewed PRs:

Identified new issue affecting CI, mentioned in the forum thread suggesting that we solve it as part of the maintenance project.

Started wiki page describing the challenges of issue #3194

@dimitar.chain worked 36.25 hours:

  • the PRs from the last week were polished a bit and merged
  • SC: websocket is not closed on “invalid fsm id” #3163
    This was not a bug. I ran different tests and was never able to reproduce it. On the contrary, both the code and the test coverage look good.
  • SC: Feature addition, allow merchant to subscribe to incoming client requests #3069
    As Ulf commented, this had been implemented some time ago. There weren’t any WebSocket tests related to this feature, so I wrote some. These will produce log examples as well.
  • ForceProgress transaction has no info on-chain #3229
    This again is already implemented and test coverage is good. I ran some additional tests to make sure that the tests really do what they claim to. (Actually I finished with this today, so it is partially in the next week as well)

@dincho.chain worked 4 hours:

  • Add simplified CircleCI release workflow #3339 (it is not yet reviewed)

I have created a number of issues, breaking down issue #3194:

I would like acknowledgement from the Foundation that I’m cleared to add these issues to the project scope (strictly speaking, they are a refinement of the scope), and to start working on them. Cc @lydia

The priority in the maintenance project is to fix all the bugs and technical issues. It seems that the issues list is increasing due to the state channels. Presently there is no application using them. Therefore this can be done at a later stage and not in the present project.

AF Board welcomes applications on the state channels! Maybe Dimitar can reconsider his decision and make the app. We will be happy to support any state channels application.

@uwiger please tell us why we are updating to OTP 22. Can we not move to OTP 23? What are the problems with OTP 23? There are still unsolved bugs in OTP 22 too that can create problems. What stability criteria did you use? Can we also update deps and CircleCI configuration to OTP 23? Do we have tests to compare with the future releases of OTP 23? Please report your test experience to all of us.

The issue list is only increasing here because I broke down the already approved issue into smaller tasks, which individually can help improve the situation. It would allow for a more limited effort while still adding benefit (i.e. not all those tasks need to be completed in order to deliver improvement).

Having said this, I welcome concrete suggestions on issues that should get higher priority. I intend to do some digging myself and try to come up with some.

And we do have active State Channel users here on the forum. The proposal [withdrawn for other reasons] to create a State Channel Mobile Client was well received, and Hypermine (e.g. @vishwas_hypermine) has actively posted on the forum about their State Channel work, as well as talked about their use of State Channels.

In terms of OTP 22, I have pushed a PR, but this ran into an issue with CircleCI. I am waiting for @dincho.chain to take a look at it.

Regarding OTP 23, I suggest that we wait at least until 23.1 comes out. In the initial tests, the build failed due to a compiler bug. This was quickly fixed by OTP, but this sort of thing is not especially unexpected when trying out an X.0 version of OTP. We are also waiting for updates to erlang-rocksdb that would make it easier to build cleanly on OTP 23.

Even so, our intention is to get CI up and running also using OTP 23. The priority for now should be OTP 22, since it is more mature, and has all the things we currently need.

2 Likes

Here is one issue I’ve been thinking about picking up:

@hanssv.chain and I had a discussion about it before the holidays, and agreed on an approach.

1 Like

The “Expose chain transactions in contract calls” is definitely a good candidate. It is one of those embarrassing technical issues/debt we have around. (Not only saying so since I wrote the issue :slight_smile: )

5 Likes

this is a really important issue! :+1:

I absolutely support this and also the new middleware should be able to handle that information as soon as possible @karol.chain

3 Likes

@hanssv.chain and @marco.chain Thank you for the reply!
@uwiger Please assign this issue.
Who can take over the important Sync: cleanup dead peers #3290 ?

Done, and Hans and Radek removed as assignees.

2 Likes

I also suggest not to move onto OTP 23 now, unless any new language feature from OTP 23 is a hard requirement, which doesn’t seem to be the case. Migrating to OTP 22.3 is a lot of work already, considering all dependencies need to be updated and potentially fixed too. It is a stable target to work against, whereas OTP 23 is still fresh and there will be bugs and incompatibilities ahead.

5 Likes

Hi guys,

I have been working on State Channels for quite some time, and have been noticing a few issues here and there (like the socket sometimes getting disconnected, etc.), but could not really figure out the root cause and hence did not post. All I can say is that it does not seem to be a node issue.

I am in touch with a company in India with whom I am working in parallel on using the state channel protocol for a use case. We have also added a couple of RPCs and are preparing a demo call with you guys so that I can explain the need for those RPCs; after that we will raise the PR if it makes sense.

AE state channels are currently in the spotlight, at least within my network here in India, since I have been promoting them for quite some time.

6 Likes

Our progress for the past week:

@uwiger worked 36.5 hours
PR #3292 Pluggable core functionality
This feature is a cornerstone of the Hyperchains work, and has now been merged into master (cooperation between the Hyperchains team and the maintenance project)
PR #3294 Use parse_transform w -pluggable() attrs
This PR is a prerequisite for #3292 above, and has now been merged into master (cooperation between the Hyperchains team and the maintenance project)
PR #3341 Update deps and CircleCI for OTP 22
CI is now up and running for Aeternity on OTP 22. We disabled a job for OTP 23, since there are still some build issues there.
Issue #3283 Expose chain “transactions” in contract calls
This is progressing, but not yet ready.

@dimitar.chain worked 38.75 hours:
ForceProgress transaction has no info on-chain #3229
Finalised it
Missing test SUITE: aesc_utils #3285
I’ve added a few dozen tests, with a few dozen more still to be added. I also found some small improvement points in the code and addressed them accordingly.

@dincho.chain spent 15 hours on modifying the CI to make it run OTP22 by default. Docker builds were adjusted as well.

4 Likes

Since this is our last week on this proposal, on Monday we will share our last progress.

In the past month we’ve accomplished a lot and we are happy with our progress there, especially given it was Ulf and me doing the coding and Dincho the DevOps. On this basis we would like to share with you our proposal for the next 2 months. We propose a bigger timeframe so we can tackle some bigger tasks. What is more, Hans can help us out as well. At the moment he cannot dedicate more than 16h a week; hopefully this will change for the better.

Below you can find our horizon of tasks for the next 2 months. Please note that we don’t commit to completing all of them within the timeframe; rather, this is the order in which we would tackle the tasks.

So this is our proposal :slight_smile: It is up to the foundation to decide if they would like to support it or not. cc @Lydia and @Tina

Ulf Wiger @uwiger

Update rocksdb to 6.4.6

The latest version of erlang-rocksdb supports Rocksdb 6.5.2. Our system currently uses erlang-rocksdb 0.24.0, which uses Rocksdb 5.15.10. A new release should be forthcoming, also adapting the Erlang part to OTP 23. We want to move to a newer Rocksdb not least because Rocksdb takes up a large part of the Aeternity build time. Also, lots of bugfixes and performance improvements have been introduced in later Rocksdb versions.

When syncing from backup, accept previous states in DB if they don’t differ

This would improve things for the Middleware, avoiding unnecessary problems during database import.

Rest API endpoints version prefix

This is regular technical debt, and should be fixed.

Dev mode

Supporting “dev mode” (fake) mining instead of running light cuckoo cycle mining. A prototype for this can be said to exist in the test suites, where this is achieved through mocking.

Data and log locations should be configurable from other location

This would be helpful for plugin applications, and should not be too hard to implement.

Unhandled error in aec_chain_metrics_probe

Probably a rare error, but should be easy to fix. Though the origin of the error is unknown, so testing may be a bit tricky, and addressing the root cause even more so. What we can begin to do is to make the metric probe more robust.

More flexible/file-less configuration

This would simplify testing and deployment of closed systems, and should be easy to implement (testing may take a little bit more time).

Allow configuration by OS environment variables

This would simplify test setup and development environments. The best way to address it may be to refactor some of the legacy code which checks configuration data. The methods of handling config data evolved over time, and the code reflects this.
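To make the idea concrete, here is a minimal sketch of how environment variables could be layered over a parsed configuration. It is written in Python purely for illustration (the node itself is Erlang), and the `NODE__` prefix, the double-underscore path separator, and the function names are all assumptions made for this example, not an existing Aeternity convention.

```python
import copy


def _coerce(raw):
    """Turn a raw environment string into a bool or int when it looks like one."""
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_env_overrides(config, environ, prefix="NODE__"):
    """Return a copy of `config` with values overridden from `environ`.

    Hypothetical convention: NODE__SYNC__PORT=3115 overrides
    config["sync"]["port"]. Variables without the prefix are ignored.
    """
    result = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        path = [part.lower() for part in name[len(prefix):].split("__") if part]
        if not path:
            continue
        node = result
        # Walk (and create, if needed) the nested sections leading to the key.
        for key in path[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                child = {}
                node[key] = child
            node = child
        node[path[-1]] = _coerce(raw)
    return result


if __name__ == "__main__":
    file_config = {"sync": {"port": 3015}, "mining": {"autostart": False}}
    env = {"NODE__SYNC__PORT": "3115", "NODE__MINING__AUTOSTART": "true"}
    print(apply_env_overrides(file_config, env))
```

Keeping the override step as a pure function over a dictionary means a test can pass in a fake environment instead of mutating the real one, which is the kind of simplification for test setups that this task is after.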

aehttp_sc_SUITE failure: timeout waiting for channel open messages

This bug was detected during the maintenance project, and causes intermittent failures in the CI. It should be fixed, and should not take more than 1-2 man-days.

The following issues are broken-down tasks from the already approved issue #3194 (Relax restriction that channel cannot be used before min_depth · Issue #3194 · aeternity/aeternity · GitHub)

State Channels: Inactivity timer in chain watcher

State Channels: Client can ask FSM to quit waiting for minimum depth

State Channels: modifiable minimum_depth default

Hans Svensson @hanssv.chain

FATE cannot get blockhash of current generation

This is an outright bug that should be fixed.

AENS: Review and simplify pointers

Currently name pointers allow too much freedom for the user to be creative. This should be revisited.

Make inner transaction of PayingForTx non-valid

This is a bug in the PayingForTx that would render it useless. The attack vector is described in the GitHub issue. This must be done before Iris release.

AENS: Increase the name expiry time

This is something that came up a few times in the forum already: name expiration was never decided by the public. The idea here is to allow the community to vote on when names should expire.

AENS: Fix bug in AENS.update signature check

This is a bug, it must be fixed.

Deprecate AEVM properly for Iris

This one is technical debt; it should be resolved ASAP.

Dincho Todorov @dincho.chain

Dincho would be providing us with his DevOps skills so he is needed all over the tasks, really. When he is not overloaded with work, he will be cleaning the issues assigned to him:

Dimitar Ivanov @dimitar.chain

Sync: cleanup dead peers

This bug has been around for a long time now. This would be my priority task. There have been a few attempts to expose the bug; so far all of them exposed some issues but didn’t solve it. It is a black-box issue and we do not know how much time and effort it will require to fix. It might take 2 weeks or over a month; how long it takes will determine my availability for the rest of the tasks. A few more issues might be created from this one. I will need Dincho’s help here as well.

HTTP Websockets upgrade regression

This bug is breaking some of the tools used by SRE and should be a low hanging fruit.

Out of sync /status endpoint data

This is a curious bug that points to a race condition in the code. The result is a confusing API that is hard to reason about.

aec_chain_state infinity restarts and crashes

The error recovery mechanism seems to be broken. It is not marked as a bug, but it clearly is one. This could result in filling one’s HDD with garbage logs.

meta_tx’s TTL

This bug could result in unexpected results when using generalised accounts: the TTL being used is the one authenticating the inner transaction but it must be the other way around.

Test suite bugs

aest_channels_SUITE ==> test_simple_different_nodes_channel: FAILED badmatch

aehttp_sc_SUITE ==> plain.with_open_channel.sc_ws_update_abort: FAILED timeout

Those are bugs in the test setup.

Drop “native” windows support

Bring the discussion to the forum on whether the community needs the Windows build and, if not, deprecate it.

8 Likes

Hi!

I’m speaking as the current lead of the Hyperchain project. I want to emphasize the importance and priority of the maintenance project. It’s not about introducing new features but about keeping the AE ecosystem alive. Currently Aeternity is not only developing new cutting edge products like Hyperchains or Superhero but is also a service provider - SDK, Middleware, Seed Nodes, DB snapshots, Monitoring etc… This proposal is in simple terms “Hey, we need to keep our Core Infrastructure Alive, have someone ready who can fix something in case of an emergency and fix existing bugs”
CC: @Lydia @Tina @YaniUnchained

If the 2 month extension is not approved (possibly THIS week; a simple “Hey, please work on this while we handle the bureaucracy” will be enough), then my team will need to do a lot of those tasks in the scope of Hyperchains in order to release a finished product, which will extend the ETA for releasing Hyperchains by possibly months. What I would really like to see done (which can be labelled as General Node Maintenance) before releasing HC is:

  • Rocksdb upgrade -> performance will increase and the Q/A process will be sped up, which will save us a lot of time
  • Drop windows support -> I don’t think anybody is using that, will speed up Q/A
  • Transient failures in the SC test suite -> those tests slow us down due to the possibility of rerunning the entire Q/A process
  • Sync: cleanup dead peers -> This needs to be done because currently we practically never evict dead peers from the peer pool and we only have 1% of active peers here -> this essentially makes the AE network centralized and unsafe…
  • Sync: fast sync -> Sync can take weeks… We can speed up things by compromising security slightly - this would allow us to drop the centralized DB backup service…
  • Sync: peer persistence -> If you restart the node then you need to sync the peer pool again, which essentially opens you up to eclipse attacks; on the other hand, because only 1% of the peers in the pool are actually active, this essentially would mean that after a restart it would be impossible to sync…
  • Deprecate AEVM -> it clogs up the codebase and should never be used in HC as we have the FATE VM
  • Make inner transaction of PayingForTx non-valid -> This needs to be fixed as this bug will propagate to all Hyperchains
  • FATE cannot get blockhash of current generation -> This decreases usefulness of Sophia smart contracts
  • Crash in aec_chain_metrics_probe
  • Dev mode -> Actually we started implementing more or less this because otherwise we are unable to test HC properly - currently @radrow.chain is refactoring the SC chain simulator to allow it to be used in the scope of HC
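As an illustration of the "Sync: cleanup dead peers" item in the list above, the following is a rough sketch of one possible eviction policy for a peer pool. It is written in Python for readability and does not describe how the Aeternity sync code works; the class names, the failure threshold, and the idle timeout are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    address: str
    last_seen: float          # timestamp of the last successful contact
    failed_attempts: int = 0  # consecutive failed connection attempts


class PeerPool:
    """Toy peer pool that evicts peers which look dead.

    A peer is treated as dead (an assumed policy, for illustration only)
    when it has failed `max_failures` times in a row, or when it has not
    been seen for more than `max_idle_seconds`.
    """

    def __init__(self, max_failures=3, max_idle_seconds=3600.0):
        self.max_failures = max_failures
        self.max_idle_seconds = max_idle_seconds
        self._peers = {}

    def add(self, address, now):
        self._peers.setdefault(address, Peer(address, last_seen=now))

    def record_success(self, address, now):
        peer = self._peers.get(address)
        if peer is not None:
            peer.last_seen = now
            peer.failed_attempts = 0

    def record_failure(self, address):
        peer = self._peers.get(address)
        if peer is not None:
            peer.failed_attempts += 1

    def evict_dead(self, now):
        """Remove dead peers and return their addresses."""
        dead = [
            p.address
            for p in self._peers.values()
            if p.failed_attempts >= self.max_failures
            or now - p.last_seen > self.max_idle_seconds
        ]
        for address in dead:
            del self._peers[address]
        return dead

    def addresses(self):
        return sorted(self._peers)
```

Passing `now` in explicitly, rather than reading the clock inside the pool, keeps the policy deterministic and easy to exercise in a test suite.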

There are other issues which the HC team could tackle, but they can be postponed for later (not necessary for the MVP or HC). Keep in mind that any bug in the Node will propagate to Hyperchains, and it will be hard to fix them later in Hyperchains as we have no control over each individual hyperchain.

Best Regards,
Grzegorz

7 Likes

Although this had been discussed many times already, those are not even tracked as issues.

5 Likes

I totally support all the proposed tasks. They are all very valuable, and some of them are completely necessary to me (like dev mode (however I am working on something similar at this moment), rocksdb update, fast sync, not even mentioning bugfixes).

A healthy ecosystem is crucial for all of the development we are doing here – not only limited to Hyperchains or Superhero. Writing more serious things requires more serious testing and a more flexible (and bug-free) environment. While I was working on the staking contract I really felt some of these issues being a chain on my feet – especially the testing part. We get really distracted by situations when something fails in the network and requires discussing what the maintenance team is allowed to fix and what it is not. In my opinion, some emergency maintenance budget should be set as well. It is very important to speed up the approval process, as it is mostly work that is required to do other tasks. The HC team has its own things to do and won’t be able to handle all the issues mentioned here while keeping a reasonable delivery time. And especially, we can’t just ignore them because we don’t want them to propagate into HCs (like for example AEVM support).

From my side as an iris target I would also add Create contracts from other contracts · Issue #197 · aeternity/aesophia · GitHub – this would have a huge impact on aepps development and would drastically increase reliability of the repetitive smart contract models (like bonding curve tokens or hyperchains staking contracts).

This is not an iris target (cause it doesn’t need a hard fork), but will be priceless during further smart contract development: FATE debugger · Issue #201 · aeternity/aesophia · GitHub.

7 Likes

I’d love to see those tasks being approved. very nice to see increasing activity of the core team in the forum! :slight_smile:

we (kryptokrauts) need the iris hard fork as soon as possible to be able to introduce cool features in regards to the naming system (e.g. name extender, name bazaar)

6 Likes

Just so you don’t misunderstand the “Deprecate AEVM” task, for Hyperchains you can remove AEVM fully, but the Aeternity core node has to keep it. But there won’t be any new AEVM contracts allowed on chain.

5 Likes

I have pushed a WIP (Work In Progress) PR for exposing chain events from contract calls.
There are still some issues, e.g. when returning events to the HTTP client.

There may also be some event needed at contract setup (see @hanssv.chain comments).
If the maintenance project is extended, I can continue next week.

3 Likes