Thursday 11 May 2017

akka-remote and akka-cluster 2.4.x "network" issues

We ran into an interesting issue with cross-node delivery time in an akka-cluster environment.

Our network easily supports 1 Gbit/s of traffic, but once traffic reached 200+ Mbit/s, a simple

actorRef ! Message

to another cluster node took 30+ seconds to be delivered.

Unfortunately, akka doesn't provide any built-in instruments for latency monitoring, so we started by blaming the network.

Using Wireshark, we found that at times our system was producing up to 30 messages per millisecond. At the TCP level, delivery time was <= 1 ms, but at the akka level it sometimes jumped to more than 30 seconds.

After some investigation we found the root cause:

akka does serialisation in a single thread. On multi-core servers this can easily become a bottleneck, because many threads can produce messages faster than one thread can serialise them.
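
To see how a small capacity gap turns into a delay of tens of seconds, here is a rough back-of-the-envelope sketch. Only the 30 messages per millisecond figure comes from our measurements; the 50 microsecond serialisation cost and the 60 second burst length are assumptions picked purely for illustration, and `Backlog` is a made-up name.

```scala
// Back-of-the-envelope model of a single serialisation thread falling behind.
object Backlog {
  // arrivalPerMs: messages produced per millisecond (30 in our case)
  // costMicros:   assumed time to serialise one message, in microseconds
  // burstMs:      assumed duration of the sustained burst, in milliseconds
  // Returns how long the last queued message waits, in seconds.
  def waitSeconds(arrivalPerMs: Double, costMicros: Double, burstMs: Double): Double = {
    // One thread can serialise at most this many messages per millisecond.
    val capacityPerMs = 1000.0 / costMicros
    // Messages still queued when the burst ends (never negative).
    val backlog = math.max(0.0, (arrivalPerMs - capacityPerMs) * burstMs)
    // The thread needs costMicros for each queued message.
    backlog * costMicros / 1e6
  }

  def main(args: Array[String]): Unit = {
    // Assumed numbers: 50 us per message, burst sustained for 60 s.
    val seconds = waitSeconds(arrivalPerMs = 30.0, costMicros = 50.0, burstMs = 60000.0)
    println(s"last message waits about $seconds s")
  }
}
```

With these assumed numbers the thread handles 20 messages per millisecond, falls behind by 10 per millisecond, and the last message waits about 30 seconds, even though the network itself is nowhere near saturated.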

1. Workaround: try migrating to akka 2.5.0 and use the inbound and outbound lanes feature (the inbound-lanes and outbound-lanes settings of the new Artery remoting). Be aware that the akka documentation for it starts with "# WARNING: This feature is not supported yet. Don't use other value than 1." Use it at your own risk.
2. How to diagnose the issue: we use a small hack to measure the trip time. In the serializer we use for akka-remote, we inject the time of serialisation on the sending node and record the time of deserialisation on the receiving node. The difference is the travel time between nodes.
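
The timestamp hack from point 2 could look roughly like the sketch below. It is not our production code: `TravelTime`, `stamp` and `unstamp` are hypothetical names, and the idea is that a custom akka-remote serializer would call `stamp` at the end of `toBinary` and `unstamp` at the start of `fromBinary`. It also assumes the clocks on both nodes are synchronised (for example via NTP), otherwise the difference is meaningless.

```scala
import java.nio.ByteBuffer

// Hypothetical helper, not part of akka: prefixes a serialised payload
// with the wall-clock time at which it was serialised.
object TravelTime {
  private val HeaderSize = 8 // one Long

  // Sending side: prepend the current time in milliseconds to the payload.
  def stamp(payload: Array[Byte], nowMillis: Long = System.currentTimeMillis()): Array[Byte] =
    ByteBuffer.allocate(HeaderSize + payload.length)
      .putLong(nowMillis)
      .put(payload)
      .array()

  // Receiving side: strip the timestamp and return
  // (travel time in ms, original payload).
  def unstamp(bytes: Array[Byte], nowMillis: Long = System.currentTimeMillis()): (Long, Array[Byte]) = {
    val buf = ByteBuffer.wrap(bytes)
    val sentAt = buf.getLong()
    val payload = new Array[Byte](buf.remaining())
    buf.get(payload)
    (nowMillis - sentAt, payload)
  }
}

object TravelTimeDemo {
  def main(args: Array[String]): Unit = {
    // Fixed timestamps so the example is deterministic.
    val wire = TravelTime.stamp("hello".getBytes("UTF-8"), nowMillis = 1000L)
    val (travelMs, payload) = TravelTime.unstamp(wire, nowMillis = 1042L)
    // prints "travel time: 42 ms, payload: hello"
    println(s"travel time: $travelMs ms, payload: ${new String(payload, "UTF-8")}")
  }
}
```

In a real serializer the travel time would be logged or fed into a histogram rather than printed, so that spikes like the 30+ second ones stand out.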