
Optimization strategies for traversals in Neo4j

Let's look for optimizations to improve the performance of graph traversals in Neo4j, by modifying the code or the platform itself.

This article is also available in French as « Stratégies d’optimisation des traversées dans Neo4j ».

This article is a follow-up of “Performance of traversals in Neo4j”.

The story could have ended after my analysis of the evolution of traversal times, but I still had a real problem: users were observing degraded response times, and it was difficult to argue that this was for the greater good and better scalability in the future.

Going back to the initial problem, the cost of operations in the Neo4j core API (Node::hasLabel, for example) exploded compared to the version using a cache, and now represents a significant part of the execution time. What can be done to alleviate that cost?

Preliminary analysis

The task is probably bound by reads on disk, since we’re in a database. The best optimization would be to remove some parts of the task if they aren’t actually needed, instead of trying to make them run faster.

What is our traversal doing exactly? It visits each node in the tree: the Evaluator component checks whether the node is of type B and whether the value of its property is true, and always continues the traversal; the PathExpander component then checks whether the node is of type A or B to fetch the appropriate outgoing relationships. That means between 2 and 3 calls to Node::hasLabel per node, whether it’s of type A or B. There’s also a call to Node::getRelationships per node to expand the traversal, and a call to Node::getProperty per node of type B. It would be difficult to do less functionally, and there’s unfortunately no code to cut.

![Flowchart of the traversal per node](//images.contentful.com/95wnqgvmhlea/7jJnD0k0ggGW0KickYsw06/4764d5f96f0374fee4e5769e632ab4c6/evaluator-td-en.png)
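To make those two components concrete, here is a minimal sketch of what they could look like with the Neo4j 2.3 traversal API. The label names come from the example, while the relationship types and the property name (`flag`) are placeholders, not the actual code of the application.

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.PathExpander;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.traversal.BranchState;
import org.neo4j.graphdb.traversal.Evaluation;
import org.neo4j.graphdb.traversal.Evaluator;

/** Placeholder labels, relationship types and property name used by the sketch. */
interface Model {
    Label A = DynamicLabel.label("A");
    Label B = DynamicLabel.label("B");
    RelationshipType A_CHILD = DynamicRelationshipType.withName("A_CHILD");
    RelationshipType B_CHILD = DynamicRelationshipType.withName("B_CHILD");
    String FLAG = "flag";
}

/** Includes B nodes whose property is true, and always continues the traversal. */
class FlagEvaluator implements Evaluator {
    @Override
    public Evaluation evaluate(Path path) {
        Node node = path.endNode();
        boolean include = node.hasLabel(Model.B)                              // 1 hasLabel call per node
                && Boolean.TRUE.equals(node.getProperty(Model.FLAG, false));  // 1 getProperty call, B nodes only
        return include ? Evaluation.INCLUDE_AND_CONTINUE : Evaluation.EXCLUDE_AND_CONTINUE;
    }
}

/** Expands the outgoing relationships appropriate to the type of the current node. */
class TypeAwareExpander implements PathExpander<Object> {
    @Override
    public Iterable<Relationship> expand(Path path, BranchState<Object> state) {
        Node node = path.endNode();
        if (node.hasLabel(Model.A)) {                                         // 1 or 2 more hasLabel calls
            return node.getRelationships(Direction.OUTGOING, Model.A_CHILD);
        }
        if (node.hasLabel(Model.B)) {
            return node.getRelationships(Direction.OUTGOING, Model.B_CHILD);
        }
        return node.getRelationships(Direction.OUTGOING);                     // e.g. the root node
    }

    @Override
    public PathExpander<Object> reverse() {
        return this;
    }
}
```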

Let’s use a profiler to find out the cost distribution of the calls to Node::hasLabel during a traversal. I’m exploring incrementally, using the various modes of YourKit.

First, the CPU sampling mode, which only has a small overhead. Node::hasLabel (or more precisely, its implementation NodeProxy::hasLabel) is measured as contributing 19% to the traversal, with its own calls distributed as follows:

  • 16% of OperationsFacade::labelGetForName, to get the label index from its name
  • 84% of OperationsFacade::nodeHasLabel, which itself calls a single other method, which calls another single one, and so on with decreasing relative cost (each method has its own cost, added to the cost of what it calls). The descent ends in methods such as NodeStore::loadRecord or MuninnReadPageCursor::next, which do look like disk reads (load, read page), so it’s consistent with what we expected.

![Call tree sampled by YourKit](//images.contentful.com/95wnqgvmhlea/5yGjoDtJbUUSYySaa42wKu/02eceee922eb7b5119f9248d5d28aa3c/hasLabel-baseline-sampling-1.png)

Let’s then use the CPU tracing mode, which is more invasive but gives us the number of calls and a finer measure of the method durations. Node::hasLabel now represents 28% of the traversal (observation changes the measure), and its cost distribution is:

  • 7% of OperationsFacade::labelGetForName
  • 73% of OperationsFacade::nodeHasLabel
  • 19% of its own code
  • 1% elsewhere

It’s still data reads, but we now know that there are 3 million calls to Node::hasLabel, which matches the expected 2 to 3 calls per node.

![Call tree instrumented by YourKit, without capturing the memory allocations](//images.contentful.com/95wnqgvmhlea/1ZZhmTCdBaC0om8k2ECAWm/f0b6f23bd7513e3a37fd64ad0ae90937/hasLabel-baseline-tracing-2.png)

Finally, by setting both the CPU tracing mode and the capture of memory allocations, we get a view combining the CPU cost with the memory flow, at the cost of higher bias and execution time. Node::hasLabel now contributes 42% to the traversal:

  • 2% of OperationsFacade::labelGetForName
  • 92% of OperationsFacade::nodeHasLabel
  • 6% of its own code

![Call tree instrumented by YourKit, when capturing the memory allocations](//images.contentful.com/95wnqgvmhlea/3UNmqLhKRaWCau4yIUM2eM/8b41ddaed981239e1b2dbf10e4b4b609/hasLabel-baseline-tracing-mem.png)

Obviously, reads still constitute most of the cost. Memory allocations for the request are measured at 76 MB, distributed as follows:

  • 615,155 instances of long[], i.e. 14 MB
  • 307,591 InlineNodeLabels and as many Bits, i.e. 7 and 9 MB respectively
  • 139,822 int[], i.e. 3 MB
  • 139,818 NodeProxy, NodeProxy$2 and TraversalBranchImpl, i.e. 3, 3 and 5 MB respectively
  • 139,808 RelationshipType[], RelationshipConversion and RelationshipProxy, i.e. 3, 4 and 4 MB respectively
  • etc.

![Memory allocations captured by YourKit](//images.contentful.com/95wnqgvmhlea/6d21vGGg2k8oKIg6IIKa04/4f726a7797d0fcec80e7c8eadf7cfed7/hasLabel-baseline-mem.png)

100% of the long[] allocated during the request are actually caused by Node::hasLabel! If we can limit the number of calls to the method, there is potential for improved allocations.

Potential optimizations

Labels

With 2 to 3 calls to Node::hasLabel per node, 1 to Node::getRelationships and 0 or 1 to Node::getProperty, we can’t do much for a single request. However, inspired by an article by Max de Marzi (@maxdemarzi, Support Engineer at Neo Technology) about speeding up traversals, we can explore what becomes possible if we cache data across all requests. In the real application, we have around 10 calls to Node::hasLabel per node anyway, so there would be more opportunity to amortize the cost of a cache local to the request.

I’ll try to replace disk reads and the related allocations with another type of allocation, and see if there’s a benefit.

The simplest approach would be to cache with a composite key based on the node and the label, but it involves a lot of objects to finally get a simple boolean, and we might do better. By looking at the implementation of Node::hasLabel (subject to change, though), we can observe a simple iteration over the labels until the requested one is found: it doesn’t use a more sophisticated structure (inverted index, hash map, etc.). Therefore, if a node has *n* labels, the search for a label will statistically iterate over *n/2* labels before stopping, and 2 calls to Node::hasLabel will iterate over the equivalent of *n* labels, which is what Node::getLabels does. Of course, Node::getLabels adds an overhead since it creates the collection of labels instead of returning a simple boolean, but if Node::hasLabel is called more than twice on average, using Node::getLabels to get all the labels once and for all and store them in a structure optimized for lookup (a Set, for example) could improve the situation.

Actually, a Set is fine, but why not go even further? What’s a Label? Its API is really simple:

```java
public interface Label { String name(); }
```

It looks just like an enum, and that’s actually a well-known implementation pattern:

```java
enum Labels implements Label { A, B, Root }
```

As a Set, we could specifically use an EnumSet, which is more efficient as it’s based on the ordinal value of the enum. Continuing in that direction, a label present in the enum is equivalent to an index (its ordinal value), which can be encoded using bit masking: the *n*-th bit is set to 1 if label *n* is present on the node, and to 0 otherwise. That’s exactly what I’m going to do: since we only have 3 labels, we can do bit masking on a byte and cache the labels of a node (keyed by its Neo4j identifier, a long) in that byte.

Since we’re already using fastutil in the application to get primitive collections (which use less memory than their object counterparts), I’ll use a Long2ByteOpenHashMap for the cache.
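Here is a minimal sketch of that cache, combining the bit mask with the fastutil map. The code is illustrative rather than the example’s actual implementation, and concurrent access from several requests is deliberately left out.

```java
import it.unimi.dsi.fastutil.longs.Long2ByteOpenHashMap;

import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;

/**
 * Caches the labels of a node as a bit mask in a byte, keyed by the node id:
 * bit n is set when the label of ordinal n is present on the node.
 */
final class LabelCache {
    // -1 is a safe "not cached yet" sentinel: with 3 labels, a valid mask never sets all 8 bits
    private static final byte MISSING = -1;

    private final Long2ByteOpenHashMap masks = new Long2ByteOpenHashMap();

    LabelCache() {
        masks.defaultReturnValue(MISSING);
    }

    boolean hasLabel(Node node, Labels label) {
        byte mask = masks.get(node.getId());
        if (mask == MISSING) {
            mask = 0;
            // a single Node::getLabels call replaces the repeated Node::hasLabel calls;
            // assumes every label on the node exists in the Labels enum
            for (Label l : node.getLabels()) {
                mask |= 1 << Labels.valueOf(l.name()).ordinal();
            }
            masks.put(node.getId(), mask);
        }
        return (mask & (1 << label.ordinal())) != 0;
    }
}
```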

Properties

As properties can be of a number of types (boolean, number, string, or an array of one of these types), we can only cache the read properties in a Map<String, Object>.

However, I can use a trick based on the fact that there’s only one property per node in the example, and use Collections::singletonMap to cache the first property, before falling back to a HashMap once there are two; it’s much more compact in memory.
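A possible sketch of that trick (illustrative only; the outer map could just as well be a fastutil primitive map keyed by node id):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Caches the already-read properties of each node. The per-node map starts as a
 * Collections::singletonMap (very compact) and is promoted to a HashMap only
 * when a second property is cached for the same node.
 */
final class PropertyCache {
    private final Map<Long, Map<String, Object>> propertiesByNode = new HashMap<>();

    Object get(long nodeId, String key) {
        Map<String, Object> properties = propertiesByNode.get(nodeId);
        return properties == null ? null : properties.get(key);
    }

    void put(long nodeId, String key, Object value) {
        Map<String, Object> properties = propertiesByNode.get(nodeId);
        if (properties == null) {
            // first property for this node: immutable, compact singleton map
            propertiesByNode.put(nodeId, Collections.singletonMap(key, value));
        } else if (properties.size() == 1) {
            // second property: promote to a mutable HashMap
            Map<String, Object> promoted = new HashMap<>(properties);
            promoted.put(key, value);
            propertiesByNode.put(nodeId, promoted);
        } else {
            properties.put(key, value);
        }
    }
}
```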

Relationships

Caching relationships is much more complex; the simplest solution would be to cache only the relationship identifiers and reload them on the fly, which is not so easy to implement and offers limited potential gains. I decided not to implement it, as it’s not even a scenario that can be optimized in the application, where the same relationship is hardly ever traversed more than once.

Measuring the effects

All this is implemented in a new branch of the example, using facades to allow independent activation of the caches (via query parameters).
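As an illustration of that facade idea (the names are mine, not the example’s actual code), the traversal components can read labels through a small interface whose implementation is chosen per request from a query parameter:

```java
import org.neo4j.graphdb.Node;

// The traversal components depend on this facade instead of calling the core API directly.
interface LabelReader {
    boolean hasLabel(Node node, Labels label);
}

final class DirectLabelReader implements LabelReader {
    @Override
    public boolean hasLabel(Node node, Labels label) {
        return node.hasLabel(label);          // plain core API, no cache
    }
}

final class CachedLabelReader implements LabelReader {
    private final LabelCache cache;

    CachedLabelReader(LabelCache cache) {
        this.cache = cache;
    }

    @Override
    public boolean hasLabel(Node node, Labels label) {
        return cache.hasLabel(node, label);   // see the label cache sketch above
    }
}

// In the unmanaged extension, e.g. with a ?labelCache=true query parameter:
// LabelReader labels = labelCacheEnabled ? new CachedLabelReader(labelCache) : new DirectLabelReader();
```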

All measures are done on a Neo4j Enterprise 2.3.2 instance with a 4 GB heap, and a traversal in depth-first mode. Neo4j runs on my 15-inch, mid-2014 MacBook Pro, with a 4-core 2.2 GHz Intel Core i7 processor and 16 GB of RAM, running OS X 10.11.4 (El Capitan), and a 64-bit Oracle JVM (1.8u92).

The label cache gives a higher gain than the property cache, and combining the two improves the response times by around 30%.

| Scenario | Reference | Label cache | Property cache | Both caches |
| --- | --- | --- | --- | --- |
| Mean (ms) | 1540 | 1209 | 1387 | 1067 |
| 50% (ms) | 1547 | 1207 | 1384 | 1053 |
| 90% (ms) | 1693 | 1229 | 1427 | 1100 |
| 95% (ms) | 1716 | 1241 | 1432 | 1192 |
| Maximum (ms) | 1757 | 1284 | 1447 | 1248 |
| Used heap (MB) | 24 | 42 | 40 | 58 |
| Allocations of the request (MB) | 76 | 46 | 73 | 43 |

![Histogram of the response times for the 4 scenarios combining the caches, showing the best improvement with both active](//images.contentful.com/95wnqgvmhlea/7yhXIJu9Oguik8SQcqIMMK/3d4511c7bafd9e2a70fb21cb6372f325/optims-histo-en.png)

The effect is largely positive, with a limited increase in used memory. The allocation of the permanent caches during the first request (18 MB for the label cache, 16 MB for the property cache) is more than compensated by the reduced allocations in every request after the first one: the label cache has the biggest impact by eliminating 30 MB of allocations (about 40%), whereas the property cache only eliminates 3 MB. The memory pressure is reduced, there are fewer garbage collections, fewer pauses, etc.

![Position of the 4 scenarios by static and dynamic memory usage](//images.contentful.com/95wnqgvmhlea/1r8JwK3FsAMcEkmuKi8m4m/5531e0d947025ff7f7fc9456cd2c07bc/optims-labels-en.png)

How about not modifying code?

When you’re using a platform, another strategy is possible: take advantage of the low-level optimizations of that platform, when available. Since we migrated to version 2.3, Neo4j 3.0 has been released. It’s a new major version, with both functional and technical evolutions. Independently of the upgrade plan of our application to 3.0, let’s have a look at what it brings, through the example.

Actually, we’re based on a second platform: the JVM. I neglected that part when we migrated from version 2.1 to 2.3: the garbage collector (GC) configured by default in Neo4j changed from Concurrent Mark Sweep (CMS) to Garbage First (G1), and we also used the migration to upgrade from JDK 7 to JDK 8. However, even if G1 is supposed to supersede CMS, a lot of users are experiencing mixed results (like Elasticsearch or its users, for example), and it has seen lots of fixes up to recent versions of the JDK. We have a synthetic benchmark, not representative of a production situation, but I’ll use it anyway to explore how it’s affected by the GC algorithm.

As during the comparison between versions 2.1 and 2.3, the same jar containing the unmanaged extension is deployed on different versions of Neo4j Enterprise with the same configuration: the heap is still at 4 GB, and the traversal is done in depth-first mode, with no active cache.

If we focus on the cases using G1, a degradation could be observed with version 3.0.0-RC1 (Release Candidate 1), but the final version actually shows a slight improvement of the response times over 2.3.2. A spectacular gain can be seen starting at version 3.0.2, where response times are reduced by around 25%. Version 3.0.3 even manages to score a little better.

However, when comparing G1 and CMS, CMS gives better response times in almost all cases, with a huge difference in 2.3.2 where response times are 25% lower. Since the results have converged in 3.0.2 and 3.0.3, we can imagine that the improved performance comes partly from reduced allocations, which would decrease the weight of the garbage collector in the equation.

| Scenario | 2.3.2 G1 | 2.3.2 CMS | 3.0.0 RC1 G1 | 3.0.0 RC1 CMS | 3.0.0 G1 | 3.0.0 CMS | 3.0.1 G1 | 3.0.1 CMS | 3.0.2 G1 | 3.0.2 CMS | 3.0.3 G1 | 3.0.3 CMS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean (ms) | 1438 | 1100 | 1521 | 1461 | 1367 | 1281 | 1406 | 1252 | 980 | 1027 | 1067 | 993 |
| 50% (ms) | 1395 | 1098 | 1517 | 1429 | 1365 | 1282 | 1386 | 1251 | 980 | 992 | 1067 | 992 |
| 90% (ms) | 1596 | 1131 | 1548 | 1616 | 1384 | 1294 | 1498 | 1264 | 990 | 1125 | 1075 | 1006 |
| 95% (ms) | 1619 | 1140 | 1556 | 1654 | 1391 | 1298 | 1560 | 1268 | 994 | 1134 | 1078 | 1010 |
| Maximum (ms) | 1694 | 1168 | 1637 | 1735 | 1409 | 1323 | 1630 | 1305 | 1006 | 1163 | 1120 | 1049 |

![Histogram of the response times, by version and GC, showing a continuous improvement](//images.contentful.com/95wnqgvmhlea/1xd8tgEFeQQI2gqc4WYqO0/79de7350199022e0d6f5f7994715e324/versions-histo-en.png)

I haven’t compared the versions with a profiler yet, but obviously the engineers at Neo Technology did a good job, even on minor 3.0.x versions.

Conclusion

Based on these results, I applied a slightly different label cache to the real application:

  • it uses a larger type than byte (as we have more than 8 labels)
  • the cache is local to the HTTP request, and not permanent as in the example

Making the cache local means we don’t have to manage its invalidation, its memory footprint, or the related pauses. The result is positive nonetheless, because for a single node we can get around 10 label calls (to classify it, find which relationships to expand, filter it, etc.). The result is especially significant for allocations. With that single optimization, response times have been reduced by 23%, bringing them back to an acceptable level, even if they remain greater than the “perfect” measurement with Neo4j 2.2.

Finally, the lower allocations and the decreased GC pauses give us more stable response times, but the GC algorithm itself should not be neglected either. However, a single benchmark is not representative enough to decide that, and G1 and CMS should really be compared with real traffic on the application before making a choice.