Insights into what moves us. Contributions from the Structr team and guests. The Structr Blog

Christian Morgner
18. July 2012

Lucene vs. Cypher

Today I investigated the poor performance of certain queries in the Structr REST server when dealing with large datasets. We are using Lucene for querying the database. It turns out that one of the biggest disadvantages of NoSQL is - well - there's no SQL. There are no handy statements like LIMIT, OFFSET or ORDER BY, and no server-side query optimization. If you need to sort and page your result set, you have to do it on your own. In our case, that means we have to take all the nodes of a certain kind, sort them, and throw away 99.99% of the result set in order to return the first 10 results for page 1.

As a result (and this is really hard to understand for people who come from the SQL world), returning the first 10 elements of a collection of 100,000 is just as slow as returning ALL the elements. Of course, in our case, this involves instantiating every node in the list in order to collect the entity-specific properties, settings, converters etc., so it might not be a problem for other Neo4j-based implementations.
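To make the cost concrete, here is a minimal Python sketch of sorting and paging on the application side. It is not Structr code: the function `page_client_side`, the `instantiate` helper and the fake `records` list are all made up for illustration. The point is only that every record gets instantiated before the first page can be cut out.

```python
# Hypothetical illustration, not Structr's actual implementation.
stats = {"instantiated": 0}

def instantiate(record):
    """Stand-in for building a full entity (properties, converters, ...)."""
    stats["instantiated"] += 1
    return {"name": record}

def page_client_side(raw_records, sort_key, page, page_size):
    # Every record is instantiated and sorted, no matter which page is requested.
    entities = [instantiate(r) for r in raw_records]
    entities.sort(key=sort_key)
    start = (page - 1) * page_size
    return entities[start:start + page_size]

# 100,000 fake records, as in the example above.
records = [f"node-{i:06d}" for i in range(100_000)]
first_page = page_client_side(records, lambda e: e["name"], page=1, page_size=10)
```

Asking for page 1 with 10 entries still drives `stats["instantiated"]` up to the size of the whole collection, which is why the small page is no cheaper than the full list.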

However, the solution for today's particular problem, and I believe for many other problems in this domain, is the Cypher Query Language, as it provides nearly the same set of keywords SQL provides. You can LIMIT the size of your result set, you can ORDER the nodes BY any property, and you can SKIP an arbitrary number of elements! Plus, it is really fast, because it is much closer to the real data, integrated deeply into the Neo4j core.
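As a rough sketch of what such a paged query can look like, the Python helper below builds a Cypher statement that uses ORDER BY, SKIP and LIMIT. Everything here is an assumption for illustration: the helper name `build_page_query`, the label `User` and the property `name` are invented, and the query uses the `MATCH (n:Label)` form and `$param` placeholders of current Neo4j versions, whereas the Neo4j releases available in 2012 started queries with a START clause and wrote parameters differently.

```python
# Hypothetical helper; label and property names are examples only.
def build_page_query(label, order_by, page, page_size):
    """Return a Cypher query string plus parameters for one page of nodes."""
    # Labels and property names cannot be passed as Cypher parameters,
    # so they are checked before being placed into the query text.
    if not (label.isidentifier() and order_by.isidentifier()):
        raise ValueError("label and order_by must be plain identifiers")
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    query = (
        f"MATCH (n:{label}) "
        f"RETURN n ORDER BY n.{order_by} "
        "SKIP $skip LIMIT $limit"
    )
    params = {"skip": (page - 1) * page_size, "limit": page_size}
    return query, params

query, params = build_page_query("User", "name", page=3, page_size=10)
```

The resulting query and parameters would then be handed to whatever Neo4j client is in use, so the sorting and the skipping happen inside the database and only the 10 requested nodes come back.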

In my first comparisons, I found that replacing our Lucene-based queries with Cypher returned results as much as three times faster than before. And it is way more flexible than the Lucene engine (which, to be fair, is optimized for other things).

So if the future of Neo4j lies in the Cypher Query Language, I'm in for sure! :)