Monday, June 6, 2011

Distributed search with Solr - some notes from #BBUZZ 2011 presentation

I am in Berlin now, attending the Berlin Buzzwords 2011 conference, and I am really glad to be here. I have just come back from a presentation called "Distributed Search of Heterogeneous Collections with Solr" by Andrzej Białecki and I feel like putting down some notes (so why not post them on the blog).

First of all, I like Andrzej's presentations (this is the second time I have seen him speaking live; the first time was last year in Prague during EuroCon 2010). This time he talked about some aspects (and challenges) of distributed search with Solr. I am not a Solr user (as you can guess from the title of the presentation that I am going to give tomorrow) and I know very little about how it works internally, so Andrzej's presentation was quite a good opportunity for me to learn some bits about it.

Generally speaking, Solr is able to do distributed search, but I think it is fair to say that it lacks a lot of features that are already available in ElasticSearch out of the box. The following are some of the points that I remember from the presentation:

  • Merging scored results from different shards (especially if the documents are unevenly distributed among them) can dramatically hurt the ranking of the final results, because each shard scores its documents using only its own term statistics. There are ways to cope with this; probably the most straightforward one is to issue one query in advance to collect the needed data (like global frequencies) so that the calculation can be done correctly. As far as I understand, Solr cannot do this out of the box now (Andrzej did mention some patch; however, I do not remember if it was really related to this point). Moreover, it is questionable whether Solr can cope with changes that happen in the index between the two requests in this case (i.e. some document can be added or deleted in between). ElasticSearch, on the other hand, has search types that can do exactly this, and you can decide on a per-request basis which calculation method you want to use. And if I remember correctly (I asked about this on the mailing list some time ago), it can cope with changes in the index correctly if more than one request is needed to finish the client request.
  • Solr cannot return partial results. This means that if any of the shards fails during a distributed request, then the client does not get any data back. ElasticSearch tells you in every response how many shards failed and how many succeeded. Then it is up to the client to decide what to do with partial results.
  • Shard latency seems to be another open challenge for Solr right now. This means that the search request is as slow as the slowest shard in its cluster. As far as I understand, ElasticSearch is able to do something about this because you can specify a timeout per request and you can also change (in this case increase) the number of replicas for shards at any time. I am not sure if Solr has any of these.
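
To make the first two points more concrete, here is a minimal Python sketch (it is not from the presentation). It does not talk to Solr or ElasticSearch at all: the two shards, the document ids, the simplified scoring formula and the helper names are all made up for illustration. The first part shows how the same document can get a different score depending on which shard it lives on when each shard uses only its local statistics, and how collecting global frequencies first (the idea behind ElasticSearch's `dfs_query_then_fetch` search type) removes the difference. The second part shows how a client might decide what to do with a response that reports shard failures, assuming the `_shards` and `timed_out` fields that ElasticSearch puts in a search response.

```python
import math


def idf(doc_count, doc_freq):
    # Simplified, smoothed inverse document frequency (illustrative only,
    # not the exact formula used by Lucene).
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1.0


def doc_freq(shard, term):
    # Number of documents in the shard that contain the term.
    return sum(1 for terms in shard.values() if term in terms)


def score_shard(shard, term, doc_count, df):
    # Score = term frequency * idf, using whatever statistics are passed in.
    weight = idf(doc_count, df)
    return {
        doc_id: terms.count(term) * weight
        for doc_id, terms in shard.items()
        if term in terms
    }


def search_local(shards, term):
    # Each shard scores with its own statistics; results are merged as-is.
    merged = {}
    for shard in shards:
        merged.update(score_shard(shard, term, len(shard), doc_freq(shard, term)))
    return merged


def search_global(shards, term):
    # Phase 1: collect global statistics. Phase 2: score with them.
    total_docs = sum(len(shard) for shard in shards)
    total_df = sum(doc_freq(shard, term) for shard in shards)
    merged = {}
    for shard in shards:
        merged.update(score_shard(shard, term, total_docs, total_df))
    return merged


def shard_summary(response):
    # Inspect the shard counters of a search response (a plain dict here).
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    return {
        "partial": bool(failed > 0 or response.get("timed_out", False)),
        "failed": failed,
        "successful": shards.get("successful", 0),
    }


# Hypothetical, unevenly distributed shards: "solr" is common on shard A
# and rare on shard B. Documents a1 and b1 have identical content.
shard_a = {
    "a1": ["solr", "search"],
    "a2": ["solr", "index"],
    "a3": ["solr", "shard"],
    "a4": ["solr", "query"],
}
shard_b = {
    "b1": ["solr", "search"],
    "b2": ["lucene", "index"],
    "b3": ["lucene", "shard"],
    "b4": ["lucene", "query"],
}

if __name__ == "__main__":
    print("local :", search_local([shard_a, shard_b], "solr"))
    print("global:", search_global([shard_a, shard_b], "solr"))
    # Made-up response fragment with one failed shard out of five.
    print(shard_summary({"timed_out": False,
                         "_shards": {"total": 5, "successful": 4, "failed": 1}}))
```

With local statistics, `b1` outranks the identical `a1` only because "solr" happens to be rare on its shard; with global statistics both get the same score. The price is the extra round trip to collect the frequencies, which is presumably why it makes sense to choose the calculation method per request.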

Generally, I do not want to judge which system is better, as there can be other areas which I am not aware of where Solr could be better than ElasticSearch, but my impression is that, as far as distributed search goes, it is not.

BTW: comments are welcome (although be aware that comments do undergo moderation).