Friday, March 2, 2012

NoClassDefFoundError vs ClassNotFoundException

I recently hit a tricky NoClassDefFoundError and ended up searching the web quite intensively about it. Interestingly, I realized that many people ask about the difference between NoClassDefFoundError (NCDFE) and ClassNotFoundException (CNFE). They are confused, they fall into heated discussions on blogs, and they write many artificial Java code snippets to prove that they are right and the others are wrong. But in the end I think they are missing the important point about these two exceptional states in Java.

So let me contribute to this discussion today.

To me the most important point is that one of them is an Error and the other is an Exception. According to the JavaDoc, an Error indicates a serious problem that a reasonable application should not try to catch, whereas an Exception indicates a condition that a reasonable application might want to catch and recover from.

Given this, if you want to sum up the difference between NCDFE and CNFE in short, I would say that CNFE (the Exception) is thrown when there might be a reason to recover from the situation. For example, you try to load a particular class from the classpath by name but it is not found, so you might want to consider loading a different class instead.
By contrast, NCDFE (the Error) is thrown when you have no chance to recover because it is too late: you are directly trying to make use of a class that is not available (as the JavaDoc says, as part of a normal method call or of creating a new instance with a new expression). And this says nothing about whether the class is on the classpath or not (it can be available but fail to load, for example because of a broken static initializer).
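
To make the two situations concrete, here is a minimal sketch rather than a definitive reference. The class name ClassLoadingDemo, the helpers loadFirstAvailable and touchBroken, and the made-up class name com.example.DoesNotExist are all assumptions introduced for illustration. The first helper recovers from CNFE by trying another class; the second touches a class that is present but whose static initializer fails.

```java
public class ClassLoadingDemo {

    // Returns the first class that can be loaded by name. Each failed lookup
    // raises ClassNotFoundException, which we catch and recover from.
    static Class<?> loadFirstAvailable(String... classNames) {
        for (String name : classNames) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // Recoverable: simply try the next candidate.
            }
        }
        throw new IllegalArgumentException("None of the candidate classes could be loaded");
    }

    // A class that IS available, but whose static initializer always fails.
    static class Broken {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("static initialization failed");
        }
    }

    // Touches Broken and reports which Error the JVM raised, if any.
    static String touchBroken() {
        try {
            return "initialized, VALUE=" + Broken.VALUE;
        } catch (LinkageError e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // "com.example.DoesNotExist" is a made-up name that should not resolve.
        Class<?> c = loadFirstAvailable("com.example.DoesNotExist", "java.util.ArrayList");
        System.out.println(c.getName());   // java.util.ArrayList
        System.out.println(touchBroken()); // ExceptionInInitializerError
        System.out.println(touchBroken()); // NoClassDefFoundError
    }
}
```

Note that the first use of the broken class reports ExceptionInInitializerError, and only later uses report NoClassDefFoundError, even though the class file was found without any trouble.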

In my particular case the class was NOT located on the classpath (you may find the -verbose:class option useful in such a situation). Maybe you ask why I did not get a CNFE first instead. Well, that is a good question, and one possible explanation is that such an Exception could have been swallowed silently earlier in the code.

This is by no means a detailed analysis of the differences between NCDFE and CNFE, but in all the discussions on the web I was simply missing such a high-level point of view. And in some cases it is important to step back and see things from a distance.

Comments welcome.

Monday, June 6, 2011

Distributed search with Solr - some notes from #BBUZZ 2011 presentation

I am in Berlin now attending the Berlin Buzzwords 2011 conference and I am really glad to be here. I have just come back from a presentation called "Distribute Search of Heterogeneous Collections with Solr" by Andrzej Białecki and I feel like putting down some notes (so why not post them on the blog).

First of all, I like Andrzej's presentations (this is the second time I have seen him speaking live; the first time was last year in Prague during EuroCon 2010). This time he talked about some aspects (and challenges) of distributed search with Solr. I am not a Solr user (as you can guess from the title of the presentation that I am going to give tomorrow) and I know very little about how it works internally, so Andrzej's presentation was quite a good opportunity for me to learn some bits about it.

Generally speaking, Solr is able to do distributed search but I think that it is fair to say that it is lacking a lot of features that are already available in ElasticSearch out of the box. The following are some of them that I remember from the presentation:

  • Merging scored results from different shards (especially if the documents are unevenly distributed among them) can dramatically harm the ranking of the final results. There are ways to cope with this; probably the most straightforward one is to issue one query in advance to collect the needed data (like global frequencies) so that the calculation can be done correctly. As far as I understand, Solr cannot do this out of the box now (Andrzej did mention some patch, however, I do not remember if it was really related to this point). Moreover, it is questionable whether Solr can cope with changes that happen in the index between the two requests in this case (i.e. some document can be added or deleted in between). ElasticSearch, on the other hand, has search types that can do exactly this, and you can decide on a per-request basis which calculation method you want to use. And if I remember correctly (I asked about this on the mailing list some time ago), it can cope correctly with changes in the index if more than one request is needed to finish the client request.
  • Solr cannot return partial results. This means that if any of the shards fails during a distributed request, the client does not get any data back. ElasticSearch tells you how many shards failed and how many succeeded in every response. Then it is up to the client to decide what to do with partial results.
  • Shard latency seems to be another open challenge for Solr right now. This means that the search request is as slow as the slowest shard in the cluster. As far as I understand, ElasticSearch is able to do something about this because you can specify a timeout per request and you can also change (in this case increase) the number of replicas for shards at any time. Not sure if Solr has any of these.
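
The first point in the list above can be illustrated with a small sketch. It is neither Solr's nor ElasticSearch's actual implementation: the class GlobalIdfSketch, its method names and the shard numbers are made up for illustration, and the smoothed IDF formula is only assumed to be in the style of Lucene's classic similarity. Each shard is described by two numbers, its document count and the number of its documents that contain the query term.

```java
public class GlobalIdfSketch {

    // Smoothed inverse document frequency: ln(N / (df + 1)) + 1.
    static double idf(long docCount, long docFreq) {
        return Math.log((double) docCount / (docFreq + 1)) + 1.0;
    }

    // Sums the statistics of all shards first, then computes one IDF value.
    // Each entry of shards is {docCount, docFreq} for a single shard.
    static double globalIdf(long[][] shards) {
        long totalDocs = 0;
        long totalDocFreq = 0;
        for (long[] shard : shards) {
            totalDocs += shard[0];
            totalDocFreq += shard[1];
        }
        return idf(totalDocs, totalDocFreq);
    }

    public static void main(String[] args) {
        // Hypothetical cluster: the term is common on shard 0 and rare on shard 1.
        long[][] shards = { {1000, 500}, {1000, 5} };
        for (int i = 0; i < shards.length; i++) {
            System.out.printf("shard %d local IDF: %.3f%n", i, idf(shards[i][0], shards[i][1]));
        }
        System.out.printf("global IDF: %.3f%n", globalIdf(shards));
    }
}
```

With local statistics the same term gets a very different weight on each shard, so scores coming from different shards are not directly comparable when merged. Collecting the frequencies in an extra round trip gives every shard the same weight for the term, at the cost of one more request.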

Generally, I do not want to judge which system is better, as there can be other areas that I am not aware of where Solr could be better than ElasticSearch, but my impression is that, as far as distributed search goes, it is not.

BTW: comments are welcome (although be aware that comments do undergo moderation).

Tuesday, August 3, 2010

How to upload .OGV video file to Vimeo

Vimeo does not support the .ogv video file format. So when I wanted to upload a video file produced by gtk-recordMyDesktop to Vimeo, I had to find a way to convert it into a format that is supported. The following is a link to the blog post that saved my day:

If the above link does not work then the magic command follows:

mencoder -idx input.ogv -ovc lavc -oac mp3lame -o output.avi

Tuesday, June 8, 2010

Search, Network, Analytics

... SNA in short. It is a bunch of interesting projects from the LinkedIn R&D team related to full-text search (Lucene), distributed systems (Hadoop, ZooKeeper, ...), analytics and other cool topics. They can be found here:

Monday, April 12, 2010

Google not identifying pages with the same content correctly?

Although identifying web pages with the same content has been a target of active research for many years, it seems that even Google has trouble dealing with this problem correctly.
I am trying to build GWT from source now, and for some reason the simple procedure described here: does not work for me. So I started googling for "building gwt from source", but it turns out that the first page of results contains more than four hits pointing to the same mail thread (or even to the same individual emails in this thread). In other words, half of the top 10 results point to the same content.

The following are the hits from top 10 results:

Not sure this is what I would expect!

Yes, extracting content from public email lists can be very hard, especially if the indexer does not have access to the source of the text but only to the various HTML representations of the same message. But I thought that there were techniques for dealing with document similarity, e.g. MinHash (see Duplicate Detection).
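
For readers who have not met MinHash, the idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Google or any particular library does it: the class MinHashSketch, the signature length of 128, the word-pair shingles and the simple linear hash family are all choices made here for brevity. Each document becomes a set of shingles, the signature keeps the minimum hash value per hash function, and the fraction of matching signature positions estimates how much the two sets overlap.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {
    private static final int NUM_HASHES = 128;          // signature length (assumed value)
    private static final int[] A = new int[NUM_HASHES]; // odd multipliers
    private static final int[] B = new int[NUM_HASHES]; // offsets

    static {
        Random rnd = new Random(42); // fixed seed so signatures are reproducible
        for (int i = 0; i < NUM_HASHES; i++) {
            A[i] = rnd.nextInt() | 1; // odd, so x -> A*x + B maps distinct ints to distinct ints
            B[i] = rnd.nextInt();
        }
    }

    // Breaks a text into overlapping pairs of adjacent words ("shingles").
    static Set<String> shingles(String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    // For every hash function, keeps the smallest hash value seen over all shingles.
    static int[] signature(Set<String> shingles) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles) {
            int base = s.hashCode();
            for (int i = 0; i < NUM_HASHES; i++) {
                int h = A[i] * base + B[i]; // int overflow is intended here
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of positions where both signatures agree; estimates set overlap.
    static double estimateSimilarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < NUM_HASHES; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (double) same / NUM_HASHES;
    }

    public static void main(String[] args) {
        int[] first = signature(shingles("how to build gwt from source on linux"));
        int[] second = signature(shingles("how to build gwt from source on mac"));
        int[] other = signature(shingles("apache lucene conference takes place in prague"));
        System.out.println("near duplicate: " + estimateSimilarity(first, second));
        System.out.println("unrelated:      " + estimateSimilarity(first, other));
    }
}
```

Two copies of the same email wrapped in different HTML would share most of their shingles after the markup is stripped, so their signatures would agree in most positions, while unrelated pages would agree in almost none.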

Thursday, April 8, 2010

Apache Lucene EuroCon 2010 - Lucene in Prague!

Apache Lucene EuroCon 2010 will be the first dedicated Lucene and Solr user conference in Europe. And not only Lucene and Solr but also other projects from the Lucene umbrella, like Nutch, Tika, Mahout, etc. This will be the ultimate opportunity to meet some of the Lucene ecosystem committers.

Prague, May 18-21, 2010.

More info: