Tuesday, August 3, 2010

How to upload .OGV video file to Vimeo

Vimeo does not support the .ogv video file format. So when I wanted to upload a video produced by gtk-recordMyDesktop to Vimeo, I had to find a way to convert it into a supported format. The following is a link to the blog post that saved my day:

If the above link does not work then the magic command follows:

mencoder -idx input.ogv -ovc lavc -oac mp3lame -o output.avi
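
If you convert recordings often, the command above can be wrapped in a small script. The following is only a minimal sketch: it assumes mencoder is installed and on the PATH, and the function names build_mencoder_command and convert are my own, not part of any tool.

```python
import shutil
import subprocess


def build_mencoder_command(src, dst):
    """Build the argument list for the mencoder call quoted above."""
    return ["mencoder", "-idx", src, "-ovc", "lavc", "-oac", "mp3lame", "-o", dst]


def convert(src, dst):
    """Run the conversion; raises if mencoder is missing or exits with an error."""
    if shutil.which("mencoder") is None:
        raise FileNotFoundError("mencoder not found on PATH")
    # check=True turns a non-zero exit status into CalledProcessError
    subprocess.run(build_mencoder_command(src, dst), check=True)
```

Calling convert("input.ogv", "output.avi") should then be equivalent to typing the command by hand.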

Tuesday, June 8, 2010

Search, Network, Analytics

... SNA for short. It is a collection of interesting projects from the LinkedIn R&D team related to full-text search (Lucene), distributed systems (Hadoop, ZooKeeper, ...), analytics and other cool topics. It can be found here: http://sna-projects.com/sna/

Monday, April 12, 2010

Google not identifying pages with the same content correctly?

Although identifying web pages with the same content has been the target of active research for many years now, it seems that even Google has trouble handling this problem correctly.
I am trying to build GWT from source now, and for some reason the simple procedure described here: http://code.google.com/webtoolkit/makinggwtbetter.html#compiling does not work for me. So I set about googling for building gwt from source, but it turns out that the first page of results contains more than four hits pointing to the same mail thread (or even to the same individual emails in this thread). In other words, half of the top 10 results point to the same content.

The following are the hits from the top 10 results:

Not sure this is what I would expect!

Yes, extracting content from public email lists can be very hard, especially if the indexer has access only to the various HTML representations of a message and not to the source of the text. But I thought there were techniques for dealing with document similarity, e.g. MinHash (see Duplicate Detection).
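
To make the MinHash idea more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions (3-word shingles, 64 hash functions derived from blake2b with a seed prefix, hypothetical function names) and says nothing about how any search engine actually does it.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    # One hash function per seed: the seed is mixed into the hashed bytes.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each hash function keep the smallest hash value over all shingles."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two renderings of the same email that differ only in a few surrounding words share most of their shingles, so their signatures agree in most positions, while unrelated pages agree in almost none. An indexer could treat pages above some similarity threshold as duplicates.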

Thursday, April 8, 2010

Apache Lucene EuroCon 2010 - Lucene in Prague!

Apache Lucene EuroCon 2010 will be the first dedicated Lucene and Solr user conference in Europe. And not only Lucene and Solr, but also other projects under the Lucene umbrella such as Nutch, Tika, Mahout, etc. This will be the ultimate opportunity to meet some of the Lucene ecosystem committers.

Prague, May 18-21, 2010.

More info: http://www.lucene-eurocon.org/

Sunday, March 7, 2010

Financial literacy - literature for schools

Note: This post was originally written in Czech.

I found a very nice methodological handbook on financial literacy for teachers. The whole manual is freely available on the web under a CC license.

Lukáš Hula: Finanční gramotnost - úlohy a metodika (Financial Literacy - Exercises and Methodology)
http://www.ceskaskola.cz/2010/01/lukas-hula-financni-gramotnost-ulohy.html

Friday, February 26, 2010

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer have released a full draft of their manuscript[1] about "... MapReduce algorithm design, particularly for text processing applications."

Although it is not explicitly about Hadoop programming, it uses Hadoop for algorithm design. This text is a very good counterpart to Tom White's book.

[1] http://www.umiacs.umd.edu/~jimmylin/book.html

Thursday, February 25, 2010

SVNSearch

SVNSearch.org is an interesting tool for searching SVN repositories. It can also provide interesting statistics. You can find many projects from JBoss, SUN, Apache, SpringSource, etc. there. Check the following blog post to learn how SVNSearch can be used: http://simplericity.com/2008/08/16/1218895920000.html

Saturday, February 20, 2010

Ruth Sanderson

Ruth Sanderson's work is really inspiring! http://www.goldenwoodstudio.com/

She has been illustrating books for over 30 years and has developed a respectable technique. Check out the step-by-step demonstration of her work: http://www.goldenwoodstudio.com/index.php?page=artist-at-work (also check the time it took her to do some of these illustrations: more than a month for one high-quality piece).

Found via lines and colors: http://www.linesandcolors.com/2010/02/19/ruth-sanderson/

Wednesday, January 27, 2010

JBoss Snowdrop Lightning Talk Uploaded

I uploaded my JBoss Snowdrop lightning talk to SlideShare: http://slidesha.re/dh3nPu

Friday, January 22, 2010

Google Wave? Yes, finally I got the point!

I like Google Wave. Believe it or not, I really got the point! (Or at least one of the points.)

When I started at Red Hat / JBoss two months ago, I soon realized that this company is different from other companies in many ways. One important aspect of this difference is that there are a lot of very open internal communication channels. One of the core communication technologies is "a good old mailing list" (there are a lot of internal mailing lists). Using this technology is typical for the open source community, so this is really nothing new. The problem is how to keep up with the information in mail threads when the communication gets very intensive and spontaneous. I can tell you that there are specific topics which are guaranteed to generate a lot of traffic on our internal mailing lists (for example Subject: VMware to acquire SpringSource or Subject: Bottom posting sucks! - see below regarding bottom posting).

The problem is that there are no rules for how to format text in a response to a particular email in a mail thread. Some people prefer top posting, some prefer bottom posting, and some reply inline, cutting out some of the original text as well. Then, depending on email client settings, the original text may be prefixed with '>' or just shown in a different color, etc. Even if you force people to follow some rules and use the same email client (which is not the case at Red Hat), you still have to train every new employee, which can be difficult and inefficient. So you can imagine that the information can get messy very quickly. And the best solution to this problem I have seen so far is Google Wave!

I think the reason Google developed and released Wave is that they face the same problem internally, so they decided to fix it not by regulation (i.e. establishing rules for communication participants) but by innovation (i.e. technological improvement of the communication channel). I can also now understand why they called it Wave (it really can deal with waves of information).

I think it will take some time for more people to realize that Wave has big advantages. We are still too used to using email in the traditional way (yes, email is good, but it has limits!). I wouldn't be surprised if the open source community were one of the first adopters of Wave.

Email allows you to send a message but Wave allows more complex communication. Wave is the future.

Thursday, January 21, 2010

JBoss Snowdrop Lightning Talk

I am working on a lightning talk about JBoss Snowdrop. I will be presenting it at the upcoming CZJUG: http://java.cz/article/czjug-leden-lightning-talks

You are welcome to send me questions. Just send me an email (see my LinkedIn profile) or leave a comment on this blog post and I will try to address them in the presentation.

I plan to prepare the slides in English, but the talk will probably be in Czech.

Tuesday, January 12, 2010

Solr 1.4 Enterprise Search Server and other Lucene related books

I am currently reading a book called Solr 1.4 Enterprise Search Server. I plan to seriously dive into Solr soon, and as of now this is the first and only book about Solr on the market (so one can say it is both the best and the worst book about Solr right now). I am not going to review the book myself, as there are other (more valuable) reviews and related sources. The following are some of them:

Probably the best review of the book so far (very detailed) was published by Erik Hatcher on the Lucid Imagination blog: http://www.lucidimagination.com/blog/2010/01/11/book-review-solr-packt-book/

I did not know that there is an interesting Apache wiki page, PacktBook2009, listing content the book missed.

The most important review of the book for me was Grant Ingersoll's post on his blog: http://lucene.grantingersoll.com/2009/09/24/review-of-solr-1-4-enterprise-search-server-part-1/ It is by no means a detailed review, and moreover it is only the first part (will the second part ever be posted?), but it made me want this book.


How about future Solr books?

I think that Otis Gospodnetić (sematext.com) is working on his Solr in Action book. Given that Otis is an experienced book author, I expect this will be a MUST for every Solr fan.


How about other interesting search related books?

I am looking forward to several interesting books being finished, started or at least announced:

Taming Text
(this book has been in hibernation for some time - it seems that Grant is very busy)

Mahout in Action (I still don't understand why the scope of this book does not cover the full Mahout API. This leaves the book incomplete.)

Lucene in Depth (it seems that Jake Mannix is working on a new book about Lucene: an advanced-topics book to go beyond Hatcher, Gospodnetic and McCandless' awesome introductory text, Lucene in Action. This sounds cool!)