17 Sep 2010 @ 8:38 AM 

Have you ever built a search using a SQL LIKE clause, only to have your users complain about how it works? A simple SQL-based search doesn’t handle synonyms, misspellings, prefixes, suffixes, result ranking, weighting, and so on. Fret no longer: with a little more time you can build a “smart” search using Lucene, get these features, and still be able to tweak the search as much as you like.

Lucene.NET is a direct port of the popular open source Java Lucene project. Large companies such as EMC and Cisco have placed bets on Lucene and embedded the library within some of their products. The .NET version is a little behind the Java version in terms of features and releases, but by and large the library is very usable. Lucene can index just about any type of content, including files, database records, and web pages, and it fits any number of architectural scenarios: search in an ASP.NET web site, search within a desktop app, search exposed as a web service or Windows service, etc.

In the simplest search scenario, architecturally, you have to build an Indexer and a Searcher. You can think of Lucene as a set of tools that does most of the work of building these components for you: you use Lucene to build an index and dump your searchable content into it, and you tell Lucene how to search the index that you’ve built. Conceptually, the index is built out of the content that you want to search, whether it be files or database records. If you change the content you want to search on (for example, you’ve added a new file), then you have to either append that content to your index or rebuild your index. One strategy is to set up a scheduled process (e.g. using Quartz.NET, a Windows service, or a scheduled task) to periodically re-index your content.

Adding Lucene to your project

First things first: you have to add the Lucene libraries to your project. On the Lucene.NET web site, you’ll see the most recent release builds of Lucene.NET. These are two years old. Do not grab them; they have some bugs. There has not been an official release of Lucene.NET for some time, probably due to resource constraints of the maintainers. Use Subversion (or TortoiseSVN) to browse around and grab the most recently updated Lucene.NET code from the Apache SVN Repository. The solution and projects target Visual Studio 2005 and .NET 2.0, but I upgraded the projects to Visual Studio 2008 and built the solution without any issues or errors. Go to the bin directory, grab the Lucene.Net dll, and add it as a reference in your project.

Building the Index

Step two is building your searchable index. A Lucene index is usually stored as a set of files on the file system, but can also be stored in memory for performance – and there are even proof of concept projects available that allow you to store the index in a database (though I’m not sure why you would).

A couple of Lucene concepts/classes you should be aware of for indexing include Documents, Fields, Analyzers, and the IndexWriter. Documents are what you put into your index. They’re not “documents” in the traditional sense, like a Word document – rather, a Document is just an abstraction of an indexable piece of content. It is your responsibility to create the Document objects to place into your Index.

For example, let’s say we’re creating a product search, using Product objects pulled from our database. Our searches will be based on the Product Name.

        //Requires: using Lucene.Net.Documents;
        public class Product
        {
            public Product() { }

            public string ProductName { get; set; }
            public decimal Price { get; set; }
            public string Color { get; set; }
            public string Id { get; set; }

            //return a Lucene document for the product
            public Document GetDocument()
            {
                Document document = new Document();

                //searchable, but the original text is not stored in the index
                document.Add(new Field("ProductName", this.ProductName, Field.Store.NO, Field.Index.ANALYZED));

                //stored so it comes back with each search hit, but not searchable
                document.Add(new Field("Id", this.Id, Field.Store.YES, Field.Index.NO));

                return document;
            }
        }


We’ll add fields to our Document to represent the values we want to search on or store in our index. Field.Store.YES/NO indicates whether or not we want to store the field’s original value in the index. Note how I don’t store the Price or Color columns: we don’t want to keep complete objects in Lucene, since it’s just our search index. Keep the complete objects in your database (or keep your files on the file system, etc.). We do want to store the Id, because when we get our search result documents back from querying the index, only the stored fields are returned. We need to at least know the Product Id so we can fetch the full objects that match our search results from the database. There is also a COMPRESS option that you can use if you need to store large fields or binary data.

Field.Index.ANALYZED/NO indicates whether or not we want to index the field, that is, make it searchable. Indexing a field costs some processing time and index space, so we don’t want to index every field; only index what you want to search on. Thus we don’t index the Product Id, Color, or Price, only the Name, because that’s all we want to search on.

Next, we’ll create the index and add the documents to it. Below is an example of a very simple class with a single method that we can use to build our Product search index using a given list of products.

 

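        //Requires: using System.Collections.Generic; using Lucene.Net.Analysis;
        //using Lucene.Net.Analysis.Standard; using Lucene.Net.Index; using Lucene.Net.Store;
        public class Index
        {
            public void BuildIndex(List<Product> products)
            {
                //where the index files live on disk
                FSDirectory directory = FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\"));

                Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

                //third parameter true = recreate the index from scratch rather than append to it
                IndexWriter indexWriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

                foreach (Product product in products)
                {
                    indexWriter.AddDocument(product.GetDocument());
                }

                //optional, see the notes on Optimize below
                indexWriter.Optimize();
                indexWriter.Close();
            }
        }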


The FSDirectory is just an abstraction of the storage of the index, and there are “directory” classes that represent in-memory storage, etc. that you can use as well. You can pass a DirectoryInfo object to the Open method to specify where to store the search index.
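To make the “directory” abstraction concrete, here is a minimal sketch that swaps in the in-memory RAMDirectory and runs a full index-then-search round trip. It assumes the same Lucene.NET 2.9-era API used in the rest of this post; InMemoryIndexSketch, FindBestMatchId, CreateDocument, and the two sample products are names and data I made up for illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper class, not part of Lucene.NET.
public static class InMemoryIndexSketch
{
    // Indexes two sample products in memory and returns the stored Id of the
    // best match for searchText, or null when nothing matches.
    public static string FindBestMatchId(string searchText)
    {
        // RAMDirectory keeps the whole index in memory; nothing is written to disk.
        RAMDirectory directory = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        writer.AddDocument(CreateDocument("1", "Red Bicycle"));
        writer.AddDocument(CreateDocument("2", "Blue Kettle"));
        writer.Close();

        IndexReader reader = IndexReader.Open(directory, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        try
        {
            QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "ProductName", analyzer);
            Query query = parser.Parse(searchText);

            TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
            searcher.Search(query, collector);
            ScoreDoc[] hits = collector.TopDocs().scoreDocs;
            if (hits.Length == 0)
            {
                return null;
            }

            // Only stored fields come back: "Id" is available here, while
            // "ProductName" (indexed with Field.Store.NO) would be null.
            Document hit = searcher.Doc(hits[0].doc);
            return hit.Get("Id");
        }
        finally
        {
            searcher.Close();
            reader.Close();
        }
    }

    private static Document CreateDocument(string id, string productName)
    {
        Document document = new Document();
        document.Add(new Field("ProductName", productName, Field.Store.NO, Field.Index.ANALYZED));
        document.Add(new Field("Id", id, Field.Store.YES, Field.Index.NO));
        return document;
    }
}
```

With this sketch, InMemoryIndexSketch.FindBestMatchId("bicycle") returns "1", because the StandardAnalyzer lower-cases both the indexed text and the query. An in-memory index like this is handy for unit tests and small data sets, but it is gone when the process exits.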

The Analyzer’s job is to break your text down into the terms (tokens) that actually get indexed. There are a number of different Analyzers implemented in Lucene, but the StandardAnalyzer is the most straightforward. The StandardAnalyzer will do a few things to your text, including removing junk search terms (aka “stop words”) and punctuation, and normalizing the case of your text. There are a number of constructors available for the StandardAnalyzer, and you can specify your own stop words if you like, but there is a list of common stop words built into Lucene. There is another good analyzer available called the SnowballAnalyzer, which reduces words to their stems (mostly by stripping suffixes such as plurals and verb endings) and can greatly improve your search results. The SnowballAnalyzer is a separate Lucene project that lives outside of the main source code; it can be found under the contrib folder in the Lucene source (not in the main Lucene.Net solution). Build it yourself and include it in your project if you would prefer to use it instead of the StandardAnalyzer.

The IndexWriter is responsible for creating the index. The IndexWriter is thread safe, and an index can be rebuilt while it is being read from without you having to manage the locking of the index files; Lucene takes care of that for you. A boolean parameter on the constructor indicates whether to recreate the index (true) or append to an existing one (false). Simply call the AddDocument method on the IndexWriter to write documents to the index. When you’re finished writing documents to the index, you must call the Close method. Optionally, you can call the Optimize method before closing the index, which merges the index files into a more compact form that is usually smaller and faster to search. However, this can take a while on a large index, so you may not want to call Optimize if you have indexing performance concerns.
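As a rough sketch of the “append” side of that boolean parameter, the helper below opens an existing index with create set to false and adds one document to it. It assumes the Lucene.NET 2.9-era API used in this post; IndexAppendSketch and AppendDocument are hypothetical names, and I take the abstract Directory type as a parameter only so the same code works for a file system or in-memory index.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical helper class, not part of Lucene.NET.
public static class IndexAppendSketch
{
    // Adds one document to an index that already exists in the given directory.
    public static void AppendDocument(Directory directory, Document document)
    {
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        // create == false: open the existing index and add to it instead of wiping it.
        // This throws if no index has been created in the directory yet.
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            indexWriter.AddDocument(document);
        }
        finally
        {
            // Close commits the change and releases the write lock,
            // even if AddDocument throws.
            indexWriter.Close();
        }
    }
}
```

You might call it as IndexAppendSketch.AppendDocument(FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\")), product.GetDocument()). Keep in mind that appending a product that is already in the index gives you two documents for it, so a real indexer needs some way to remove or replace the old one first.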

Now that we have the Index built, we can move on to actually searching the index…

Searching the Index

Below is an example method that you could use to search your newly created product search index; you could add it to your Index class. You’ll see a few of the same classes from the indexing sample being used in the search method. As in the previous example, you’ll use the FSDirectory class to specify where the index is located. Then, you’ll need to create an IndexReader, passing in your directory object. The second parameter of IndexReader.Open specifies whether or not to open the index in read-only mode; for our simple purposes, we only need to read from the index. One thing to note about the IndexReader is that it is fairly expensive to create, so you don’t want to create one for every search in a web application, for example. Create a single IndexReader, perhaps as a singleton or by caching the object, and re-use it. Next, we need an IndexSearcher to actually search our index, which is fairly straightforward.

When searching, the search queries must be parsed and tokenized in the same way that the data was parsed when it was placed into the index. Because of this, the same type of Analyzer that was used to create the index must also be used to parse the search queries. If a StandardAnalyzer is used to create the index, a StandardAnalyzer must also be used to parse search queries against the index. The QueryParser parses the query text against the field that is going to be searched; as you can see in the QueryParser constructor, we’ll be searching against the “ProductName” field from our documents. After that, simply call the Parse method on the QueryParser to get the Query that we’ll pass to the searcher. If you want to search on multiple fields, say the Product Name and the Color, you can use the MultiFieldQueryParser class instead. With the MultiFieldQueryParser, you can even do some clever things like weighting fields differently, e.g. if I wanted product name matches to rank higher than color matches.
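Here is a minimal sketch of the multi-field variation of the parsing step. It assumes the Lucene.NET 2.9-era API and, importantly, that a “Color” field has been added to each document with Field.Index.ANALYZED (the Product example in this post does not index Color). MultiFieldQuerySketch and BuildQuery are hypothetical names.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Hypothetical helper class, not part of Lucene.NET.
public static class MultiFieldQuerySketch
{
    // Builds a query that matches searchText in either ProductName or Color.
    public static Query BuildQuery(string searchText)
    {
        // Must be the same kind of analyzer that was used to build the index.
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        // Assumption: both fields were indexed when the documents were created.
        string[] fields = new string[] { "ProductName", "Color" };

        MultiFieldQueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, fields, analyzer);

        // Parse throws a ParseException when the text is not valid query syntax.
        return parser.Parse(searchText);
    }
}
```

The resulting Query is passed to the searcher exactly like the single-field one. For the weighting mentioned above, the MultiFieldQueryParser also has a constructor overload that takes a dictionary of per-field boosts; I have left it out here because its parameter type differs between Lucene.NET builds, so check the signature in the version you compiled.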

Next, we’ll create a collector that defines how the search results are collected from the searcher; we’ll use a TopScoreDocCollector. The first parameter is the maximum number of results to keep. The second parameter (docsScoredInOrder) tells the collector whether documents will be handed to it in index order. It does not change the ranking: a TopScoreDocCollector always returns its hits sorted by relevance score, best match first, which is what we want to show our customers. Passing true works for our purposes. From there, call the Search method on the searcher, passing in the query and the collector, then ask the collector for its TopDocs to get a collection of scored matches. For each match, you can call the Doc method on the searcher to retrieve the Document that was placed in the index originally (only its stored fields come back). After I’ve collected the Product Ids from the search result documents, I go back and fetch the full Product objects from the database. Depending on which fields you choose to store in your Lucene index, you may not need that second fetch: if you store just enough data to display the search results, you can skip the trip to the database when rendering the results page.

        //Requires: using System.Collections.Generic; using Lucene.Net.Analysis;
        //using Lucene.Net.Analysis.Standard; using Lucene.Net.Documents; using Lucene.Net.Index;
        //using Lucene.Net.QueryParsers; using Lucene.Net.Search; using Lucene.Net.Store;
        public List<Product> SearchProductName(string productName)
        {
            FSDirectory directory = FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\"));

            //Opened on every call only to keep the example short; re-use the reader in a real application.
            IndexReader reader = IndexReader.Open(directory, true);
            Searcher searcher = new IndexSearcher(reader);

            //Must be the same type of analyzer that was used to build the index.
            Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
            QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "ProductName", analyzer);
            Query query = parser.Parse(productName);

            //Keep at most the 100 best-scoring hits.
            TopScoreDocCollector collector = TopScoreDocCollector.create(100, true);
            searcher.Search(query, collector);
            ScoreDoc[] hits = collector.TopDocs().scoreDocs;

            List<int> productIds = new List<int>();

            foreach (ScoreDoc scoreDoc in hits)
            {
                //Get the document that represents the search result.
                Document document = searcher.Doc(scoreDoc.doc);
                int productId = int.Parse(document.Get("Id"));

                //The same document can be returned multiple times within the search results
                //(in practice, when the same product was added to the index more than once), so skip duplicates.
                if (!productIds.Contains(productId))
                {
                    productIds.Add(productId);
                }
            }

            //Now that we have the product Ids representing our search results, retrieve the products from the database.
            List<Product> products = ProductDAO.GetProductsByIds(productIds);

            searcher.Close();
            reader.Close();
            analyzer.Close();

            return products;
        }
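The advice about re-using the IndexReader could be sketched like this. It is only one possible approach, assuming the Lucene.NET 2.9-era API used in this post; SharedSearcher, Get, and Reset are names I made up, and the locking is deliberately simple.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical helper class, not part of Lucene.NET.
// Keeps one read-only IndexReader/IndexSearcher pair open and hands it out to callers.
public class SharedSearcher
{
    private readonly object sync = new object();
    private readonly Lucene.Net.Store.Directory directory;
    private IndexReader reader;
    private IndexSearcher searcher;

    public SharedSearcher(Lucene.Net.Store.Directory directory)
    {
        this.directory = directory;
    }

    // Returns the shared searcher, opening the index on first use.
    public IndexSearcher Get()
    {
        lock (sync)
        {
            if (searcher == null)
            {
                reader = IndexReader.Open(directory, true);
                searcher = new IndexSearcher(reader);
            }
            return searcher;
        }
    }

    // Closes the current pair so that the next Get() re-opens the index.
    // Call this after the index has been rebuilt.
    public void Reset()
    {
        lock (sync)
        {
            if (searcher != null)
            {
                searcher.Close();
                reader.Close();
                searcher = null;
                reader = null;
            }
        }
    }
}
```

In a web application you might hold a single static instance, for example new SharedSearcher(FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\"))), and have SearchProductName call Get() instead of opening and closing its own reader. An open reader only sees the index as it was when the reader was opened, which is why something like Reset is needed after re-indexing. This sketch does not protect searches that are still running on the old searcher when Reset is called, so a production version needs more care there.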


Again, keep in mind this is only an example method. The examples above are based around searching rows that live in a database, but they could easily be adapted to searching through a directory of files or through indexed web pages. The Lucene class structure seems highly abstracted to me, but that is what allows for so much flexibility. Search is a finicky thing, and you’ll always run into scenarios where your client doesn’t like the way the search works. That’s fine, because Lucene gives you the flexibility to change how the search works.

Posted By: admin
Last Edit: 03 May 2012 @ 11:58 AM


Responses to this post » (19 Total)

 
  1. This was very helpful, thank you for sharing.

  2. Nice article, thank you! Could you write more about searching rows in database?

  3. Getting started with Lucene.NET…

    Thank you for submitting this cool story – Trackback from DotNetShoutout…

  4. John says:

    Slava,

    The code for searching database rows would be very similar. Rather than iterating through “products” and adding them into your Index, you would retrieve rows from your database and index them – creating one Document per row and adding a new Field for each column you would like to search on. You would want to store your row’s Primary Key as a field on the document, so you could then go and fetch the rows after you’ve searched on them.

  5. [...] Getting started with Lucene.NET [...]

  6. [...] rankings, weighting, and others. John Sprunger discourses on his blog JSprunger.com about, “Getting started with Lucene.NET,” describing Lucene-based search as capable of indexing all types of content, “including files, [...]

  7. Mikael says:

    “The same document can be returned multiple times within the search results.”

    So if you ask for 10 results, you can in theory get the same item 10 times? Is this a bug or a feature? All other search products deliver distinct items.

  8. John says:

    Mikael,

    Yes – when I was writing the proof of concept I was seeing the same document returned multiple times if there was more than one hit for the search term within the document. I think it’s a feature rather than a bug – because it allows you to find all of the hits within a single document if necessary.

  9. Mikael says:

    I can see the feature side of it for sure, but it makes it harder to build a 10-hits-per-page solution with paging. Maybe there’s an option somewhere to control this. I haven’t tested Lucene much yet, but I’ll bear this in mind when looking at it :) Thanks for a great tutorial!

  10. pres says:

    Great post! Nice!

  11. FelixMM says:

    Thanks a lot for this post. I’ve spent a couple of days now looking for documentation and/or examples on Lucene.Net and this is by far the most helpful of all.


  13. Dave says:

    This is by far the best introduction to Lucene.NET I’ve read. Excellent post. By following this I went from zero to having Lucene working in my app in a matter of hours!

  14. I’ll finish looking at all this later

  15. sexy says:

    An article truly full of good sense
