Why use CSharpQuery?

A short time ago I was working on a project using SQL Server Full Text Search. This was my first real deep exposure to the engine and it worked pretty well. However, for the project I was working on there were some very serious problems that I was never able to overcome. Try as I might to tweak SQL Server Full Text Search I was never able to tweak it as much as I needed.

Here are some of the problems that I faced while playing with the Full Text Engine:

  1. SQL Server Full Text Engine 2005 (FTE) had some scalability issues. I didn’t so much need it to index hundreds of millions of rows (although it needed to be able to do a lot of records) but mostly I need a lot of queries per second. It seemed to only be able to run on a single processor at at time. If someone did a ridiculously huge query everyone else on that server would come to a halt! On a box with 8 logical processors this was not acceptable!
  2. FTE has no capability for compound words (i.e. the term ‘Riverboat’ and ‘River Boat’). Sure, you could put such pairs in the Thesaurus as expansions, right? Well, I will get to that.
  3. FTE black boxed the word breaker. You might think that this is not a big deal but you would be wrong! FTE considered the ampersand ‘&’ a word breaker when I needed it to not be. For example, in FTE if you did a search for ‘R&B’ the ampersand would break that into the words ‘R’ and ‘B’. Both of such are in the noise words list by default. Therefore, the terms ‘R&B’ and ‘AT&T’, etc, were optimized out of the query by design. Creating your own word breaker is possible but very difficult and not recommended. Also, I needed an underscore ‘_’ to break words and it did not.
  4. The ranking information that came back from FTE was not very good. This is because the ranking could not be used to compare two different queries and also because the ranking data was not very good. The numbers were not very evenly distributed. IE, I might have the top 30 rows with a rank of “34”, and the rest had a rank of “20”. The numbers are also arbitrary and meaningless.
  5. The ‘FREETEXTTABLE’ term is useless! It will only UNION sets rather than INTERSECT them. This means that a search for ‘Sax’ could return 1,254 rows while the term ‘Tenor Sax’ would return 2,259 rows. Every term you add will increase the size of the result rather than decrease it. We had to use the ‘CONTAINSTABLE’ search term but that led to problems with any compound word in the thesaurus and looked awful! Something like: (“Sax” OR “Sax*” OR FORMSOF(“Sax”) OR THESAURUS(“Sax”)) That is for a one term word. Each word would need it’s own set of criteria.
  6. FTE was kind of slow on the larger queries. Returning a set of over 1,000 seemed to be quite a chore! In order to add things to the thesaurus you had to change the XML file and restart the service. You may even have to rebuild the index – I don’t remember.
  7. Every few days or so the searches would slow down a lot! In order to get it speed back up we had to shut down the FTE service and the SQL Server service and rebuild the index. We hoped to write a script to do this chore every so often but for some reason the script that Management Studio generated didn't seem to work the same way.

That's all well and good, but what did I gain by writing my own version?

  1. Most of the code is reentrant and the code that is not uses highly efficient reader/writer locking to make sure the index doesn’t change out from underneath you! This means that I can fully utilize all logical processors on the machine and large queries will not interfere with smaller ones.
  2. It also means that the index can be rebuilt while the old index is still in use. Once the build is complete they can be quickly and easily swapped.
  3. I was able to create ranking that yielded very nice even distribution between every result in the set. (more on this below)
  4. I was able to pull word breaking terms from a file. There is also one file per language. These files are also used to break the search term that was used to create the index. Because it uses the exact same word breaker on both the search terms and the index data we got much better search matching! Terms like ‘R&B’ were indexed and searched for in the same way. Noise words would be dropped the same way as well. It made working with compound words possible.
  5. This search is much faster than SQL Server FTE. All my queries returned in less than 100ms, usually much less!
  6. Ranking could be customized. (more on this below)
  7. I was able to create a ‘FREETEXTTABLE’ style query that reduced the number of results as the search term became more specific.
  8. I could change the thesaurus on the fly.
  9. I was able to manage compound words much more effectively.
  10. I got to implement my own custom Binary Search which was pretty fun!

Last edited Nov 18, 2009 at 9:50 PM by NathanZaugg, version 3

Comments

No comments yet.