The real genius in the CSharpQuery code is the ranking. As mentioned above, the ranking in FTE has two fatal flaws: it doesn’t allow the rankings to be used outside of the result set, and the ranking values are pretty scant. In a typical result set of 100 rows, there might be only 5-10 distinct ranking values. This means that large sections of the results appear unsorted or unranked! It is a pretty difficult problem to come up with a ranking algorithm that is both universal in its application and provides a ranking so unique and precise as to be almost synonymous with a hash.

This is accomplished by using these four filters:

  • Word Proximity – How close are the search terms to each other?
  • Multiple Occurrence – How many times does the search term appear in the indexed phrase?
  • Low Phrase Index – Are the search terms found near the beginning of the phrase, perhaps in a title?
  • Word Matching – Did we find the exact words or did we use the thesaurus or front word matching to find this row?
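
To make the four filters concrete, here is a minimal Python sketch of how each one could map a row to a score between 0 and 1. It is not the CSharpQuery implementation: the function names, the formulas, the `cap` parameter, and the `MATCH_QUALITY` values are all assumptions chosen for illustration.

```python
# Illustrative only: names, formulas, and constants are assumptions,
# not taken from the CSharpQuery source.

def proximity_score(positions):
    """Word Proximity: positions holds one word index per matched search term."""
    if len(positions) < 2:
        return 1.0
    span = max(positions) - min(positions)
    if span == 0:
        return 1.0
    # The tightest possible span for n distinct terms is n - 1 (adjacent words).
    return min(1.0, (len(positions) - 1) / span)

def occurrence_score(count, cap=5):
    """Multiple Occurrence: more hits score higher, saturating at an assumed cap."""
    return min(count, cap) / cap

def low_index_score(first_position, phrase_length):
    """Low Phrase Index: a match at the start of the phrase scores 1.0."""
    if phrase_length <= 1:
        return 1.0
    return 1.0 - first_position / (phrase_length - 1)

# Assumed quality values: exact matches beat front-word and thesaurus matches.
MATCH_QUALITY = {"exact": 1.0, "front": 0.6, "thesaurus": 0.4}

def word_match_score(match_kinds):
    """Word Matching: average quality over how each search term was matched."""
    return sum(MATCH_QUALITY[kind] for kind in match_kinds) / len(match_kinds)

if __name__ == "__main__":
    print(proximity_score([3, 4]), proximity_score([0, 4]))
    print(occurrence_score(2), low_index_score(0, 10))
    print(word_match_score(["exact", "thesaurus"]))
```

Each function depends only on facts about a single row, which matches the idea that rows are ranked independently of one another.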

The ranking of results is the most complex part of the code. It is also the most processor-intensive. Each row is ranked independently by each of the four filters, and each filter gives the row a score between 0 and 1. To produce the final rank, each filter's score is weighted, since some filters are better than others at finding what you are looking for. With this setup, the less discriminating filters end up acting as tie-breakers. The result is a nice, even distribution of rank values across the entire result set. The final rank is also a number between 0 and 1.
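
The weighting step can be sketched as a normalized weighted sum. This is a hedged illustration rather than the actual CSharpQuery code: the `WEIGHTS` values, the filter keys, and the `final_rank` function are assumptions. Dividing by the total weight keeps the final rank between 0 and 1 whenever every filter score is.

```python
# Illustrative only: the weights and names below are assumptions,
# not the values used by CSharpQuery.
WEIGHTS = {
    "proximity": 0.4,
    "word_match": 0.3,
    "occurrence": 0.2,
    "low_index": 0.1,  # small weight: mostly acts as a tie-breaker
}

def final_rank(scores):
    """Combine per-filter scores (each in [0, 1]) into one rank in [0, 1]."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted / total_weight

# Two rows that tie on the heavily weighted filters...
row_a = {"proximity": 1.0, "word_match": 1.0, "occurrence": 0.4, "low_index": 0.9}
row_b = {"proximity": 1.0, "word_match": 1.0, "occurrence": 0.4, "low_index": 0.2}

if __name__ == "__main__":
    # ...are separated by the lightly weighted one.
    print(final_rank(row_a), final_rank(row_b))
```

Because the low-weight filter still contributes a small amount, rows that would otherwise share a rank get distinct values, which is what spreads the ranks evenly across the result set.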

Last edited Oct 16, 2009 at 7:40 PM by NathanZaugg, version 2

