Compound Words

Another great feature of CSharpQuery is the tweaked thesaurus for compound words. If you search for "Bull Fight" or "Bullfight", I would expect that:
  1. Any time the term "Bull Fight" is found, the term "Bullfight" is also found.
  2. Because this is explicitly defined as a compound word, the results do not match the text "I got into a fight with a bull".
  3. The search yields the same number of results with either query (i.e., the results should be identical).

Let's say you are looking for a song titled “Riverboat Shuffle”. In SQL FTE, a simple thesaurus lookup on “Riverboat” also gives us “River Boat”. This means it returns all results containing “river” and “boat”, but not necessarily together. In my version of the thesaurus, the queries “Riverboat Shuffle” and “River Boat Shuffle” return exactly the same results.
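The behavior described above can be illustrated with a minimal sketch. This is not CSharpQuery's actual implementation; it assumes one possible approach, in which the spaced form of each compound word is rewritten to its joined form before terms are compared. The names `COMPOUNDS`, `normalize`, and `matches` are hypothetical.

```python
import re

# Hypothetical compound-word pairs: (spaced form, joined form).
COMPOUNDS = [("bull fight", "bullfight"), ("river boat", "riverboat")]

def normalize(text):
    """Lowercase the text and rewrite each spaced form to its joined form."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    joined = " ".join(words)
    for spaced, compound in COMPOUNDS:
        # Only adjacent words are merged, so "fight with a bull" is left alone.
        joined = re.sub(r"\b" + re.escape(spaced) + r"\b", compound, joined)
    return joined.split()

def matches(query, document):
    """True if every normalized query term occurs in the normalized document."""
    doc_terms = set(normalize(document))
    return all(term in doc_terms for term in normalize(query))

print(matches("Bull Fight", "We watched a bullfight"))          # True
print(matches("Bullfight", "I got into a fight with a bull"))   # False
```

Because both forms collapse to the same term, "Riverboat Shuffle" and "River Boat Shuffle" become the same query in this sketch, which is why they return the same results.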

To define a compound word, we simply add an entry to our "Thesaurus.global.xml", "Thesaurus.en-US.xml", etc. Remember: it's important to put the compound word in the bottom expansion, entries should appear only in pairs, and there should be no duplicates, even between the global and culture-specific thesauruses.
<XML ID="Microsoft Search Thesaurus">
  <thesaurus>
    <expansion>
      <sub>bull fight</sub>
      <sub>bullfight</sub>
    </expansion>
  </thesaurus>
</XML>
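The rules above (entries in pairs, no duplicates across files) can be checked mechanically. The following is a rough sketch using Python's standard `xml.etree.ElementTree`; it is not part of CSharpQuery. The function name `check_thesaurus` is hypothetical, and it assumes that a "duplicate" means the same pair of entries appearing more than once.

```python
import xml.etree.ElementTree as ET

def check_thesaurus(*xml_texts):
    """Return a list of problems found across one or more thesaurus documents."""
    problems = []
    seen = set()  # pairs already encountered, shared across all documents
    for text in xml_texts:
        root = ET.fromstring(text)
        for expansion in root.iter("expansion"):
            subs = [(s.text or "").strip().lower() for s in expansion.findall("sub")]
            if len(subs) != 2:
                problems.append(f"expansion {subs} does not have exactly two entries")
            key = frozenset(subs)
            if key in seen:
                problems.append(f"duplicate expansion {sorted(key)}")
            seen.add(key)
    return problems

SAMPLE = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus>
    <expansion>
      <sub>bull fight</sub>
      <sub>bullfight</sub>
    </expansion>
  </thesaurus>
</XML>"""

print(check_thesaurus(SAMPLE))  # []
```

Passing the contents of both the global and the culture-specific file in one call lets the check catch pairs that are repeated between them.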


This is also how you would add regular thesaurus lookups:
    <expansion>
      <sub>mad</sub>
      <sub>angry</sub>
    </expansion>


NOTE: The thesaurus that comes with this download is not necessarily a great one. I used a dictionary to find compound words, but discovered that many words look like compound words to a computer when they are not. For example, “monkey” was found from “mon” and “key”, but “mon key” is not the same as “monkey”. It is your responsibility to clean up the thesaurus. If you would like, you can also send it back to me when you're done so everyone else can benefit.
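To see why false positives such as “mon key” appear, here is a small sketch of a naive dictionary-based search for compound words. It is only a guess at the general approach, not the script actually used to build the bundled thesaurus, and the name `candidate_compounds` is hypothetical.

```python
def candidate_compounds(words):
    """Yield (spaced form, joined form) for every word that splits into two listed words."""
    vocabulary = set(words)
    for word in sorted(vocabulary):
        for i in range(1, len(word)):
            left, right = word[:i], word[i:]
            # A purely mechanical test: both halves are words, meaning is ignored.
            if left in vocabulary and right in vocabulary:
                yield (f"{left} {right}", word)

words = ["bull", "fight", "bullfight", "mon", "key", "monkey"]
print(list(candidate_compounds(words)))
# [('bull fight', 'bullfight'), ('mon key', 'monkey')]
```

The check cannot tell a real compound from a coincidental split, so the output of any approach like this still needs to be reviewed by hand.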

Last edited Oct 16, 2009 at 6:40 PM by NathanZaugg, version 2
