Umbraco CMS 4.7.2: Enabling Stemming Search and Highlighting

Umbraco CMS, created by Umbraco, is a popular ASP.NET 4 based CMS that is completely free and open source, with an active community backed by Umbraco staff. Although Umbraco does not make any money from selling its CMS, the company offers paid training and support.

Apache Lucene.Net

Its search and indexing are provided by Apache Lucene.Net, a C# port of the well-known Apache Lucene engine. This core engine is wrapped by Examine and UmbracoExamine, which expose a simple but powerful configuration so that you can enable site search right away.

Getting search enabled was not a lot of work. Following this post, I was able to stitch together a basic search that showed results with matched nodes. What it did not do was the most obvious search feature: stemming. It provided only basic, one-to-one, exact-match search, and in this day and age, with Google at everyone’s fingertips, such a limited search drew literal ridicule from end users. So my quest for the holy grail of stemming search began.

Snowball Analyzer and Lucene.Net v2.9.4.1 with Contrib Add-on

One of the greatest strengths of Lucene is its support for many types of analyzers. With some research, I found that the Snowball Analyzer is what makes stemming search possible. The Snowball stemmer for English is a refinement of the original Porter stemming algorithm that overcomes some of its weaknesses and enables more robust stemmed searching and indexing.
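To make the difference between exact-match and stemmed search concrete, here is a minimal Python sketch. The `toy_stem` function and its suffix list are invented for illustration only. It is a crude suffix stripper, not the Snowball algorithm, and it is far less accurate, but it shows why comparing stems lets a query for "searching" find text that only contains "searched".

```python
# Toy illustration only: a crude suffix stripper, NOT the Snowball algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, checked in this order


def toy_stem(word):
    """Lowercase the word and strip at most one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def exact_match(query, document):
    """Exact, one-to-one term matching (case-insensitive)."""
    return query.lower() in (w.lower() for w in document.split())


def stemmed_match(query, document):
    """Compare stems instead of surface forms, as a stemming analyzer would."""
    return toy_stem(query) in {toy_stem(w) for w in document.split()}
```

With the text "We searched the archives", `exact_match("searching", ...)` finds nothing, while `stemmed_match("searching", ...)` matches because both words reduce to the stem "search". This is also why the same analyzer has to be applied at index time and at query time.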

It turned out that Umbraco CMS v4.7.2 shipped with Lucene.Net v2.9.2.2, a few minor releases short of v2.9.4.1, which has Lucene.Net Contrib add-on support. The Contrib add-on enables other powerful features such as faceted search and spatial search, and adds support for the Snowball Analyzer and a search results highlighter.

Implementation

This meant that we needed to upgrade Lucene.Net to v2.9.4.1, upgrade UmbracoExamine to v1.4.1 (which uses Lucene.Net v2.9.4.1), and finally create a separate class library that brings the Snowball Analyzer from the Contrib add-on into Umbraco. We can then reference this new class from the Examine settings configuration as an alternate analyzer.

To start, download Lucene.Net v2.9.4.1 and Lucene.Net Contrib v2.9.4.1, rename each .nupkg file to .zip, and grab Lucene.Net.dll, Lucene.Net.Contrib.Core.dll, Lucene.Net.Contrib.Snowball.dll, and Lucene.Net.Contrib.Highlighter.dll.
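Because a .nupkg file is an ordinary zip archive, the renaming step can also be scripted. The following is a minimal Python sketch, not part of the original project: the function name `extract_dlls` and any paths you pass to it are assumptions. It makes no assumption about where the DLLs sit inside the package and simply matches on file name, taking the first copy it finds.

```python
import os
import zipfile


def extract_dlls(nupkg_path, wanted, out_dir):
    """Copy the named DLLs out of a .nupkg (a zip archive) into out_dir.

    Returns the sorted list of DLL names that were found and extracted.
    """
    os.makedirs(out_dir, exist_ok=True)
    found = []
    with zipfile.ZipFile(nupkg_path) as package:
        for entry in package.namelist():
            name = entry.rsplit("/", 1)[-1]
            # Match on the file name only, wherever it sits in the package.
            # If a package carries several copies, the first one wins.
            if name in wanted and name not in found:
                with open(os.path.join(out_dir, name), "wb") as target:
                    target.write(package.read(entry))
                found.append(name)
    return sorted(found)
```

If a package ships builds for more than one .NET framework version, check that the copy this sketch picks is the one your site targets, or extract that DLL by hand.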

Then, using Visual Studio 2010 or above, create a LuceneNetContrib Class Library project by following these instructions. You will need to reference most of the copied DLLs in the project. Compile the release version and grab LuceneNetContrib.dll from the obj/Release folder.

We also need the UmbracoExamine v1.4.2 DLLs, since the version that ships with Umbraco was compiled against Lucene.Net v2.9.2.2 and will complain when it cannot find that assembly. Download it and grab all of the DLLs that come in the zip file.

You will now have the following DLLs:

  • Lucene.Net.dll
  • Lucene.Net.Contrib.Core.dll
  • Lucene.Net.Contrib.Snowball.dll
  • Lucene.Net.Contrib.Highlighter.dll
  • LuceneNetContrib.dll
  • Examine.dll
  • ICSharpCode.SharpZipLib.dll
  • UmbracoExamine.dll

Copy them to the Umbraco CMS site’s bin directory, overwriting any conflicting files.
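The copy step is easy to get partly wrong, for example by leaving an old Lucene.Net.dll next to a new UmbracoExamine.dll. Here is a minimal Python sketch that checks all eight DLLs are present before overwriting anything. The function name `deploy_dlls` and the staging and bin paths are assumptions for illustration.

```python
import os
import shutil

# The eight DLLs listed above.
REQUIRED_DLLS = [
    "Lucene.Net.dll",
    "Lucene.Net.Contrib.Core.dll",
    "Lucene.Net.Contrib.Snowball.dll",
    "Lucene.Net.Contrib.Highlighter.dll",
    "LuceneNetContrib.dll",
    "Examine.dll",
    "ICSharpCode.SharpZipLib.dll",
    "UmbracoExamine.dll",
]


def deploy_dlls(staging_dir, bin_dir, required=REQUIRED_DLLS):
    """Copy every required DLL from staging_dir into bin_dir, overwriting.

    Raises FileNotFoundError before copying anything if a DLL is missing.
    Returns the number of files copied.
    """
    missing = [dll for dll in required
               if not os.path.isfile(os.path.join(staging_dir, dll))]
    if missing:
        raise FileNotFoundError("Missing DLLs: " + ", ".join(missing))
    for dll in required:
        shutil.copy2(os.path.join(staging_dir, dll), os.path.join(bin_dir, dll))
    return len(required)
```

Back up the existing bin directory first so that you can roll back if the site fails to start.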

Now, as a last step, we need to tell UmbracoExamine to use the Snowball analyzer class library we just created instead of the StandardAnalyzer. Edit the config file \config\ExamineSettings.config:

<Examine>
  <ExamineIndexProviders>
    <providers>
      <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
      <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
      <!-- custom indexer -->
      <add name="BrandCenterSearchIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="true" interval="10" analyzer="LuceneNetContrib.EnglishSnowballAnalyzer, LuceneNetContrib" indexSet="BrandCenterSearchIndexSet" />
    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="InternalSearcher">
    <providers>
      <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
      <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
      <!-- custom searcher -->
      <add name="BrandCenterSearchSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="LuceneNetContrib.EnglishSnowballAnalyzer, LuceneNetContrib" indexSet="BrandCenterSearchIndexSet" enableLeadingWildcards="true" />
    </providers>
  </ExamineSearchProviders>
</Examine>

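Note that the Snowball analyzer has to be referenced on both the index provider and the search provider. If the two differ, the query terms are not stemmed the same way as the indexed terms and searches will miss matches.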
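A mismatch between the two analyzer attributes is easy to overlook in a long config file. The following minimal Python sketch reads the configuration as a string and reports every indexSet whose index and search providers name different analyzers. The function names are assumptions for illustration, and providers without an explicit indexSet attribute (such as the internal ones above) are skipped.

```python
import xml.etree.ElementTree as ET


def analyzers_by_index_set(config_xml):
    """Map each indexSet to the analyzers of its index and search providers."""
    root = ET.fromstring(config_xml)
    result = {}
    for section, role in (("ExamineIndexProviders", "index"),
                          ("ExamineSearchProviders", "search")):
        for provider in root.findall("./%s/providers/add" % section):
            index_set = provider.get("indexSet")
            if index_set is None:
                continue  # providers without an explicit indexSet are skipped
            result.setdefault(index_set, {})[role] = provider.get("analyzer")
    return result


def mismatched_index_sets(config_xml):
    """Return index sets whose index and search analyzers differ or are missing."""
    return sorted(
        name for name, roles in analyzers_by_index_set(config_xml).items()
        if roles.get("index") is None or roles.get("index") != roles.get("search")
    )
```

For the configuration shown above, BrandCenterSearchIndexSet uses LuceneNetContrib.EnglishSnowballAnalyzer on both sides, so the check should report no mismatches.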

Now we need to delete the existing index so that the site can be re-indexed using the Snowball Analyzer. The index location is set in ExamineIndex.config; usually it is the \wwwroot\App_Data\TEMP\ExamineIndexes folder. Delete the existing index folder and perform an IISRESET. As soon as IIS restarts, you will see UmbracoExamine reindex the entire site by recreating the index folder and putting content in it.
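If you rebuild the index often while experimenting with analyzers, the delete step can be scripted. This is a minimal Python sketch under stated assumptions: the function name `clear_index_folder` is hypothetical, you must pass in the index root and folder name from your own ExamineIndex.config, and IISRESET still has to be run separately from an elevated command prompt.

```python
import os
import shutil


def clear_index_folder(index_root, index_name):
    """Delete one index folder under index_root; return True if it existed."""
    target = os.path.join(index_root, index_name)
    # Refuse names that would escape index_root (for example "..").
    if os.path.dirname(os.path.abspath(target)) != os.path.abspath(index_root):
        raise ValueError("index_name must be a direct child of index_root")
    if not os.path.isdir(target):
        return False
    shutil.rmtree(target)
    return True
```

The guard against names such as ".." is there because a recursive delete with a mistyped path can remove far more than the index.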

Sometimes UmbracoExamine may have trouble reindexing. If so, delete the index folder again and perform another IISRESET to get the site reindexed properly. In addition to enabling stemming, I also incorporated a highlighter. The complete project can be found on my GitHub.

Conclusion

Umbraco CMS is a nifty open source CMS with the powerful Lucene engine integrated. With some effort, we saw how to add additional analyzers like the Snowball Analyzer and even enable search results highlighting.

Download all of the source code, library packages, and configuration files, as well as the Umbraco CMS search results module, from my GitHub.

Enjoy!
