Tuesday, March 22, 2011

Implementing search hit highlighting in Alfresco Share




While prospecting on a freelance site, I found a client who had posted a project for a Java search application.

He wanted it to be based on Lucene for indexing and searching, but Alfresco already does that.

He also wanted the application to support various document formats (MS Office, OpenOffice, PDF, ...), and Alfresco does that thanks to its transformers based on OpenOffice.

He also needed a tagging feature; Alfresco already implements that.

He also wanted users to be able to browse the document repository in Windows Explorer, like a remote drive. That is doable with JLAN, a Java implementation of CIFS. Once again, Alfresco provides that.

All his needs in that project were already satisfied by Alfresco, except this one:
...
• Extract of the file showing search terms highlighted.

So I spent the whole night investigating the subject until I found this post. It was actually doable thanks to the Lucene contrib project: Highlighter.

Unfortunately, someone else won the project in the meantime, but it is worth sharing anyway.

In this post we will go through the most important steps to implement this feature on Alfresco Community 3.4.d, SVN revision 25020.

First of all, we need to store the document content when adding documents to the Lucene index, or at least part of it when the content is too big to hold in memory in a String instance. Alfresco doesn't do that: its code only tokenizes the content without storing it, presumably to minimize index size.

Be aware that this feature makes the indexes somewhat bigger, but thankfully Lucene can store that content in a compressed format. We'll expose this option in the model so that you can enable or disable content compression at any time through configuration.

To do that we need to extend the model schema and add a new compressed sub-element to the index element.

[File: /Repository/config/alfresco/model/modelSchema.xsd]

The following are the resulting additions.

[File: /DataModel/source/java/org/alfresco/service/cmr/dictionary/PropertyDefinition.java]

[File: /DataModel/source/java/org/alfresco/repo/dictionary/M2Property.java]

[File: /DataModel/source/java/org/alfresco/repo/dictionary/m2binding.xml]

[File: /DataModel/source/java/org/alfresco/repo/dictionary/M2PropertyDefinition.java]


Now we're done with (the boring) model stuff:

[File: /Repository/config/alfresco/model/contentModel.xml]

Compression can be enabled by setting the compressed element to true.

Next, we continue with the Java code change that stores document content in the index.
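To get a feel for what compression buys you, here is a minimal sketch using the JDK's java.util.zip classes. It is only an illustration of the size trade-off, assuming the stored content is compressed with a standard DEFLATE-style algorithm; it is not the code from the patch, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper illustrating the size effect of compressing stored content. */
final class StoredContentCompressionSketch {

    private StoredContentCompressionSketch() {}

    /** Compresses the UTF-8 bytes of the content with DEFLATE at the best compression level. */
    static byte[] compress(String content) {
        byte[] input = content.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original content; rejects corrupt input with an unchecked exception. */
    static String decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Repetitive text such as typical office documents shrinks a lot, at the price of extra CPU time when indexing and when reading the stored content back for highlighting.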

[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/index/IndexInfo.java]

We've lifted the field length limit by setting it to MaxFieldLength.UNLIMITED.

[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/ADMLuceneIndexerImpl.java]
[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/AVMLuceneIndexerImpl.java]

Here I'm storing a maximum number of characters from the document content, equal to 1/100 of the free memory available at that time. Storing the whole content in the index is not practical: it may exceed the available memory and crash the running JVM. The other reason is that documents may be indexed in batches. We shouldn't consume all of the free memory for just one document; who knows what's coming next in the running thread ...

So you have to play with the -X arguments (such as -Xmx) and tune your JVM to get a bigger extract of content stored.

Notice that we're asking Lucene to store term vectors (more information about term vectors) with offsets and positions, which the highlighter needs. This gives us faster results later, saving the time required to generate them at search time.

Term vectors also take additional space on disk: we sacrifice disk space and indexing time to get faster responses. Since Alfresco indexes large documents asynchronously, this cost is affordable.
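The memory cap described above can be sketched roughly as follows. This is a simplified illustration rather than the code from the patch: the class and method names are hypothetical, and it treats the byte budget directly as a character count.

```java
/** Hypothetical helper: caps the stored extract at a fraction of the JVM's free memory. */
final class ContentExtractLimiter {

    /** 1/100 of the free memory, as described in the post. */
    static final int FREE_MEMORY_DIVISOR = 100;

    private ContentExtractLimiter() {}

    /** Maximum number of characters to store for a given amount of free memory. */
    static int maxStoredChars(long freeBytes) {
        long budget = Math.max(0L, freeBytes / FREE_MEMORY_DIVISOR);
        // A String cannot hold more than Integer.MAX_VALUE characters anyway.
        return (int) Math.min(budget, Integer.MAX_VALUE);
    }

    /** Returns at most maxChars characters from the start of the content. */
    static String extract(String content, int maxChars) {
        if (content == null) {
            return "";
        }
        if (content.length() <= maxChars) {
            return content;
        }
        return content.substring(0, maxChars);
    }

    /** Convenience overload using the free memory of the running JVM at call time. */
    static String extract(String content) {
        return extract(content, maxStoredChars(Runtime.getRuntime().freeMemory()));
    }
}
```

Because the limit is computed at indexing time, the same document can end up with a longer or shorter stored extract depending on how busy the JVM is at that moment.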

That's all for storing content. Now let's move to the interesting part: generating document fragments with highlighted keywords at search time.

The following are the key additions.

[File: /DataModel/source/java/org/alfresco/service/cmr/search/ResultSetSPI.java]

[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/LuceneResultSet.java]

The code that produces fragments and highlights search keywords is between lines 28 and 35.

After retrieving the content and term vector we saved previously, we create a Scorer, tell the Fragmenter the fragment length we want (70), tell the Highlighter the CSS colors associated with the terms having the lowest and highest scores respectively (white, yellow), set MaxDocCharsToAnalyze to the largest possible value (Integer.MAX_VALUE) to give ourselves more chances to hit fragments, and finally ask the Highlighter for a number (3) of fragments.

These are the settings used to get something similar to the screenshot; feel free to choose your own.

At this point, the hardest part is done: we're able to store document content with its term vectors (offsets and positions), and our Lucene ResultSet implementation can fetch highlighted document fragments. Now we have to tell the search component to fetch those fragments!
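To make the idea concrete without pulling in the Lucene JAR, here is a plain-Java sketch of the same steps: cut the stored content into fragments, score each fragment by its number of hits, and return the best ones with the hits wrapped in a colored span. It is neither the Lucene Highlighter API nor the code from the patch; the class name, the fixed-size chunking and the single yellow color are simplifying assumptions (the real highlighter works on tokens, uses term vectors and grades colors by score).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified stand-in for a fragmenter + scorer + formatter. */
final class FragmentHighlighterSketch {

    static final String OPEN = "<span style=\"background:#FFFF00\">";
    static final String CLOSE = "</span>";

    private FragmentHighlighterSketch() {}

    /** A fragment of the content together with its number of hits. */
    private static final class Scored {
        final String text;
        final int hits;

        Scored(String text, int hits) {
            this.text = text;
            this.hits = hits;
        }
    }

    /** Returns up to maxFragments fragments containing the term, best scored first. */
    static List<String> bestFragments(String content, String term,
                                      int fragmentLength, int maxFragments) {
        if (fragmentLength <= 0) {
            throw new IllegalArgumentException("fragmentLength must be positive");
        }
        List<String> result = new ArrayList<>();
        if (content == null || term == null || term.isEmpty()) {
            return result;
        }
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);

        // Fixed-size chunks: a hit that straddles a chunk boundary is missed in this sketch.
        List<Scored> scored = new ArrayList<>();
        for (int start = 0; start < content.length(); start += fragmentLength) {
            int end = Math.min(content.length(), start + fragmentLength);
            String chunk = content.substring(start, end);
            Matcher matcher = pattern.matcher(chunk);
            int hits = 0;
            while (matcher.find()) {
                hits++;
            }
            if (hits > 0) {
                scored.add(new Scored(chunk, hits));
            }
        }

        // Stable sort: fragments with equal scores keep their document order.
        scored.sort(Comparator.comparingInt((Scored s) -> s.hits).reversed());

        for (Scored s : scored.subList(0, Math.min(maxFragments, scored.size()))) {
            // $0 re-inserts the matched text, so the original casing is preserved.
            result.add(pattern.matcher(s.text).replaceAll(OPEN + "$0" + CLOSE));
        }
        return result;
    }
}
```

With the settings from the post, the call would look like FragmentHighlighterSketch.bestFragments(storedContent, "alfresco", 70, 3). Note that this sketch does not HTML-escape the content, which a real implementation would need to do before sending fragments to the browser.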

[File: /DataModel/source/java/org/alfresco/service/cmr/search/ResultSetRow.java]

[File: /Repository/source/java/org/alfresco/repo/search/AbstractResultSetRow.java]

We add an extra property to the ScriptNode wrapper class, since we're going to use this feature in a data web script.

[File: /Repository/source/java/org/alfresco/repo/jscript/ScriptNode.java]

[File: /Repository/source/java/org/alfresco/repo/jscript/Search.java]


Then we update the search data web script.

[File: /Remote API/config/alfresco/templates/webscripts/org/alfresco/slingshot/search/search.lib.js]

[File: /Remote API/config/alfresco/templates/webscripts/org/alfresco/slingshot/search/search.get.json.ftl]

Phew, we've finished implementing the feature!

Now we just have to tell Alfresco Share how to display fragments.

[File: /Slingshot/source/web/components/search/search.js]

[File: /Slingshot/source/web/components/search/search.css]

Remember to add the Lucene highlighter JAR to the project.
[File: /3rd Party/lucene-highlighter-2.4.1.jar]


For completeness, here's the SVN patch for Alfresco CE 3.4.d, revision 25020.

I hope it was useful.

Any questions, remarks or suggestions are welcome :-)