Tuesday, March 22, 2011

Implementing search hit highlighting in Alfresco Share




While prospecting on a freelance site, I came across a client who had posted a project for a Java search application.

He wanted it to be based on Lucene for indexing and searching; Alfresco already does that.

He also wanted the application to support various document formats (MS Office document types, OpenOffice document types, PDF, ...), which Alfresco handles thanks to its OpenOffice-based transformers.

He also needed tagging; Alfresco already implements that.

And he wanted users to be able to browse the document repository in Windows Explorer, like a remote drive; that is doable with JLAN, the Java implementation of CIFS. Once more, Alfresco provides it.

All of his requirements for that project were already satisfied by Alfresco, except this one:
...
• Extract of the file showing search terms highlighted.

So I spent the whole night investigating the subject until I found this post. It is actually doable thanks to the Lucene contrib Highlighter project.

Unfortunately, someone else won the project in the meantime, but it is still worth sharing.

In this post we will go through the most important steps to implement this feature on Alfresco Community 3.4.d, SVN revision 25020.

First of all, we need to store document content in the Lucene index when adding documents, or at least part of it when the content is too big to hold in memory in a String instance. Alfresco doesn't do that: its code only tokenizes content without storing it, most likely to keep the index small.

Be aware that this feature makes indexes a little bigger, but thankfully Lucene can store that content in a compressed format. We'll expose the option in the content model so that you can enable or disable content compression at any time through configuration.
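As a quick illustration (a minimal sketch, not Alfresco's actual indexing code), Lucene 2.4 can store a field compressed via Field.Store.COMPRESS, so the model-level flag only has to pick the storage mode at indexing time; the field name "TEXT" is a placeholder here.

// Minimal sketch: choosing compressed vs. plain storage for the content field.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ContentFieldSketch
{
    public static void addContent(Document doc, String contentExtract, boolean compressed)
    {
        // Field.Store.COMPRESS (Lucene 2.4) stores the text compressed in the index.
        Field.Store store = compressed ? Field.Store.COMPRESS : Field.Store.YES;
        doc.add(new Field("TEXT", contentExtract, store, Field.Index.ANALYZED));
    }
}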

To do that, we need to extend the model schema and add a new compressed sub-element to the index element.

[File: /Repository/config/alfresco/model/modelSchema.xsd]

And here are the resulting additions.

[File: /DataModel/source/java/org/alfresco/service/cmr/dictionary/PropertyDefinition.java]

[File: /DataModel/source/java/org/alfresco/repo/dictionary/M2Property.java]

[File: /DataModel/source/java/org/alfresco/repo/dictionary/m2binding.xml]

[File: /DataModel/source/java/org/alfresco/repo/dictionary/M2PropertyDefinition.java]
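Roughly speaking, the M2Property side of the change might look like the sketch below, assuming the dictionary beans keep their usual getter/setter pattern; the member names are illustrative, and the real additions in M2Property.java and PropertyDefinition.java may differ.

// Illustrative sketch only: a nullable "compressed" flag carried from the XML model
// down to the indexer. Existing M2Property members are omitted.
public class M2Property
{
    // ... existing fields (name, type, indexed, storedInIndex, ...) omitted ...

    // Nullable so that an absent <compressed> element falls back to "not compressed".
    private Boolean compressed = Boolean.FALSE;

    public boolean isCompressed()
    {
        return compressed != null && compressed.booleanValue();
    }

    public void setCompressed(Boolean compressed)
    {
        this.compressed = compressed;
    }
}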


Now we're done with the (boring) model stuff:

[File: /Repository/config/alfresco/model/contentModel.xml]

Compression can be enabled by setting the compressed element to true.

Next, we move on to the Java code changes that store document content in the index.

[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/index/IndexInfo.java]

We've removed the field length limit by setting it to MaxFieldLength.UNLIMITED.
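In plain Lucene 2.4 terms, that change amounts to something like the sketch below (the real edit lives in IndexInfo.java's writer setup); without it, IndexWriter stops indexing a field after the default limit of 10,000 terms.

// Sketch only: opening an IndexWriter with no field length limit.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class WriterSketch
{
    public static IndexWriter openWriter(String indexPath) throws Exception
    {
        Directory dir = FSDirectory.getDirectory(indexPath);
        // UNLIMITED lets very large documents be indexed (and stored) in full.
        return new IndexWriter(dir, new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    }
}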

[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/ADMLuceneIndexerImpl.java]
[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/AVMLuceneIndexerImpl.java]

Here I'm storing at most a limited number of characters from the document content: 1/100 of the free memory available at that time. It is not practical to store the whole content in the index, since it may exceed the available memory and crash the running JVM. The other reason is that documents may be indexed in batches: we shouldn't consume all of the free memory for just one document, who knows what's coming next in the running thread...

So you'll have to play with -X arguments and tune your JVM to get a bigger extract of the content stored.

Notice that we're asking Lucene to store term vectors (more information about term vectors) with offsets and positions, which are needed by the highlighter. This will give us faster results later, saving the time otherwise required to generate them at search time.

Term vectors also take additional space on disk: we sacrifice disk space and indexing time to get faster responses. Since Alfresco indexes large documents asynchronously, the trade-off is affordable.
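Put together, the indexing side looks roughly like the following sketch (not the actual ADMLuceneIndexerImpl/AVMLuceneIndexerImpl code; the field name "TEXT" is a placeholder).

// Sketch: store a memory-bounded, compressed extract of the text with term vectors.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HighlightableContentSketch
{
    public static void addContent(Document doc, String content)
    {
        // Cap the stored extract at 1/100th of the currently free heap, so one large
        // document cannot exhaust memory while a batch of others is being indexed.
        int max = (int) Math.min(Runtime.getRuntime().freeMemory() / 100, content.length());
        String extract = content.substring(0, max);

        // Store the (compressed) extract together with term vectors carrying positions
        // and offsets, so the highlighter can work without re-analysing the text.
        doc.add(new Field("TEXT", extract, Field.Store.COMPRESS, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
    }
}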

That's all for storing content. Now let's move to the interesting part: generating document fragments with highlighted keywords at search time.

Following are the key additions.

[File: /DataModel/source/java/org/alfresco/service/cmr/search/ResultSetSPI.java]

[File: /Repository/source/java/org/alfresco/repo/search/impl/lucene/LuceneResultSet.java]

The code that produces fragments and highlights search keywords is between lines 28 and 35.

After retrieving the content and term vector we saved previously, we create a Scorer, tell the Fragmenter the fragment length we want (70), give the Highlighter the CSS colors associated with the lowest- and highest-scoring terms (white and yellow), set MaxDocCharsToAnalyze to the largest possible value to maximize our chances of hitting fragments (Integer.MAX_VALUE), and finally ask the Highlighter for a number of fragments (3).

These are the settings used to get something similar to the screenshot; feel free to use your own.
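As a self-contained sketch of that highlighting step, assuming the Lucene 2.4.1 contrib highlighter, the whole thing goes roughly as follows; the field name "TEXT" and the maxScore of 1.0f are placeholders, and depending on the highlighter release the analysis-limit setter is named setMaxDocBytesToAnalyze (2.4.x) or setMaxDocCharsToAnalyze (later versions).

// Sketch: build highlighted fragments from the stored content and term vector.
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.GradientFormatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.TokenSources;

public class FragmentSketch
{
    public static String[] bestFragments(IndexReader reader, int docId, Query query)
            throws Exception
    {
        // Content and term vector stored at indexing time.
        String content = reader.document(docId).get("TEXT");
        TermPositionVector termPosVector =
                (TermPositionVector) reader.getTermFreqVector(docId, "TEXT");

        // Background gradient from white (lowest-scoring terms) to yellow (highest).
        GradientFormatter formatter =
                new GradientFormatter(1.0f, null, null, "#FFFFFF", "#FFFF00");
        Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(70)); // ~70-character fragments

        // Analyse as much of the stored extract as possible (setter name varies by version).
        highlighter.setMaxDocBytesToAnalyze(Integer.MAX_VALUE);

        // Rebuild a TokenStream from the stored term vector instead of re-analysing the
        // text; do not reuse the stream that was used to build the scorer.
        TokenStream tokenStream = TokenSources.getTokenStream(termPosVector);
        return highlighter.getBestFragments(tokenStream, content, 3);
    }
}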

At this point, the hardest part is done: we're able to store document content with its term position vector, and our Lucene ResultSet implementation can fetch highlighted document fragments. Now we have to tell the search component to fetch those fragments!

[File: /DataModel/source/java/org/alfresco/service/cmr/search/ResultSetRow.java]

[File: /Repository/source/java/org/alfresco/repo/search/AbstractResultSetRow.java]

We add an extra property to the ScriptNode wrapper class, since we're going to use this feature in a data web script.
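The shape of that addition is roughly a simple accessor on the wrapper, as in the illustrative sketch below (member names are hypothetical; the real change is in the files that follow): the wrapper just exposes the fragments produced by the result set so search.lib.js can copy them into the web script model.

// Illustrative sketch only: existing ScriptNode members are omitted.
public class ScriptNode
{
    // ... existing ScriptNode members omitted ...

    private String[] fragments;

    /** Fragments with highlighted search terms, or null when none were produced. */
    public String[] getFragments()
    {
        return fragments;
    }

    public void setFragments(String[] fragments)
    {
        this.fragments = fragments;
    }
}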

[File: /Repository/source/java/org/alfresco/repo/jscript/ScriptNode.java]

[File: /Repository/source/java/org/alfresco/repo/jscript/Search.java]


Then we update the search data web script.

[File: /Remote API/config/alfresco/templates/webscripts/org/alfresco/slingshot/search/search.lib.js]

[File: /Remote API/config/alfresco/templates/webscripts/org/alfresco/slingshot/search/search.get.json.ftl]

Phew, we've finished implementing the feature!

Now we just have to tell Alfresco Share how to display the fragments.

[File: /Slingshot/source/web/components/search/search.js]

[File: /Slingshot/source/web/components/search/search.css]

Remember to add the Lucene highlighter jar to the project.
[File: /3rd Party/lucene-highlighter-2.4.1.jar]


For completeness, here's the SVN patch for Alfresco CE 3.4.d, revision 25020.

I hope it was useful.

Any questions, remarks or suggestions are welcome :-)

21 comments:

  1. Hi,

    We tried this approach first and then changed it because the Lucene index was growing quite large in production. Instead we created a child association to the content and stored the text version in the repository. Similar to how renditions are stored.

    Ainga

  2. Hi Aingaran,

    That would require querying the node service for each node to get its content at the LuceneResultSet level.

    That may slow down search response time significantly, but if the index size becomes too big to put up with, one should go with your solution for content storage.

    Thanks,
    Aymen

  3. Hi

    Good stuff.

    I would store non-d:content properties compressed in the index and d:content properties, after transformation, in a separate content store (as the transformation may not be repeatable). Alfresco is adding SOLR support which will (eventually) include possible highlighting with this model.

    The dictionary will get some extra properties to control highlighting, starting along the lines you have suggested.

    Andy Hind

  4. Hi Andy,

    I'm glad to hear that!

    It would also be interesting if the formatting types were accessible to users from configuration, like bold style (SimpleHTMLFormatter) or color style (GradientFormatter), or any other formatting types you plan to add.

    I'm definitely convinced that the content should go out of the index. I'm probably going to implement Ainga's idea, but if I do, MaxDocCharsToAnalyze should not remain unlimited.

    What do you (Ainga and Andy) think about it?

    Thank you :)

    Regards,

  5. Hi Aymen,

    really good article!

    I patched my local version successfully but the fragments are always empty. I checked that the content is really stored inside the index.

    But "highlighter.getBestFragments" always returns an empty array.

    Do you have any suggestions?

    Many thanks,
    Martin

  6. Hi Martin,

    Have you tried getting fragments with a small document?

    Is there any exception related to this in the logs?

    In case you've changed the code, there's one thing worth mentioning: you should not reuse the TokenStream used to create the scorer when invoking highlighter.getBestFragments(); you must create a new one with TokenSources.getTokenStream(termPosVector).

    If you still don't get fragments, we'll have to schedule a remote session on your machine to inspect this with a debugger.

    Keep me posted either way :)

    Thank you,
    Aymen

  7. Hi Aymen,

    thanks for the fast response.

    I checked out revision 25020 and applied your patch again. After deploying the new WAR and recreating the indexes, it works successfully with Alfresco Share!!!

    Now I've tested it with a web script, and there the fragment is always empty.

    This is the code snippet:
    ...
    <#if doc.fragments??>
    <#list doc.fragments as fragment>${fragment}

    ...

  8. Hi again,

    I solved the problem on my own!

    The problem was in the query string: I was searching inside the TEXT property and not inside the index itself. That was the reason why I got results but no fragments.

    So the syntax of the query is very important!

    Regards,
    Martin

  9. Hi Martin,

    I'm glad to hear that and thank you for your interest.

    I'm going to rewrite this solution to avoid the index growing too big, as mentioned in the previous comments.

    You may check the blog in a while if you're interested.

    Regards,
    Aymen

  10. Hi Aymen,

    Great work! Did you open a jira at Alfresco to attach the patch there? It'd be great to see this work end up in the main product.

    Cheers,

    --Kurt

  11. Hi Kurt,

    Thank you for your comment :-)

    However, it needs to be improved a little to store content somewhere other than the index; if I manage to find time to do that, I'll let the Alfresco guys know about it.

    Best regards,

  12. Thanks Aymen,

    Looking forward to it. You may want to open a jira for it already to get it on their radar :). Maybe they will even help since I think it would be a great feature for them to have in the base product.

    Cheers,

    --Kurt

  13. Hi Aymen,
    I am not able to do it. I have created the package structure and edited all the classes that you mentioned in your post.
    I have edited modelSchema.xsd and contentModel.xml directly for the time being. I was not able to find the m2binding.xml file, so I created it.
    With the Ant script I am creating a jar and putting it in the lib folder of Alfresco. But it is not working... Please help... :-(

  14. Are you using Alfresco CE 3.4.d, revision 25020?

  15. This comment has been removed by the author.

  16. This comment has been removed by the author.

  17. This comment has been removed by the author.

  18. This comment has been removed by the author.

  19. This comment has been removed by the author.
  20. Hi,
    I hope everyone hasn't gone home! This looks like a brilliant and fundamental piece of work. I'm just starting out with Alfresco 4.0 CE. Can I use this patch? Or has the SOLR integration happened? What is the current status of this subject?

    Thanks
    Simon

  21. Hello,

    I found a bug in your code at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl:

    int max = (int)Runtime.getRuntime().freeMemory() / 100;

    If you have more than 2 GB of free memory, this always returns a negative number, because the cast to int is applied before the division and Integer.MAX_VALUE = 2^31 - 1.

    Please fix it as follows:

    int max = (int)(Runtime.getRuntime().freeMemory() / 100);
