Language detection during indexing
This year Alfresco held a Hack-a-thon in Spring, in addition to the classic Autumn Global Virtual Hack-a-thon that has taken place every October in recent years.
A hack-a-thon is an event in which software developers, architects, interface designers and – to a lesser extent – project managers collaborate in a restricted time span on a project of their choice outside of the normal work environment and its restrictions. The projects can be anything – experimental prototypes and extensions to existing functionality that people normally don’t get around to coding are popular options. In the context of the Alfresco community we have expanded this meaning to cover anything related to the Alfresco product, its ecosystem and community. You can find a detailed view of the projects developed during this hack-a-thon on the Community page.
This year I focused on automatic language detection during indexing. In current Alfresco versions, the content language is set in the Alfresco Repository from the client locale configuration. When indexing, SOLR takes this language from the repository. However, in cross-locale environments, some users upload content in a language different from their client language settings. Identifying the right language should provide better results when searching.
A first approach to this concept has been drafted at:
I used the LangDetect library in the SolrInformationServer class to inspect the first 10k characters of text from every document in order to set the locale based on the detected language. This small tweak allows the content to be indexed with the right locale, instead of relying on the previous, erratic behaviour based on the browser locale detected by the repository.
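The idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the code added to SolrInformationServer: the class and method names are hypothetical, and a toy stop-word counter stands in for the LangDetect library so that the example runs without extra dependencies. What it shows is the shape of the tweak: sample at most the first 10,000 characters, detect a language, and fall back to the repository locale when nothing is detected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not part of the Alfresco SOLR code base.
class LocaleDetectionSketch {

    // Only this many leading characters are inspected.
    static final int MAX_CHARS = 10_000;

    // Toy stop-word lists standing in for a real detector's language profiles.
    static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of", "to", "in")));
        STOP_WORDS.put("es", new HashSet<>(Arrays.asList("el", "la", "los", "y", "es", "de", "en")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "les", "et", "est", "des", "un", "une")));
    }

    // Returns the leading part of the text that will be inspected.
    static String sample(String text) {
        return text.length() <= MAX_CHARS ? text : text.substring(0, MAX_CHARS);
    }

    // Picks the language with the most stop-word hits in the sample;
    // returns the repository locale when there is no text or no hit.
    static Locale detectLocale(String text, Locale repositoryLocale) {
        if (text == null || text.trim().isEmpty()) {
            return repositoryLocale;
        }
        String[] tokens = sample(text).toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best == null ? repositoryLocale : Locale.forLanguageTag(best);
    }

    public static void main(String[] args) {
        // The client locale is English, but the uploaded content is Spanish.
        Locale detected = detectLocale("El gato es negro y la casa es de los vecinos", Locale.ENGLISH);
        System.out.println(detected.getLanguage());
    }
}
```

Capping the sample keeps the detection cost bounded regardless of document size, at the price of ignoring any language change that happens later in the text.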
During the session, this auto-detection feature was tested with a wide catalog of documents in different languages (English, Spanish, French and Tagalog) and the results were very accurate.
I used URLs like the following one to inspect the values of the locale field in SOLR, since the facet.field parameter selects the field whose indexed values are returned, with their counts, by the query:
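As a rough sketch, a faceting request of this kind could be built as follows. The host, port, core name (alfresco), request handler (/select) and field name (LOCALE) are assumptions made for illustration, not values taken from the environment used during the hack-a-thon; only the q, rows, facet, facet.field and wt parameters are standard SOLR query parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the base URL, core and field names are illustrative.
class LocaleFacetQuerySketch {

    // Builds a SOLR query URL that matches every document, returns no rows,
    // and asks for the value counts of the given field.
    static String facetUrl(String baseUrl, String core, String field) {
        try {
            return baseUrl + "/" + core + "/select"
                    + "?q=" + URLEncoder.encode("*:*", "UTF-8")
                    + "&rows=0"
                    + "&facet=true"
                    + "&facet.field=" + URLEncoder.encode(field, "UTF-8")
                    + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(facetUrl("http://localhost:8983/solr", "alfresco", "LOCALE"));
    }
}
```

With rows=0 the response carries only the facet counts, which is enough to check which locales were assigned during indexing.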
At the end of the day, this is my recap:
- When reading the whole document to auto-detect the language, execution time grows proportionally to document length. Reading the whole document could be useful if the Alfresco SOLR model stored the locale property as a multi-valued field, but for now this property is single-valued. This is why I included a text length limit, to control the performance of the feature.
- Tika also provides a langdetect library, but it relies on Google Guava 16.0.1, which is incompatible with the Alfresco SOLR Maven project. Externalising the auto-detection could be considered, using Tika or any other tool like Textract for language detection.
I shared the day in the Hacker Room with my former colleagues Daniel Fernández and David Martos and virtually with many other well-known Alfresco developers. As in past years, Axel Faust and Francesco Corti followed the session throughout the whole sun-to-sun day.
If you didn’t participate this year, it’s time to prepare the Alfresco Global Virtual Hack-a-thon – Autumn 2019!