Recently we had a use case where a folder including some thousands of hierarchical subfolders had to be moved inside Alfresco to another location.
Our client tried to move the folder, received a timeout error in Share web client and Alfresco server crashed some hours later by an
My first thought was that both actions were not linked, as a simple inspection at move-to.post.json.js revealed that the exception was captured but it was never logged. So, I tried again overwriting this web script to include a simple log sentence inside the catch. Surprisingly the behavior was exactly as before. And the exception was not logged.
I guessed that the process was still running in Alfresco repo, but a Share HTTP timeout had been returned. Executing a
kill -3 to Alfresco Java process, I found following stack trace for the running thread:
Alfresco was trying to find and trigger rules for every subfolder below targeted one (maybe a million) and all this job was being performed on the same database transaction. So, in the end, Alfresco crashed because of the memory consumption required by this huge transaction.
var context = Packages.org.springframework.web.context.ContextLoader.getCurrentWebApplicationContext(); var ruleService = context.getBean('RuleService', Packages.org.alfresco.service.cmr.rule.RuleService); ruleService.disableRules(); logger.warn("After disabling: " + ruleService.isEnabled());
var context = Packages.org.springframework.web.context.ContextLoader.getCurrentWebApplicationContext(); var ruleService = context.getBean('RuleService', Packages.org.alfresco.service.cmr.rule.RuleService); ruleService.enableRules(); logger.warn("After enabling: " + ruleService.isEnabled());
After a while, some AJP timeout errors started to log at Apache HTTP. SOLR, which was installed in another server, had been querying intensively Alfresco. And I discovered that weird Cascade Update aspect.
Every time a node is moved, a two properties (cascadeTx and cascadeCRC) aspect is created for that node. Surprisingly, this aspect is never removed. So every folder that once was moved in the system, has the aspect set.
The aspect is used by SOLR to re-index the content of a moved folder, as it can be seen at SolrInformationServer class, and it includes a full and granular re-index operation for every children below that moved folder. So, as the operation is not designed again for heavy volumes, Tomcat AJP connector (or even HTTP one) can collapse.
And the worse part is that, if the operation has been successfully executed, Cascade Update aspect is not removed. So it lasts forever as junk in every moved node.
Some everyday operations in Alfresco, like moving a folder, seems to be as easy as an update in a database. However they can provoke serious problems in real systems as they involve different operations that has not been designed for heavy use.
Note As this is an enterprise customer, I have raised a new issue at Alfresco Support. I’ll update this post if I receive further information.