Alfresco: what happens when a folder is moved

Recently we had a use case where a folder including some thousands of hierarchical subfolders had to be moved inside Alfresco to another location.

alfresco-move-folder

Our client tried to move the folder, received a timeout error in Share web client and Alfresco server crashed some hours later by an OutOfMemoryError.

My first thought was that both actions were not linked, as a simple inspection at move-to.post.json.js revealed that the exception was captured but it was never logged. So, I tried again overwriting this web script to include a simple log sentence inside the catch. Surprisingly the behavior was exactly as before. And the exception was not logged.

I guessed that the process was still running in Alfresco repo, but a Share HTTP timeout had been returned. Executing a kill -3 to Alfresco Java process, I found following stack trace for the running thread:

"ajp-apr-8009-exec-8" #84 daemon prio=5 os_prio=0 tid=0x00007fabe8015800 nid=0x5fb7 runnable [0x00007fab9f402000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)

    ...

	at org.alfresco.repo.domain.node.AbstractNodeDAOImpl.getChildAssocs(AbstractNodeDAOImpl.java:3484)

    ...

    at org.alfresco.repo.rule.ruletrigger.OnMoveNodeRuleTrigger.triggerChildrenRules(OnMoveNodeRuleTrigger.java:84)
	at org.alfresco.repo.rule.ruletrigger.OnMoveNodeRuleTrigger.triggerChildrenRules(OnMoveNodeRuleTrigger.java:86)
	at org.alfresco.repo.rule.ruletrigger.OnMoveNodeRuleTrigger.onMoveNode(OnMoveNodeRuleTrigger.java:69)

    ...

    at org.alfresco.repo.model.filefolder.FileFolderServiceImpl.moveOrCopy(FileFolderServiceImpl.java:1115)
	at org.alfresco.repo.model.filefolder.FileFolderServiceImpl.moveFrom(FileFolderServiceImpl.java:997)

    ...
    
    at org.mozilla.javascript.gen.classpath__alfresco_extension_templates_webscripts_org_alfresco_slingshot_documentlibrary_action_move_to_post_json_js_9._c_runAction_19(classpath*:alfresco/extension/templates/webscripts/org/alfresco/slingshot/documentlibrary/action/move-to.post.json.js:930)

Alfresco was trying to find and trigger rules for every subfolder below targeted one (maybe a million) and all this job was being performed on the same database transaction. So, in the end, Alfresco crashed because of the memory consumption required by this huge transaction.

I decided to try again the operation but disabling globally rules previously (thanks for this, Douglas CR Paes) by using a script in JavaScript Console.

var context = Packages.org.springframework.web.context.ContextLoader.getCurrentWebApplicationContext();
var ruleService = context.getBean('RuleService', Packages.org.alfresco.service.cmr.rule.RuleService);

ruleService.disableRules();
logger.warn("After disabling: " + ruleService.isEnabled());

Once disabled, the operation run softly, so I enabled rules again in JavaScript Console.

var context = Packages.org.springframework.web.context.ContextLoader.getCurrentWebApplicationContext();
var ruleService = context.getBean('RuleService', Packages.org.alfresco.service.cmr.rule.RuleService);

ruleService.enableRules();
logger.warn("After enabling: " + ruleService.isEnabled());

After a while, some AJP timeout errors started to log at Apache HTTP. SOLR, which was installed in another server, had been querying intensively Alfresco. And I discovered that weird Cascade Update aspect.

alfresco-cascade-update

Every time a node is moved, a two properties (cascadeTx and cascadeCRC) aspect is created for that node. Surprisingly, this aspect is never removed. So every folder that once was moved in the system, has the aspect set.

alfresco-search-cascade

The aspect is used by SOLR to re-index the content of a moved folder, as it can be seen at SolrInformationServer class, and it includes a full and granular re-index operation for every children below that moved folder. So, as the operation is not designed again for heavy volumes, Tomcat AJP connector (or even HTTP one) can collapse.

And the worse part is that, if the operation has been successfully executed, Cascade Update aspect is not removed. So it lasts forever as junk in every moved node.

Some everyday operations in Alfresco, like moving a folder, seems to be as easy as an update in a database. However they can provoke serious problems in real systems as they involve different operations that has not been designed for heavy use.

Note As this is an enterprise customer, I have raised a new issue at Alfresco Support. I’ll update this post if I receive further information.

Responder

Introduce tus datos o haz clic en un icono para iniciar sesión:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión / Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión / Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión / Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión / Cambiar )

Conectando a %s