Alfresco, counting more than 1000 elements

Many people need to count elements inside the repository. In a common repository, having more than 1,000 elements from the same type or aspect is a regular scenario.

In this blog post, several ways of counting elements in Alfresco repository are exposed.

Problem statement

How many nodes having businessDocument aspect are in the repository?

Let’s assume the following content model: a base aspect named ust:businessDocument and two inheriting aspects named ust:inboundDoc and ust:outboundDoc

<aspects>
    <aspect name="ust:businessDocument">
        <properties>
            <property name="ust:docDate">
                <type>d:datetime</type>
            </property>
        </properties>
    </aspect>
    <aspect name="ust:inboundDoc">
        <parent>ust:businessDocument</parent>
        <properties>
            <property name="ust:receivedDate">
                <type>d:datetime</type>
            </property>
        </properties>
    </aspect>
    <aspect name="ust:outboundDoc">
        <parent>ust:businessDocument</parent>
        <properties>
            <property name="ust:sentDate">
                <type>d:datetime</type>
            </property>
        </properties>
    </aspect>
</aspects>

So every content having any of these three aspects must be included.

For this sample, I’ve prepared a repository with 2,403 nodes including any of these aspects.

Using CMIS

A simple Groovy script can be developed by using a simple CMIS Query

import org.apache.chemistry.opencmis.commons.*
import org.apache.chemistry.opencmis.commons.data.*
import org.apache.chemistry.opencmis.commons.enums.*
import org.apache.chemistry.opencmis.client.api.*
import org.apache.chemistry.opencmis.client.util.*

String cql = "SELECT cmis:objectId FROM ust:businessDocument"

OperationContext opCon = session.createOperationContext();
opCon.setMaxItemsPerPage(1000000);

ItemIterable<QueryResult> results = session.query(cql, false, opCon)

println "--------------------------------------"
println "Total number: ${results.totalNumItems}"
println "Has more: ${results.hasMoreItems}"
println "--------------------------------------"


--------------------------------------
Total number: 1000
Has more: true
--------------------------------------

However, CMIS (and also FTS) can only retrieve 1,000 elements. You can play with paging and skipping, but there is no (simple) way to obtain more than 1,000.

Using Database

It’s not recommended to play with Alfresco Database, but it looks like this is the right chance to do it.

Let’s start with a simple query to see what happens.

SELECT count(1)
FROM alf_node AS n,
  alf_node_aspects AS a,
  alf_qname AS q,
  alf_namespace AS ns,
  alf_store AS s
WHERE a.qname_id = q.id
  AND a.node_id = n.id
  AND q.ns_id = ns.id
  AND n.store_id = s.id
  AND s.protocol = 'workspace'
  AND s.identifier = 'SpacesStore'
  AND ns.uri = 'http://www.ust-global.com/model/business/1.0'
  AND q.local_name in ('businessDocument');
 count
-------
   801
(1 row)

It looks like parent aspects are not related with the node, so we need to include every inherited aspect in the query.

SELECT count(1)
FROM alf_node AS n,
  alf_node_aspects AS a,
  alf_qname AS q,
  alf_namespace AS ns,
  alf_store AS s
WHERE a.qname_id = q.id
  AND a.node_id = n.id
  AND q.ns_id = ns.id
  AND n.store_id = s.id
  AND s.protocol = 'workspace'
  AND s.identifier = 'SpacesStore'
  AND ns.uri = 'http://www.ust-global.com/model/business/1.0'
  AND q.local_name in ('businessDocument', 'inboundDoc', 'outboundDoc');
 count
-------
  2403
(1 row)

So we have the number we were looking for, but we are scanning alf_node table to get it: database performance could be degraded!

Using SOLR

Skipping all Alfresco overload, we can use directly SOLR Engine to perform a query in alfresco core

https://localhost/solr/alfresco/afts?q=ASPECT:%22ust:businessDocument%22

<response>
	<lst name="responseHeader">
		<int name="status">0</int>
		<int name="QTime">6</int>
		<lst name="params">
			<str name="q">ASPECT:"ust:businessDocument"</str>
		</lst>
	</lst>
	<result name="response" numFound="2403" start="0">
		<doc>
			<str name="id">_DEFAULT_!8000000000000019!8000000000005279</str>
			<long name="_version_">0</long>
			<long name="DBID">21113</long>
		</doc>
	</result>
	<bool name="processedDenies">false</bool>
</response>

So we found that numFound 2,403 by using a simple query and without degrading Alfresco performance.

REST API

After talking a while with Younes Regaieg and Axel Faust at Alfresco IRC, I realised that there is also a way to provoke a SOLR query when using REST API invocation. When including a query that cannot be processed as TMQ, Alfresco is also counting the elements when using REST API.

This is why we are adding name:* to the query in the following code snippet.

[POST]

https://localhost/alfresco/api/-default-/public/search/versions/1/search

[Payload]

{
  "query": 
  {
     "language": "afts", 
     "query": "name:* AND ASPECT:\"ust:businessDocument\""
  }
}

Results are included in totalItems field in the response:

{
    "list": {
        "pagination": {
            "count": 100,
            "hasMoreItems": true,
            "totalItems": 2403,
            "skipCount": 0,
            "maxItems": 100
        },

The same thing can be obtained in a different way, but there is always an alternative that is better than the others. Some experimentation on test environments can save you more troubles in your real service!

Published by angelborroy

Understanding software.

10 thoughts on “Alfresco, counting more than 1000 elements

    1. Just to make it easy for the readers, Search API is exactly the same as CMIS: you can only count 1,000 elements.

      1. I don’t think that would be the case; Unless when your query goes through DB. A trick to force it to go to solr would be to add a wildcard like “cm:name:*”

      2. Thanks for your contribution. The blog post has been updated with your tip.

  1. Hey, Gethin, we made it again.

    Axel and Younes helped me to understand how to get that count by using the “new” REST API.

    Thanks to all!

  2. Angel:
    Creo que el script de groovy tiene algún ; de más al final de linea (en el OperationContext)

    Saludos.
    –C:

    1. I don’t remember about NodeService, but for SearchService you can use following pattern:

      SearchParameters sp = new SearchParameters();
      sp.setLanguage(SearchService.LANGUAGE_CMIS_ALFRESCO);
      sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
      sp.setMaxItems(Integer.MAX_VALUE);
      sp.setMaxPermissionChecks(Integer.MAX_VALUE);

      1. Hi Thanks for your response, is this way supports following things?

        1) total counts in response along with node ids
        2) filter by aspect name, file name, creation date
        3) sort by creation date
        4) pagination

        If yes, can you give example with it?

        Regards,
        Vishal Z.

  3. The problem with the CMIS example in this article is the query used.
    Alfresco could run CMIS query against either SOLR or Database depending on the query (i.e. transactional metadata queries), and “SELECT cmis:objectId FROM ust:businessDocument” will go against database. According to the documentation, if the query goes against database, Alfresco will not set totalNumItems properly because of the way it calculates ACL’s, except if you run it a user named “System” – in this case ACL’s calculation is skipped (yes, there is literally a check in Alfresco code userName.equals(“System”) :) ). Now, if you modify the query to run against SOLR by adding IN_TREE to it ( CONTAINS may also work, but not always), it will set totalNumItems properly, i.e.the following query
    SELECT cmis:objectId FROM ust:businessDocument where in_tree(“noderef of app:company_home”)
    (use whatever URL you need to get children under, append ” AND in_folder(“same url”)” to limit it to current folder and not wonder into hierarchy)
    will return the right value of totalNumItems.
    All of this is assuming you left the default value for search configuration “Use database if possible” unchanged.

    Aside from that, there is no need to retrieve more then 1 node in order to get count, i.e. you should be using
    opCon.setMaxItemsPerPage(1);

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s