Alfresco, installing OCR as an external service

Alfresco Simple OCR Action has become a popular way to provide an OCR service to Alfresco Community servers running Linux or Windows.

In many use cases the configuration guide is enough, but scenarios with intensive use of the OCR service require a more complex deployment. The configuration described below installs the OCR service on an external server, which preserves Alfresco capacity regardless of how many OCR operations are running.

In this sample Alfresco uses SSH to communicate with an OCR server running pdfsandwich on CentOS 7, but any other protocol, OS or OCR software can be used.

Configuring OCR server

Software required:

  • CentOS Linux release 7.3.1611 (Core)
  • pdfsandwich 0.1.4

Ports required:

  • 22 (SSH): OCR service

The pdfsandwich program builds searchable PDFs from PDFs containing images or TIFF files. Alfresco invokes it from the command line through the Alfresco Simple OCR addon.

Installing pdfsandwich

Installing required dependencies.

$ yum -y install wget gcc gcc-c++ make autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel ocaml ImageMagick ImageMagick-devel

Installing leptonica from source code.

$ wget
$ tar xvf leptonica-1.72.tar.gz
$ cd leptonica-1.72
$ ./configure
$ make
$ make install

Installing tesseract OCR from source code.

$ wget
$ tar xvf 3.04.01.tar.gz
$ cd tesseract-3.04.01
$ ./
$ ./configure
$ make
$ make install
$ ldconfig

Installing every language package for tesseract.

$ wget
$ tar xvf 3.04.00.tar.gz
$ mv tessdata-3.04.00/* /usr/local/share/tessdata/
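
After moving the language data, it may help to confirm that the languages you plan to use actually landed in the tessdata directory. The following is a minimal sketch, not part of the original procedure: `missing_languages` is a hypothetical helper, and the language codes in the example call are only placeholders.

```shell
#!/bin/bash
# Hypothetical helper: prints each language code whose .traineddata file
# is missing from the given tessdata directory.
missing_languages() {
    local dir="$1" lang
    shift
    for lang in "$@"; do
        [ -f "$dir/$lang.traineddata" ] || echo "$lang"
    done
}

# Example: check a few languages in the directory used above
# (eng, spa and fra are placeholders; use the codes you need)
missing_languages /usr/local/share/tessdata eng spa fra
```

An empty output means every requested language file is present.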

Installing unpaper by using RPM.

$ wget
$ rpm -ivh unpaper-0.3-4.el6.x86_64.rpm

Installing pdfsandwich from source code.

$ wget
$ tar xvf pdfsandwich-0.1.4.tar.bz2
$ cd pdfsandwich-0.1.4
$ ./configure
$ make
$ make install

Verifying the software has been installed properly.

$ pdfsandwich -version
pdfsandwich version 0.1.4
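
Beyond the version check, a quick way to spot an incomplete installation is to look for the supporting programs on the PATH. This is a small sketch under assumptions: `missing_commands` is a hypothetical helper, and the list of tools in the example call is inferred from the packages installed above, not taken from the pdfsandwich documentation.

```shell
#!/bin/bash
# Hypothetical helper: prints every command name that cannot be found on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: tools installed in the previous steps
# ("convert" is the ImageMagick command-line program)
missing_commands pdfsandwich tesseract unpaper convert
```

If nothing is printed, all listed programs were found.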

Configuring Alfresco server

Once Alfresco is installed and Alfresco Simple OCR is available, a script is created to invoke the remote OCR server.


# pdfsandwich hostname

# extract filenames
INPUT=$(basename "$3")
OUTPUT=$(basename "$5")

# SSH parameters

# copy original pdf to pdfsandwich server

# execute pdfsandwich program (requires administrator privileges)
$SSH root@$PDFSANDWICH_SERVER "pdfsandwich $1 $2 /tmp/$INPUT $4 /tmp/$OUTPUT $6"

# copy transformed pdf back to alfresco server

# remove temporary files
$SSH root@$PDFSANDWICH_SERVER "rm -f /tmp/$INPUT; rm -f /tmp/$OUTPUT"
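
To make the argument handling of the script easier to follow, the mapping from its six positional parameters to the remote command can be sketched as a standalone function. This is only an illustration: `build_remote_command` is a hypothetical helper, and the flags and paths in the example call are placeholders, not the values the addon actually passes.

```shell
#!/bin/bash
# Hypothetical helper: prints the command line that would run on the OCR server.
# Parameters 3 and 5 are local file paths; only their file names are kept and
# re-rooted under /tmp, as in the script above. The rest pass through unchanged.
build_remote_command() {
    local input output
    input=$(basename "$3")
    output=$(basename "$5")
    printf 'pdfsandwich %s %s /tmp/%s %s /tmp/%s %s\n' \
        "$1" "$2" "$input" "$4" "$output" "$6"
}

# Example call with made-up arguments (placeholders for illustration only)
build_remote_command -lang eng /opt/alfresco/tmp/in.pdf -o /opt/alfresco/tmp/out.pdf -verbose
```

Because only the base names are used on the remote side, two concurrent jobs with identically named files would collide in /tmp, which is worth keeping in mind when raising concurrency.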

An RSA key is required so that both servers can communicate over SSH without user interaction.

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/
$ ssh-copy-id -i ~/.ssh/

The Alfresco Simple OCR configuration has to be updated to use the above script when invoking pdfsandwich.
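
Before restarting Alfresco, it can be worth confirming that key-based login to the OCR server really works without a prompt. The sketch below is an assumption-laden example: `check_passwordless_ssh` is a hypothetical helper and `ocr.example.com` is a placeholder hostname. It relies on the standard OpenSSH options `BatchMode` and `ConnectTimeout`.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if ssh can log in without prompting.
check_passwordless_ssh() {
    local target="$1"
    # BatchMode=yes makes ssh fail instead of asking for a password
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true; then
        echo "OK: key-based login to $target works"
    else
        echo "FAIL: $target asks for a password or is unreachable" >&2
        return 1
    fi
}

# Usage (placeholder hostname, run as the user that executes the OCR script):
#   check_passwordless_ssh root@ocr.example.com
```

Run it as the same operating system user that runs Alfresco, since the key pair belongs to that user's home directory.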


Alfresco is restarted to apply the configuration.

$ systemctl restart alfresco

Final words

Once the system is up and running, Alfresco sends OCR tasks to the external OCR server, which preserves Alfresco's quality of service.
Even increasing the capacity of the OCR service through the async action thread pool settings will not impact the Alfresco server.

# Default Async Action Thread Pool

In the end, Alfresco's quality of service depends not on which software is installed but on how that software is installed.

