Alfresco, installing OCR as an external service

Alfresco Simple OCR Action has become a popular way to provide an OCR service to Alfresco Community servers running Linux or Windows.

In many use cases the addon's configuration guide is enough, but intensive use of the OCR service calls for a more complex deployment. The configuration described below installs the OCR service on an external server, so that Alfresco capacity is maintained regardless of how many OCR operations are running.

In this sample, Alfresco uses SSH to communicate with an OCR server running pdfsandwich on CentOS 7, but any other protocol, OS or OCR software can be used.

Configuring OCR server

Software required:

  • CentOS Linux release 7.3.1611 (Core)
  • pdfsandwich 0.1.4

Ports required:

  • 22 (SSH): OCR service

The pdfsandwich program builds searchable PDFs from TIFF files or from PDFs that contain only images. Alfresco invokes it from the command line through the Alfresco Simple OCR addon.

Installing pdfsandwich

Installing required dependencies.

$ yum -y install wget gcc gcc-c++ make autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel ocaml ImageMagick ImageMagick-devel

Installing leptonica from source code.

$ wget http://www.leptonica.org/source/leptonica-1.72.tar.gz
$ tar xvf leptonica-1.72.tar.gz
$ cd leptonica-1.72
$ ./configure
$ make
$ make install

Installing tesseract OCR from source code.

$ wget http://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
$ tar xvf 3.04.01.tar.gz
$ cd tesseract-3.04.01
$ ./autogen.sh
$ ./configure
$ make
$ make install
$ ldconfig

Installing every language package for tesseract.

$ wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
$ tar xvf 3.04.00.tar.gz
$ mv tessdata-3.04.00/* /usr/local/share/tessdata/
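
To confirm that the language files landed in the directory used by the mv command above, a small helper along these lines can list the installed language codes. This is only a sketch: the function name list_tessdata_langs is hypothetical, and it simply derives the codes from the *.traineddata file names.

```shell
#!/bin/bash
# Hypothetical helper: print the language codes found in a tessdata directory,
# one per line, based on the *.traineddata file names.
list_tessdata_langs() {
    local dir="${1:-/usr/local/share/tessdata}"
    local f
    for f in "$dir"/*.traineddata; do
        # Skip the unexpanded pattern when no file matches
        [ -e "$f" ] || continue
        basename "$f" .traineddata
    done
}
```

Calling list_tessdata_langs with no argument inspects /usr/local/share/tessdata. Running tesseract --list-langs should report the same set of languages.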

Installing unpaper by using RPM.

$ wget http://dl.fedoraproject.org/pub/epel/6/x86_64/unpaper-0.3-4.el6.x86_64.rpm
$ rpm -ivh unpaper-0.3-4.el6.x86_64.rpm

Installing pdfsandwich from source code.

$ wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.1.4/pdfsandwich-0.1.4.tar.bz2
$ tar xvf pdfsandwich-0.1.4.tar.bz2
$ cd pdfsandwich-0.1.4
$ ./configure
$ make
$ make install

Verifying the software has been installed properly.

$ pdfsandwich -version
pdfsandwich version 0.1.4
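
Since pdfsandwich calls the other programs installed above, it can also help to check that all of them are reachable on the PATH. The following is a minimal sketch; check_tools is a hypothetical name, and the list of tools to pass is an assumption based on the packages installed in this guide (convert is provided by ImageMagick).

```shell
#!/bin/bash
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this installation it could be invoked as: check_tools tesseract unpaper convert pdfsandwich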

Configuring Alfresco server

Once Alfresco is installed and Alfresco Simple OCR is available, create a script on the Alfresco server that invokes the remote OCR server. The script must be executable by the user running Alfresco.


#!/bin/bash

# pdfsandwich hostname
PDFSANDWICH_SERVER="alfresco-ocr.keensoft.es"

# extract filenames ($3 is the input file, $5 is the output file)
INPUT=$(basename "$3")
OUTPUT=$(basename "$5")

# SSH parameters
SCP=scp
SSH=ssh

# copy original pdf to pdfsandwich server
$SCP "$3" "root@$PDFSANDWICH_SERVER:/tmp/$INPUT"

# execute pdfsandwich program (requires administrator privileges)
# $1, $2, $4 and $6 are passed through unchanged to pdfsandwich
$SSH "root@$PDFSANDWICH_SERVER" "pdfsandwich $1 $2 /tmp/$INPUT $4 /tmp/$OUTPUT $6"

# copy transformed pdf back to alfresco server
$SCP "root@$PDFSANDWICH_SERVER:/tmp/$OUTPUT" "$5"

# remove temporary files
$SSH "root@$PDFSANDWICH_SERVER" "rm -f /tmp/$INPUT /tmp/$OUTPUT"
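
The script above builds the remote command by plain string interpolation and uses the file's base name under /tmp, so a name containing spaces, or two simultaneous requests for files with the same base name, could misbehave. The sketch below shows one possible way to harden those two points. The helper names shell_quote, build_remote_cmd and remote_name are hypothetical, and using the local process ID as a uniqueness prefix is an assumption that is only good enough when all requests come from a single Alfresco host.

```shell
#!/bin/bash
# Hypothetical helper: quote one argument so that a remote POSIX shell
# reads it back as a single word (embedded single quotes are escaped).
shell_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Hypothetical helper: build the remote pdfsandwich command line,
# quoting every argument individually.
build_remote_cmd() {
    local cmd="pdfsandwich" arg
    for arg in "$@"; do
        cmd="$cmd $(shell_quote "$arg")"
    done
    printf '%s\n' "$cmd"
}

# Hypothetical helper: derive a remote temporary path that includes the PID
# of the local script, so concurrent runs do not overwrite each other.
remote_name() {
    printf '/tmp/ocr-%s-%s\n' "$$" "$(basename "$1")"
}
```

Inside the script, the remote call could then look like: $SSH "root@$PDFSANDWICH_SERVER" "$(build_remote_cmd "$1" "$2" "$(remote_name "$3")" "$4" "$(remote_name "$5")" "$6")", with the scp and rm steps using the same remote_name values.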

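An RSA key pair is required so that both servers can communicate over SSH without user interaction. Generate it on the Alfresco server as the user that runs Alfresco (root in this sample), leave the passphrase empty, and copy the public key to the OCR server.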


$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
$ ssh-copy-id -i ~/.ssh/id_rsa.pub alfresco-ocr.keensoft.es
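
Before wiring the script into Alfresco, it may be worth checking that the login really works without any prompt. A minimal sketch, assuming the standard OpenSSH client; the function name check_passwordless_ssh is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if key-based login works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
check_passwordless_ssh() {
    local target="$1"
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true
}
```

For this sample it could be run as: check_passwordless_ssh root@alfresco-ocr.keensoft.es && echo "SSH key login OK"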

Update alfresco-global.properties so that the addon invokes the script above instead of calling pdfsandwich directly.


ocr.command=/opt/alfresco/scripts/ocr.sh

Restart Alfresco to apply the configuration.

$ systemctl restart alfresco

Final words

Once the system is up and running, Alfresco sends OCR tasks to the external OCR server, so OCR load no longer degrades Alfresco's quality of service.
Even increasing the capacity of the OCR service through the async thread pool defined in alfresco-global.properties will not impact the Alfresco server.


# Default Async Action Thread Pool
default.async.action.threadPriority=1
default.async.action.corePoolSize=8
default.async.action.maximumPoolSize=20
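
When sizing the external OCR server, it can be handy to read these values back from the properties file. If the OCR rule runs asynchronously, the maximum pool size gives a rough upper bound on simultaneous OCR requests; that reading is an assumption, not something the addon documents here. The helper name get_prop is hypothetical, and it only handles simple key=value lines.

```shell
#!/bin/bash
# Hypothetical helper: print the value of a key from a Java-style properties file.
# Only simple "key=value" lines are handled; the last matching line wins.
get_prop() {
    local file="$1" key="$2"
    awk -F= -v k="$key" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim spaces around the key
            if (name == k) {
                sub(/^[^=]*=/, "")              # drop everything up to the first "="
                val = $0
                found = 1
            }
        }
        END { if (found) print val }
    ' "$file"
}
```

For example, get_prop /path/to/alfresco-global.properties default.async.action.maximumPoolSize would print the configured maximum, where the file path is a placeholder for the actual location in the installation.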

In the end, Alfresco's quality of service depends less on what software is installed than on how that software is installed.

4 comments on “Alfresco, installing OCR as an external service”

  1. Absolutely awesome solution, Angel! Many thanks for sharing this, that’s highly appreciated! :-)
    However, I needed to make some modifications to use ocrmypdf, which gives me much better results than pdfsandwich. I have also tuned the script so that it does not overload the OCR machine/VM when a user puts dozens of documents in a folder with an OCR rule.
    I’d like to share this and have described what I did and my scripts in this github comment: https://github.com/keensoft/alfresco-simple-ocr/issues/13#issuecomment-304536817

  2. That’s a great solution you provide here, Angel!!
    I’ve tried to use it with a Windows installation of Alfresco. I installed pdfsandwich on a Linux server and wrote a bat file that sends the file and calls pdfsandwich, but I always get a read timeout error (408) in Tomcat’s log; it seems that this error occurs during the execution of the copy-node webscript. Are there any known issues with Alfresco on Windows?

    • It should work… You are probably using the Windows parameters; use the Linux ones instead in alfresco-global.properties for the addon.