From paper to PDF with OCR text on Linux in your terminal (with Fujitsu ScanSnap)

Posted on Thu 03 January 2019 in how-to

Switching from a Mac with macOS to Linux can be tough, especially when it comes to scanning. Many years ago I had my first interactions with the SANE project, which is the solution for scanning under Linux.

Scanning with my Fujitsu ScanSnap has been quite comfortable with macOS and the vendor's software, but not quite FOSS, so it's time to migrate to proper utilities!

SANE setup for ScanSnap scanners

After installing sane-frontends and sane-backends via dnf it's time to install the drivers. This page provides download links. You can copy the file to /usr/share/sane/epjitsu/ with sudo after creating the directory.
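To double-check that the file ended up where the backend looks for it, a small shell sketch like the following may help. It assumes that the epjitsu backend is configured in /etc/sane.d/epjitsu.conf and that this file contains lines of the form `firmware /path/to/file.nal`. Both the config path and the helper name missing_firmware are assumptions on my side, so adjust them to your system.

```shell
#!/bin/sh
# Sketch: list firmware files that the epjitsu config references
# but that do not exist on disk.
# Assumption: the config uses lines like "firmware /some/path/file.nal".
missing_firmware() {
    conf=${1:-/etc/sane.d/epjitsu.conf}   # assumed default location
    grep -E '^[[:space:]]*firmware[[:space:]]+' "$conf" |
    while read -r keyword path; do
        # Print the path only if the referenced file is missing.
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}

# Example call (prints nothing if every referenced file is present):
#   missing_firmware
```

No output means every firmware file named in the config is in place. Any path that is printed is a file the backend would probably fail to load.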

The rest of the SANE setup is covered in various other blog posts, e.g. the one linked before, so I won't cover it in this post.

PDF and OCR - where to start?

As you may have discovered, it's tough to find current, working solutions for converting your scans to PDF and adding OCR text data. Some of the tutorials are simply outdated, others refer to graphical applications.

I've been searching for solutions myself and here are some notes:

* scanning under Linux is still a mess
* GUI applications are easier to find
* pdfocr looks simple, but is not a solution (as it requires a broken dependency called pdftk)
* tesseract is great!
* many projects that cover the issue partially are not maintained anymore
* as always, it's useful to verify last commit and release dates

After reading quite a few blog posts, scripts, commit histories and so on, I've decided that my solution must be built with tools I can install with dnf or pip3. That way security updates can be installed easily, and there is a good chance that the projects will be available in a later Fedora version as well.

Installing dependencies - using the script

Before using the script some dependencies should be installed:

dnf install tesseract tesseract-osd tesseract-langpack-deu ocrmypdf netpbm-utils ghostscript fish

If you're only scanning documents with English text, tesseract-langpack-deu is not required; in that case also change -l deu+eng to -l eng in the script. It can also be swapped for other languages, e.g. Russian with the package tesseract-langpack-rus.
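After the installation it can be worth checking that every command the script calls is actually on your PATH. The following is only a sketch in plain shell. The helper name missing_tools is made up for this example, and the list of commands is my reading of what the script uses.

```shell
#!/bin/sh
# Sketch: print every command from the argument list that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call with the commands the scan script relies on:
#   missing_tools scanadf tesseract ocrmypdf pnmtops ps2pdf fish
```

If the example call prints nothing, all commands were found. Any name it prints is a hint that the matching package is still missing or is named differently on your Fedora release.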

You can find the script from this link on GitHub or copy it from this page. After the download adjust the device id and give it a try.
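A note on the device id: the part after libusb is typically the USB bus and device number, so it can change when the scanner is unplugged and plugged in again. Instead of hard-coding it you could look it up each time. The sketch below assumes that scanimage supports the -f (formatted device list) option with %d for the device name and %n for a newline, and that your scanner is handled by the epjitsu backend. The function name first_epjitsu_device is hypothetical.

```shell
#!/bin/sh
# Sketch: read device names (one per line) on stdin and print the first
# one that belongs to the epjitsu backend.
first_epjitsu_device() {
    grep '^epjitsu:' | head -n 1
}

# Example call (needs a connected scanner):
#   scanimage -f '%d%n' | first_epjitsu_device
```

In the fish script the same pipeline could fill the variable directly, e.g. set -x device (scanimage -f '%d%n' | grep '^epjitsu:' | head -n 1), but treat that as an untested suggestion.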

#!/usr/bin/fish

# change this to your device id, see scanimage -L for a list of your devices
set -x device 'epjitsu:libusb:001:019'

# exit if no title is provided
if not set -q argv[1]
    echo "Please enter at least a title!"
    exit 1
end

# simply convert arguments to variables
if set -q argv[1]
    set -x title (echo $argv[1])
end

if set -q argv[2]
    set -x resolution (echo $argv[2])
else
    set -x resolution 300
end

if set -q argv[3]
    set -x mode $argv[3]
else
    set -x mode Gray
end

if set -q argv[4]
    set -x destinationdir $argv[4]
else
    set -x destinationdir '/home/rullmann/Downloads'
end


# create a uniquely named temporary dir (mktemp creates it for us)
set -x tempdir (mktemp -d /tmp/scan_XXXXXXXXXX)

# create output filename for final pdf by converting the title and adding the date
set -x outputfile $destinationdir/(date +%F)_(echo $title | sed -e 's/\(.*\)/\L\1/' -e 's/\ /_/g').pdf

# actually scan and process the input
scanadf -d $device --resolution $resolution --mode $mode -o $tempdir/%d ;and for file in (ls $tempdir/) ; pnmtops $tempdir/$file ; end | ps2pdf - | ocrmypdf -l deu+eng --rotate-pages --deskew - $outputfile --title "$title"

# remove temp dir
rm -r $tempdir
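
The file name logic can be tried out without a scanner. The following is a sketch in plain POSIX shell, not the fish code from the script. The helper output_name is hypothetical, and it uses tr instead of the GNU sed \L extension, so non-ASCII letters such as umlauts may not be lowercased.

```shell
#!/bin/sh
# Sketch: build "<date>_<title in lowercase with underscores>.pdf",
# mirroring how the script derives its output file name.
output_name() {
    title=$1
    # Lowercase the title and replace spaces with underscores.
    slug=$(printf '%s' "$title" | tr '[:upper:]' '[:lower:]' | tr ' ' '_')
    # date +%F prints the current date as YYYY-MM-DD.
    printf '%s_%s.pdf\n' "$(date +%F)" "$slug"
}

# Example call: prints the current date followed by _tax_return_2018.pdf
#   output_name 'Tax Return 2018'
```

The script itself takes its arguments in the order title, resolution, mode and destination directory. Assuming it is saved as scan.fish (a name I made up), a call could look like ./scan.fish "Tax Return 2018" 300 Gray ~/Documents. Only the title is required; the other three fall back to the defaults set in the script.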