Tuesday, August 7, 2018

manipulating pdf files (and playing jr. sysadmin)

I've never been much of a sysadmin, despite being my own webmaster for decades.

For a sheet music library sideproject I've been on, I've wanted to do three things with PDF sheet music files: split (for when the user uploaded a single file with different pages for all instruments), merge (to put them back when a page might overrun it) and extract contents as text (to try and take a guess what instrument the page is for)

The first thing I found was PDFtk Server. To download it, I had to figure out which linux version I needed; Red Hat or CentOS, Linux 5 or 6, 32-bit or 64-bit. The first two were answered with
cat /etc/redhat-release
(CentOS 6) and the last with
uname -m
(64 bit unsurprisingly)

At that point I could follow the install instructions using a lot of "yum".

The man pages for pdftk indicate the syntax is a bit un-Unix-y. The call splitting "burst", and the syntax is
pdftk BIGFILE.PDF burst
which by default makes pg_0001.pdf etc.

Merging is called concatenation but that's sort of the default behavior for pdftk, so combining page 1 and 2 for example would be
pdftk pg_0001.pdf pg_0002.pdf output COMBINEDFILE.PDF

You can try using a regex to extract text data, but I wanted a proper command. I turned to
pdftotext which is part of poppler-utils -
yum install poppler-utils

That gave me pdftotext and I could do
pdftotext WHATEVERFILE.PDF
which would make WHATEVERFILE.txt, or I could do
pdftotext WHATEVERFILE.txt  -
which would dump to STDOUT.

So, one weird thing is I thought poppler-utils was supposed to come with pdfunite and pdfseparate which mean I could have skipped the pdftk stuff, but for some reason those weren't included, and since I already could do what I needed with pdftk, I decided to give it a miss.

No comments:

Post a Comment