I've never been much of a sysadmin, despite being my own webmaster for decades.
For a sheet music library sideproject I've been on, I've wanted to do three things with PDF sheet music files: split (for when the user uploaded a single file with different pages for all instruments), merge (to put them back when a page might overrun it) and extract contents as text (to try and take a guess what instrument the page is for)
The first thing I found was PDFtk Server. To download it, I had to figure out which linux version I needed; Red Hat or CentOS, Linux 5 or 6, 32-bit or 64-bit. The first two were answered with
cat /etc/redhat-release
(CentOS 6) and the last with
uname -m
(64 bit unsurprisingly)
At that point I could follow the install instructions using a lot of "yum".
The man pages for pdftk indicate the syntax is a bit un-Unix-y. The call splitting "burst", and the syntax is
pdftk BIGFILE.PDF burst
which by default makes pg_0001.pdf etc.
Merging is called concatenation but that's sort of the default behavior for pdftk, so combining page 1 and 2 for example would be
pdftk pg_0001.pdf pg_0002.pdf output COMBINEDFILE.PDF
You can try using a regex to extract text data, but I wanted a proper command. I turned to
pdftotext which is part of poppler-utils -
yum install poppler-utils
That gave me pdftotext and I could do
pdftotext WHATEVERFILE.PDF
which would make WHATEVERFILE.txt, or I could do
pdftotext WHATEVERFILE.txt -
which would dump to STDOUT.
So, one weird thing is I thought poppler-utils was supposed to come with pdfunite and pdfseparate which mean I could have skipped the pdftk stuff, but for some reason those weren't included, and since I already could do what I needed with pdftk, I decided to give it a miss.
No comments:
Post a Comment