comp.org.uk

Manage PDFs from the command line

You may commonly encounter files in Adobe PDF format, especially when exchanging files with Windows or Mac OS X computers. Less commonly, you may encounter files in PostScript format, or you might need to convert files to PostScript in order to print them. Linux has a rich set of tools for working with PDF and PostScript files, even if you’re working in the shell and can’t view the files graphically.

If you simply want to display PDF and PostScript files, you have a number of choices. The commands evince, okular, and gv (Ghostview) all display both types of files, and xpdf displays only PDFs. There’s also a full-featured but ancient “official” PDF viewer from Adobe, acroread, but it is no longer maintained and is relatively slow. All of these programs can be started from the command line. For more complex handling of PDF and PostScript files, read on.

pdftotext

pdftotext [options] [file.pdf [outfile.txt]]

The pdftotext command extracts text from a PDF file and writes it to a file. This works only if the PDF contains actual text, not images that look like text (say, a magazine article that’s been scanned on a graphical scanner).

pdftotext sample.pdf (Creates sample.txt)

Useful Options

-f N Begin with page N of the PDF file. You must have a space between the option and the number.
-l N End with page N of the PDF file. You must have a space between the option and the number.
-htmlmeta Generate HTML rather than plain text.
-eol (dos | mac | unix) Write end-of-line characters in the text file for the given operating system.
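
As a minimal sketch of how these options combine, the shell function below prints the text of a page range on standard output so it can be piped into other commands. The name pdf_pages is made up for this example, and the sketch assumes the poppler version of pdftotext, which treats an output file name of - as standard output.

```shell
# pdf_pages FILE FIRST LAST
# Print the text of pages FIRST..LAST of FILE on standard output.
# (pdf_pages is an illustrative name, not part of pdftotext.)
pdf_pages() {
    # "-" as the output file asks pdftotext to write to standard output
    pdftotext -f "$2" -l "$3" -eol unix "$1" -
}

# Example: search pages 2 to 4 of sample.pdf for a word
# pdf_pages sample.pdf 2 4 | grep -i linux
```

Writing to standard output avoids leaving temporary text files around when you only want to search or count the extracted text.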

ps2ascii

ps2ascii file.(ps|pdf) [outfile.txt]

The ps2ascii command extracts text from a PostScript file. It’s a simple command with no options. To extract text from sample.ps and place it into extracted.txt:

ps2ascii sample.ps extracted.txt

ps2ascii can also extract text from a PDF file, though you wouldn’t guess that from the command name.

ps2ascii sample.pdf extracted.txt
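
When no output file is named, ps2ascii writes the extracted text to standard output, which makes it easy to pipe. The following sketch counts the words it can extract from a file; a result of 0 suggests the file contains only images. The function name ps_words is hypothetical.

```shell
# ps_words FILE
# Count the words of text that ps2ascii can extract from FILE.
# (ps_words is an illustrative name.)
ps_words() {
    # With no output file named, ps2ascii writes the text to standard output;
    # tr removes the padding that some versions of wc add around the number
    ps2ascii "$1" | wc -w | tr -d ' '
}

# Example:
# ps_words sample.ps
```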

pdfseparate

pdfseparate [options] file.pdf pattern.pdf

The pdfseparate command splits a PDF file into separate PDF files, one per page. For example, if one.pdf is 10 pages long, then this command will create 10 PDF files named split1.pdf through split10.pdf, each containing one page:

pdfseparate one.pdf split%d.pdf

The final argument is a pattern for forming the names of the individual page files. The special notation %d stands for the extracted page number.

Useful Options

-f N Begin with page N of the PDF file. You must have a space between the option and the number.
-l N End with page N of the PDF file. You must have a space between the option and the number.
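
The options and the %d pattern can be combined to pull out only part of a document. This is a sketch under the assumption that the output files keep their original page numbers; the function name split_range and its PREFIX argument are invented for the example.

```shell
# split_range FILE FIRST LAST PREFIX
# Write pages FIRST..LAST of FILE to PREFIX<page>.pdf, one page per file.
# (split_range is an illustrative name.)
split_range() {
    # pdfseparate replaces %d in the pattern with each page number
    pdfseparate -f "$2" -l "$3" "$1" "$4%d.pdf"
}

# Example: write pages 3 to 5 of one.pdf as page3.pdf, page4.pdf, page5.pdf
# split_range one.pdf 3 5 page
```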

pdftk

pdftk [arguments]

pdftk is the “Swiss Army knife” of PDF commands. This versatile program can extract pages from a PDF file, join several PDFs into one, rotate pages, add watermarks, encrypt and decrypt files, and much more, all from the command line. This power comes with complicated syntax, unfortunately, but with a little effort you can learn a few useful tricks.

To join the files one.pdf and two.pdf into a single PDF file, combined.pdf:

pdftk one.pdf two.pdf cat output combined.pdf

To extract pages 5, 7, and 10–15 from the file one.pdf and write them to new.pdf:

pdftk one.pdf cat 5 7 10-15 output new.pdf

Extract the first five pages from one.pdf and the odd-numbered pages from two.pdf and combine them as combined.pdf:

pdftk A=one.pdf B=two.pdf cat A1-5 Bodd output combined.pdf

Copy the file one.pdf to new.pdf, but with page 7 rotated by 90 degrees clockwise (“east”):

pdftk one.pdf cat 1-6 7east 8-end output new.pdf

Interleave the pages of one.pdf and two.pdf, creating interleaved.pdf:

pdftk one.pdf two.pdf shuffle output interleaved.pdf

You may have noticed that the page selection criteria, typically appearing before the output keyword, are very powerful. They consist of one or more page ranges with qualifiers. A page range can be a single page like 5, a range like 5-10, or a reverse range like 10-5 (which will reverse the pages in the output). Qualifiers can remove pages from a range, like 1-100~20-25, which means “all pages from 1 to 100 except for pages 20 to 25.” They can also specify only odd pages or even pages, using the keywords odd or even, and rotations using the compass directions north, south, east, and west.

We’ve only scratched the surface of pdftk’s abilities. The manpage has many more examples and full syntax.
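
Because page ranges are ordinary command-line arguments, a script can build them. The sketch below prints the ranges that select every page except one, using only the plain range and end forms shown earlier. The helper name ranges_without is hypothetical, and it assumes a file of at least two pages and a page number within the file.

```shell
# ranges_without PAGE TOTAL
# Print pdftk page ranges that select every page of a TOTAL-page file
# except PAGE. Assumes TOTAL >= 2 and 1 <= PAGE <= TOTAL.
# (ranges_without is an illustrative name, not part of pdftk.)
ranges_without() {
    page=$1
    total=$2
    if [ "$page" -eq 1 ]; then
        echo "2-end"                                # drop the first page
    elif [ "$page" -eq "$total" ]; then
        echo "1-$((total - 1))"                     # drop the last page
    else
        echo "1-$((page - 1)) $((page + 1))-end"    # drop a page in the middle
    fi
}

# Example: copy a 10-page one.pdf to new.pdf without page 7.
# The substitution is left unquoted so each range becomes its own argument.
# pdftk one.pdf cat $(ranges_without 7 10) output new.pdf
```

For the example shown, the command that runs is pdftk one.pdf cat 1-6 8-end output new.pdf.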

pdf2ps and ps2pdf

pdf2ps [options] file.pdf [file.ps] 
ps2pdf [options] file.ps [file.pdf]

The pdf2ps command converts an Adobe PDF file into a PostScript file (if you don’t provide an output file name, the default is to use the input filename, with .pdf replaced by .ps):

pdf2ps sample.pdf converted.ps

The command has a couple of options but they are rarely used. See the manpage if you’re interested. To go in the opposite direction, converting a PostScript file to PDF format, use ps2pdf:

ps2pdf sample.ps converted.pdf
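
To convert many files at once, a short loop is usually enough. The following sketch converts every PostScript file in the current directory; the function names pdf_name and ps_to_pdf_all are invented for this example.

```shell
# pdf_name FILE.ps
# Print the PDF file name that corresponds to a PostScript file name.
# (pdf_name and ps_to_pdf_all are illustrative names.)
pdf_name() {
    # ${1%.ps} strips a trailing ".ps" from the argument
    printf '%s\n' "${1%.ps}.pdf"
}

# ps_to_pdf_all
# Convert every .ps file in the current directory to PDF.
ps_to_pdf_all() {
    for f in *.ps; do
        # If no .ps files exist, the pattern is left unexpanded; skip it
        [ -e "$f" ] || continue
        ps2pdf "$f" "$(pdf_name "$f")"
    done
}

# Example:
# ps_to_pdf_all
```

Naming the output file explicitly keeps the result predictable regardless of the tool's default naming.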

Published on Tue 03 March 2015 by Lindsey Corbyn in Linux with tag(s): pdf command line