The following is a list of lesser-known command-line tools for common tasks when working with pdf files, in particular for preparing graphics for inclusion in LaTeX documents and for processing scanned documents.
Show information about raster images embedded in a pdf
The poppler package provides the pdfimages utility. The command
pdfimages -list <pdffile>
shows a summary of all embedded images, including their color space and compression algorithm. For scanned documents, this lets us check whether the images have been compressed optimally. The pdfimages utility can also extract all images.
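To check the compression of many scanned pages at once, the summary table can be parsed by a small script. The following is only a sketch: the column layout and the made-up two-row sample are assumptions based on the header printed by recent poppler releases, and the function name is hypothetical. An `enc` value of `image` is taken to mean that no image-specific codec (such as JPEG or JBIG2) is used.

```python
# Column names are an assumption, modelled on the header line that recent
# poppler versions print ("object ID" is split into object number and generation).
COLUMNS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
           "enc", "interp", "object", "gen", "x_ppi", "y_ppi", "size", "ratio"]


def parse_pdfimages_list(text):
    """Turn the table printed by `pdfimages -list` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and any line we do not understand.
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Illustrative sample only, not taken from a real document.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1700  2200  gray    1   8  jpeg   no         7  0   200   200  213K 5.8%
   2     1 image    1700  2200  gray    1   8  image  no        12  0   200   200 3652K  100%
"""

rows = parse_pdfimages_list(SAMPLE)
# Pages whose image is stored without an image-specific codec.
no_codec_pages = [row["page"] for row in rows if row["enc"] == "image"]
print(no_codec_pages)
```

In practice the text would come from running `pdfimages -list` on a real file, for instance through `subprocess.run(..., capture_output=True, text=True)`.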
Show information about page sizes
When the pages of a pdf have different (print) page sizes, the ImageMagick identify utility can list them all:
identify -verbose <pdffile> | grep 'Print size'
The result will be in inches. If you only want to know the size of the first page, the pdfinfo utility (part of the xpdf package) will display it (in PostScript points, each 1/72 of an inch).
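Because the two tools report different units, it can help to convert between them. The sketch below assumes that pdfinfo prints a line of the form `Page size: 612 x 792 pts (letter)` and that identify prints lines of the form `Print size: 8.5x11`. Both function names are hypothetical.

```python
import re

POINTS_PER_INCH = 72.0  # one PostScript point is 1/72 inch


def pdfinfo_page_size_inches(pdfinfo_output):
    """Return (width, height) in inches from pdfinfo's 'Page size' line, or None."""
    match = re.search(r"Page size:\s*([\d.]+) x ([\d.]+) pts", pdfinfo_output)
    if match is None:
        return None
    return tuple(float(value) / POINTS_PER_INCH for value in match.groups())


def print_sizes_points(identify_output):
    """Return a list of (width, height) in points, one per 'Print size' line."""
    sizes = []
    for width, height in re.findall(r"Print size:\s*([\d.]+)x([\d.]+)", identify_output):
        sizes.append((float(width) * POINTS_PER_INCH, float(height) * POINTS_PER_INCH))
    return sizes


print(pdfinfo_page_size_inches("Page size:      612 x 792 pts (letter)"))
print(print_sizes_points("  Print size: 8.5x11\n  Print size: 11x8.5\n"))
```

The second call illustrates a document that mixes portrait and landscape letter pages.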
Extract font information
The pdfminer Python package includes the pdf2txt.py utility. The command
pdf2txt.py -p <pagenumber> -t xml <pdffile>
shows the font and font size of all text on the selected page of the pdf. This is especially useful if you need to create a figure whose text exactly matches that of an existing pdf. For example, when calling pdf2txt.py on a pdf created with Beamer, we may find the following for a letter of the main text:
<text font="KJEDUA+CMSS12" bbox="120.915,193.920,126.118,205.983" size="12.063">e</text>
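Such an element can be read with Python's standard XML parser. The following sketch (the function name is hypothetical) strips the six-letter subset tag that PDF prepends to embedded font subsets and compares the reported size with the height of the bounding box. For the element above this yields the font CMSS12, that is, Computer Modern Sans Serif at a 12 pt design size.

```python
import re
import xml.etree.ElementTree as ET


def describe_text_element(xml_fragment):
    """Extract font name, size, and bounding-box height from one <text> element."""
    elem = ET.fromstring(xml_fragment)
    # Embedded font subsets carry a prefix of six uppercase letters and a '+'.
    font = re.sub(r"^[A-Z]{6}\+", "", elem.get("font"))
    x0, y0, x1, y1 = (float(value) for value in elem.get("bbox").split(","))
    return {
        "char": elem.text,
        "font": font,
        "size": float(elem.get("size")),
        "bbox_height": y1 - y0,
    }


info = describe_text_element(
    '<text font="KJEDUA+CMSS12" '
    'bbox="120.915,193.920,126.118,205.983" size="12.063">e</text>'
)
print(info["font"], info["size"], round(info["bbox_height"], 3))
```

In this example the size attribute coincides with the bounding-box height, so a figure meant to match the slide text would use the sans-serif Computer Modern font at roughly 12 pt.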
ImageMagick contains the convert utility to convert between various raster image formats, including PDF. The command
convert -compress Zip -density 200 <inpdf> <outpdf>
converts a (vector) pdf into a Zip-compressed raster pdf at 200 DPI. The Goodnotes app exports handwritten notes to PDF in a pure vector format that is rather large. Rasterizing it as above shrinks the file significantly (but also discards the OCR layer). The resulting file can be compressed further with the “Optimize Scanned PDF” feature in Adobe Acrobat.
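To choose a density, it can help to estimate the pixel dimensions a page will have after rasterization. The arithmetic is sketched below with hypothetical function names. It assumes the page size is known in points (for example from pdfinfo) and that the rasterizer simply scales by the density. The exact rounding ImageMagick applies may differ by a pixel.

```python
def raster_dimensions(width_pts, height_pts, dpi):
    """Pixel dimensions of a page of the given size (in points) at the given density."""
    # 72 points per inch, dpi pixels per inch.
    return (round(width_pts / 72 * dpi), round(height_pts / 72 * dpi))


def uncompressed_bytes(width_px, height_px, channels=3, bits_per_channel=8):
    """Size of the raw pixel data before any compression is applied."""
    return width_px * height_px * channels * bits_per_channel // 8


width_px, height_px = raster_dimensions(612, 792, 200)  # US letter at 200 DPI
print(width_px, height_px)
print(uncompressed_bytes(width_px, height_px))
```

A letter page at 200 DPI thus holds about 11 MB of raw RGB data, which is why the choice of `-compress` matters for the size of the output file.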