Dated Revised
Category Tech
Tags bash · latex · pdf

The following is a list of lesser-known command line tools for common tasks when working with pdf files, in particular preparing graphics for inclusion in latex documents and processing scanned documents.

  • Show information about raster images embedded in pdf

    The poppler package provides the pdfimages utility.

      pdfimages -list <pdffile>

    shows a summary of all embedded images, including their color space and compression algorithm. For scanned documents, this lets us check whether they have been compressed optimally. The pdfimages utility can also extract all images.
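
    To check the compression of a long scan at a glance, the listing can be condensed into a count per encoding. The following is a minimal sketch, not part of poppler: the function name summarize_encodings is made up for illustration, and it assumes the column layout printed by recent poppler versions, where two header lines are followed by one row per image with the encoding ("enc") in the ninth column.

    ```shell
    # summarize_encodings: read `pdfimages -list` output on stdin and
    # print the number of embedded images per encoding.
    summarize_encodings() {
        # Skip the two header lines; column 9 is assumed to be "enc".
        awk 'NR > 2 && NF >= 9 { count[$9]++ }
             END { for (enc in count) print enc, count[enc] }' | sort
    }
    ```

    It would be used as

      pdfimages -list <pdffile> | summarize_encodings

    A scan whose pages all show up under the generic "image" encoding, rather than e.g. jbig2, ccitt, or jpeg, is a likely candidate for recompression.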

  • Show information about page sizes

    When the pages in a pdf have different (print) page sizes, the Imagemagick identify utility can list them:

      identify -verbose <pdffile>  | grep 'Print size'

    The result will be in inches. If you only want to know the size of the first page, the pdfinfo utility (part of the xpdf package, and also shipped with poppler) will display it in postscript points (1 point = 1/72 inch).
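
    Since the two tools report different units, a small filter can bring the pdfinfo output into inches as well. This is only a sketch: the function name page_sizes_in_inches is hypothetical, and it assumes that pdfinfo prints lines of the form "Page size: 612 x 792 pts (letter)", or "Page    2 size: ..." when a page range is requested with the -f and -l options.

    ```shell
    # page_sizes_in_inches: read pdfinfo output on stdin and print
    # every reported page size converted from points to inches.
    page_sizes_in_inches() {
        awk '/^Page.*size:/ {
                 sub(/.*size:[ \t]*/, "")   # keep "<width> x <height> pts ..."
                 printf "%.2f x %.2f in\n", $1 / 72, $3 / 72
             }'
    }
    ```

    For example, to convert the sizes of the first three pages:

      pdfinfo -f 1 -l 3 <pdffile> | page_sizes_in_inches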

  • Extract font information

    The pdfminer Python package includes the pdf2txt.py utility.

      pdf2txt.py -p <pagenumber> -t xml <pdffile>

    shows the font and font size for all text on the given page of the pdf. This is especially useful if you need to create a figure that exactly matches the text of some existing pdf. For example, when calling pdf2txt.py on a pdf created with Beamer, we may find the following for some letter of the main text:

      <text font="KJEDUA+CMSS12" bbox="120.915,193.920,126.118,205.983" size="12.063">e</text>

    This indicates that the Computer-Modern-Sans-Serif-12 font is used, and by installing the OTF version of the LaTeX fonts on your system, you can create a matching figure e.g. in OmniGraffle.
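
    The xml output contains one such element per character, so it helps to reduce it to the distinct font and size combinations. The following sketch assumes the element layout shown above (a font attribute followed later by a size attribute); the function name list_fonts is made up for illustration. It also strips the six-letter prefix before the "+", which only marks an embedded font subset.

    ```shell
    # list_fonts: read the xml output of pdfminer on stdin and print
    # each distinct "<font name> <size>" pair once.
    list_fonts() {
        grep -o '<text font="[^"]*"[^>]*size="[^"]*"' |
            sed -E 's/.*font="([^"]*)".*size="([^"]*)".*/\1 \2/' |
            sed -E 's/^[A-Z]{6}\+//' |
            sort -u
    }
    ```

    Piping the xml output for a page through list_fonts then gives a short list such as "CMSS12 12.063" instead of thousands of lines.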

  • Rastering PDFs

    Imagemagick contains the convert utility, which converts between various image formats and can also read and write PDF.

      convert -compress Zip -density 200 <inpdf> <outpdf>

    converts a (vector) pdf into a 200 DPI raster pdf with Zip compression. The Goodnotes app exports handwritten notes to PDF in a pure vector format that is rather large. Rastering it as above shrinks the file significantly (but also discards the OCR layer). The resulting file can be compressed further with the "Optimize Scanned PDF" feature in Adobe Acrobat.
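
    To see how much a given resolution actually saves, the command can be wrapped in a small function that reports the file size before and after. This is a sketch under assumptions: the name rasterize_pdf and the optional third argument for the resolution are hypothetical, and on Imagemagick 7 installations the command may be called magick instead of convert.

    ```shell
    # rasterize_pdf <inpdf> <outpdf> [dpi]
    # Rasterize a pdf with Zip compression (default 200 DPI) and print
    # the input and output file sizes in bytes.
    rasterize_pdf() {
        in_pdf="$1"; out_pdf="$2"; dpi="${3:-200}"
        convert -compress Zip -density "$dpi" "$in_pdf" "$out_pdf" || return 1
        # Arithmetic expansion drops the padding some wc versions print.
        in_size=$(( $(wc -c < "$in_pdf") ))
        out_size=$(( $(wc -c < "$out_pdf") ))
        echo "$in_size -> $out_size bytes"
    }
    ```

    Calling it with a few different resolutions, e.g. 150, 200, and 300, shows quickly where the trade-off between legibility and file size lies for a particular document.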