kota's memex

remove a page

Turns out losslessly removing a pdf page isn't as easy as you'd think. You don't want to use your OS's "print to file" feature, as you might originally be inclined to, because that can remove valuable text data and make the pdf harder to search. Depending on how the fonts are embedded, it can also result in the fonts changing.

I looked into a few tools for doing this. First was mupdf's mutool, which has a "clean" mode that takes a range of pages and returns a cleaned-up pdf containing just that page range. Simple enough, or so I thought. It actually wound up stripping out the pdf's index (the outline thing you can view in most pdf viewers). I don't know if that's intentional, but I searched the documentation and bug reports and found nothing on it. So I looked into other tools. I found pdftk, which seemed very similar with its "cat" mode. It also stripped out the index, BUT it has a mode to dump the index to a file and another to apply said index to a pdf, meaning it works, but is a bit of a hack.

Here's how to remove the last page from a 400-page document, for example.

pdftk original.pdf cat 1-399 output tmp.pdf
pdftk original.pdf dump_data output index.info
pdftk tmp.pdf update_info index.info output out.pdf
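
If the page you want gone isn't the last one, the "cat" step needs two ranges instead of one. Here's a rough sketch of a helper that prints them. keep_ranges is a name I made up (it's not part of pdftk), and the usage lines in the comments assume pdftk's dump_data output contains a "NumberOfPages: N" line.

```shell
#!/bin/bash
# keep_ranges TOTAL DROP
# Print the page ranges that keep every page of a TOTAL-page pdf except
# page DROP. Hypothetical helper, not something pdftk ships with.
keep_ranges() {
  local total=$1 drop=$2 out=""
  if (( drop < 1 || drop > total )); then
    echo "page $drop is outside 1-$total" >&2
    return 1
  fi
  # pages before the dropped one, if any
  if (( drop > 1 )); then
    out="1-$(( drop - 1 ))"
  fi
  # pages after the dropped one, if any
  if (( drop < total )); then
    out="${out:+$out }$(( drop + 1 ))-$total"
  fi
  printf '%s\n' "$out"
}

# Usage sketch (needs pdftk installed, so it is left commented out):
# total=$(pdftk original.pdf dump_data | awk '/^NumberOfPages:/ {print $2}')
# pdftk original.pdf cat $(keep_ranges "$total" 5) output tmp.pdf
```

The unquoted $(keep_ranges ...) in the usage line is on purpose: the two ranges need to reach pdftk as separate arguments. The dump_data and update_info steps stay the same as above.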

combine lots of pdfs

First combine the PDFs using ghostscript, then use pdftk to generate and apply the index pages (if you care about that).

#!/bin/bash

shopt -s globstar
# collect every pdf under the current directory for the loop below
pdfs=( **/*.pdf )
# generate big combined pdf with ghostscript
fd -e pdf -0 \
	| xargs -0 gs -dNOPAUSE -sDEVICE=pdfwrite \
	-sOUTPUTFILE=combine.pdf -dBATCH
shopt -u globstar
# create a tree of .info files for each pdf under /tmp/
for f in "${pdfs[@]}"; do
  dir=$( dirname "$f" )
  name=$( basename -s ".pdf" "$f" )
  mkdir -p /tmp/"$dir"
  pdftk "$f" dump_data output /tmp/"$dir"/"$name".info
done
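
The script above only gets as far as dumping the per-file .info files. To apply them to combine.pdf, the bookmark page numbers have to be shifted by however many pages come before each file in the combined pdf. Here's a minimal sketch of that shifting step. shift_bookmarks is a made-up helper name, and the commented usage assumes the .info files are processed in the same order ghostscript combined the pdfs, which the script above doesn't guarantee.

```shell
#!/bin/bash
# shift_bookmarks OFFSET
# Read pdftk dump_data text on stdin and print only the bookmark records,
# with every BookmarkPageNumber moved forward by OFFSET pages.
# Hypothetical helper, not part of pdftk.
shift_bookmarks() {
  awk -v off="$1" '
    /^BookmarkPageNumber:/ { print "BookmarkPageNumber: " ($2 + off); next }
    /^Bookmark/            { print }
  '
}

# Usage sketch (needs pdftk and the .info tree, so it is left commented out):
# offset=0
# for info in /tmp/a.info /tmp/b.info; do   # same order as in combine.pdf
#   shift_bookmarks "$offset" < "$info" >> /tmp/combined.info
#   pages=$(awk '/^NumberOfPages:/ {print $2}' "$info")
#   offset=$(( offset + pages ))
# done
# pdftk combine.pdf update_info /tmp/combined.info output out.pdf
```

Only the lines starting with "Bookmark" are kept, so the per-file metadata and page count don't get written into the combined index.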