kota's memex

Turns out losslessly removing a page from a PDF isn't as easy as you'd think. You might be inclined to reach for your OS's "print to file" feature, but you don't want that. It can strip out valuable text data and make the PDF harder to search, and depending on how the fonts are embedded it can also change the fonts.

I looked into a few tools for doing this. First was MuPDF's mutool, which has a "clean" mode that takes a range of pages and returns a cleaned-up PDF containing just that range. Simple enough, or so I thought. It wound up stripping out the PDF's index (the outline you can view in most PDF viewers). I don't know if that's intentional; I searched the documentation and bug reports and found nothing on it. So I looked into other tools. I found pdftk, which seemed very similar with its "cat" mode. It also stripped out the index, BUT it has a mode to dump the index to a file and another to apply that index to a PDF... so it works, but it's a bit of a hack.

Here's how to remove the last page from a 400-page document with pdftk:

pdftk original.pdf cat 1-399 output tmp.pdf
pdftk original.pdf dump_data output index.info
pdftk tmp.pdf update_info index.info output out.pdf
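
If you do this more than once, the three steps above can be wrapped in a small script so you don't have to look up the page count by hand. This is only a rough sketch: the function names (keep_range, page_count, drop_last_page) are made up for illustration, and it assumes the NumberOfPages line that pdftk prints in its dump_data output. Any outline entry that pointed at the removed page may end up dangling, so check the result in a viewer.

```shell
#!/bin/sh
# Sketch: drop the last page of a PDF with pdftk and carry the index over.
# All function names below are hypothetical helpers, not part of pdftk.

# Print the page range that keeps everything except the last page.
# $1 = total page count (a whole number, at least 2)
keep_range() {
    case "$1" in
        ''|*[!0-9]*) echo "page count must be a number" >&2; return 1 ;;
    esac
    if [ "$1" -lt 2 ]; then
        echo "need at least 2 pages" >&2
        return 1
    fi
    echo "1-$(($1 - 1))"
}

# Read the page count from the NumberOfPages line of pdftk's dump_data.
# $1 = pdf file
page_count() {
    pdftk "$1" dump_data | awk '/^NumberOfPages:/ {print $2}'
}

# Same three pdftk steps as above, using a temp dir for the scratch files.
# $1 = input pdf, $2 = output pdf
drop_last_page() {
    src=$1
    dst=$2
    range=$(keep_range "$(page_count "$src")") || return 1
    tmpdir=$(mktemp -d) || return 1
    pdftk "$src" cat "$range" output "$tmpdir/tmp.pdf" &&
        pdftk "$src" dump_data output "$tmpdir/index.info" &&
        pdftk "$tmpdir/tmp.pdf" update_info "$tmpdir/index.info" output "$dst"
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

Source the script and run something like: drop_last_page original.pdf out.pdf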