The Zoef approach to PDF
Norbert Ligterink, Control Engineering, University Twente, The Netherlands,
cr:17 Feb. 2003, ch:27 Feb. 2003
The Zoef approach to PDF is not a manual, or tutorial, or accurate, or anything
honorable. It is my ranting notes on trying to understand PDF and hack it
like I used to with Postscript. You might find things here you cannot find
anywhere else, because, after futile searches on the web I started to edit
PDF myself and see what one can do with the monster.
The Zoef approach is named after the Dutch folk hero Zoef de Haas, fast
as lightning but a bit sloppy.
Why PDF?
The PDF specifications are publicly available, and using it is license free
(more or less). And slowly it seems to become the standard. But most importantly,
it is not Microsoft. Ever tried living in a fascist society? Microsoft is
the modern fascism. Sending me or requesting a *.doc, *.ppt, or a *.xls is just another
form of oppression of free speech.
PDF Problems
There are several things that make dealing with PDF hard.
1) The BYTE COUNT: PDF seems to be designed by somebody with a tape drive
for a brain. At the bottom there is a reference table by byte address for
random access of the file. However, the count starts at the beginning of the
file, and the first byte is 0, in C logic. So if you try to edit PDF by hand,
make sure the byte count remains the same, or face updating the reference table.
2) BINARY CODE: PDF allows for stream data, which is unformatted and generally
binary, although people warn against this bad practice. So when you perform
global replacements and the like, be sure not to touch the stream data.
3) CROSS PLATFORM: since PDF is made and edited on UNIX, Mac, and the Unspeakable,
it has all combinations of LF (\n) and CR (\r). A simple replacement destroys
the integrity, and deleting CR changes the BYTE COUNT. I use at the moment:
s/\r/\n/g
s/\n\n/ \n/g
4) IMPLICIT CROSS REFERENCES and SHARED RESOURCES: PDF is made out of objects,
which are made out of other objects, and when reading PDF code you will spend
most of the time tracing the objects and their dependencies. PDF has
no hash table or dictionary to aid this. Actually the reference table is
just a piece of tape-drive-brain junk at the bottom.
5) REFERENCE MANUAL: it is a lyrical piece, going on about typesetting and
clipping, but poor on actual information. The examples are infuriatingly ambiguous.
And for people who give such a high priority to annotations, they can't be
bothered to provide any comments. To figure out the markers I had to count
the bytes myself.
6) ANNOTATIONS: The actual content (i.e. text and images) is only the third
sublayer in PDF. You start at the second line from the bottom, which has
a f**** BYTE COUNT to the head of the reference table "xref", which has a
BYTE COUNT for every object, "Root" among them. "Root" has references to the
"obj" numbers of "Pages" and "Annotations", and each of these has a "Type"
variable with an argument.
However, which object is "Root" you only know by reading the trailer before
the end and after the "xref". It seems somebody wanted to implement Knuthy
stuff (linked lists etc.) to the point of being religious about it.
7) MONEY: everybody wants to make money with PDF. The only decent freeware
seems to be pdflatex, which I adore. But I want to be able to do more. I
wasted money on buying Adobe Acrobat, which allows you to do next to nothing
and just seems to be promoting the new features of higher releases of PDF.
Most of the shareware stuff I tried seems pretty crap, and tuned towards
end users. Most annoying: hardly any program READS pdf in a decent way, beyond
making minimal alterations.
8) FILTERS: Please, anybody, tell me how to program the Filters in a decent
way, to make stuff readable instead of encoded. Manipulation starts with
reading.
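On the FILTERS question: for the common /FlateDecode filter the stream body is zlib-compressed data, so the Compress::Zlib module that ships with Perl can read it. A minimal sketch, assuming the raw bytes between "stream" and "endstream" have already been cut out and that no /DecodeParms predictor is in use. The name decode_flate is made up here, and the other filters (LZW, DCT, ...) need their own decoders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;   # core module, exports compress() and uncompress()

# decode_flate is a hypothetical helper name, not part of any PDF tool.
# It takes the raw bytes of a /FlateDecode stream and returns the
# decoded bytes, or undef if the data is not valid zlib data.
sub decode_flate {
    my ($raw) = @_;
    return uncompress($raw);
}

# Demo on made-up data: compress a small page description, decode it again.
my $content = "q\n100 0 0 100 50 50 cm\n/Im0 Do\nQ\n";
my $raw     = compress($content);
my $decoded = decode_flate($raw);
print $decoded;
```

Once decoded, a page description is plain text and can be attacked with the usual substitutions; writing it back means compressing again and fixing /Length.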
PERL? PERL! PERL!!!
I think that perl would be great to tackle PDF. The objects stand on their
own, except for the BYTE COUNT and SHARED RESOURCES. So read-write most of
the stuff, and change the things one can, and update the tables.
I see a number of little programs:
oinfo.pl: print object info and dependencies.
updat.pl: makes a new xref table and trailer.
indel.pl: insert and delete pages.
chnbb.pl: changes the MediaBox (bounding box).
insim.pl: insert an image.
extri.pl: extract an image.
clean.pl: change to unix format.
rotat.pl: rotate an obj.
trans.pl: translate non-binary parts into ascii.
Someday I will get around to it. Here is a little example of chnbb.pl without updating the tables; it just pads the
file with blanks, but fails when the output file is longer than the input
file. (Use unix format, otherwise more than one object might be strung together
with CR (\r) between them.)
#!/usr/bin/perl
if($#ARGV != 5){
print "TO CHANGE the bounding box\n";
print "USAGE: chnbb.pl xl yl xu yu infile.pdf outfile.pdf\n";
}else{
$infile = $ARGV[4];
open(Lista,"<",$infile)||die "Can't open $infile: $!";
$outfile = $ARGV[5];
open(OUT,">",$outfile)||die "Can't open $outfile: $!";
# PDF streams are binary: switch off any newline translation
binmode(Lista);
binmode(OUT);
#
##################################
#
$count = 0;
@listind = <Lista>;
select(OUT);
for $i (@listind){
$ll = length($i);
$count += $ll;
$i =~ s/MediaBox \[[^\]]*\]/MediaBox \[$ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3]\]/;
if(length($i) < $ll){
$i = $i." " x ($ll - length($i));}
print $i;
}
close(Lista);
close(OUT);
}
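The padding trick can also be kept inside the brackets, so that the blanks never spill over to the start of the next line. A sketch of that variant, not a replacement for chnbb.pl: set_mediabox is a name invented here, and it simply refuses (returns undef) when the new box needs more bytes than the old one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# set_mediabox (hypothetical name): replace the numbers of the first
# MediaBox in $line by @box, padding with blanks inside the brackets so
# that the line keeps exactly the same number of bytes.
# Returns the new line, undef if the new box does not fit, and the
# untouched line if there is no MediaBox in it.
sub set_mediabox {
    my ($line, @box) = @_;
    return $line unless $line =~ /MediaBox\s*\[([^\]]*)\]/;
    my $old = $1;
    my $new = join(" ", @box);
    return if length($new) > length($old);
    $new .= " " x (length($old) - length($new));
    $line =~ s/(MediaBox\s*\[)[^\]]*\]/$1${new}]/;
    return $line;
}

my $in  = "/MediaBox [0 0 612 792]\n";
my $out = set_mediabox($in, 0, 0, 595, 842);
print $out;
print length($in) == length($out) ? "same length\n" : "length changed\n";
```

Because the length of every line is preserved, the xref table and the startxref address stay valid without any recounting.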
clpdf.pl cleans out the CR, \r, or ^M from
the file, so that it can more easily be attacked by brute
force perlocity.
#!/usr/bin/perl
if($#ARGV != 1){
print "TO replace CR with LF outside streams\n";
print "USAGE: clpdf.pl infile.pdf outfile.pdf\n";
}else{
$infile = $ARGV[0];
$outfile = $ARGV[1];
#
open(Lista,"<",$infile)||die "Can't open $infile: $!";
open(OUT,">",$outfile)||die "Can't open $outfile: $!";
# PDF streams are binary: switch off any newline translation
binmode(Lista);
binmode(OUT);
#
# according to the pdf spec the keyword
# "stream" should be followed by a CR \r and a LF \n or
# just a LF. So there can only be one stream and/or one
# endstream on a line.
#
##################################
#
@listind = <Lista>;
#
$off = 1;
#
select(OUT);
for $i (@listind){
if($i =~ /(.*)endstream(.*)/){
#
# $st is stream, $as is ascii
#
$st = $1;
$as = $2;
$as =~ s/\r/\n/g;
#
# substitution patterns remove the \n
# (don't use chop because \n\n should be replaced with a blank\n
# in for example the xref)
#
$as = $as."\n";
$as =~ s/\n\n/ \n/g;
$i = $st."endstream".$as;
#
# if a new stream appears turn off the global substitution
#
if($as =~ /stream/i){
$off = 0;
}else{
$off = 1;
}
}else{
#
# in sentences without "stream" rely on $off and check for "stream"
#
if($off eq 1){
$i =~ s/\r/\n/g;
$i =~ s/\n\n/ \n/g;
}
if(($i =~ /[^d]stream/i) || ($i =~ /^stream/i)){
$off = 0;
}
}
print $i;
}
close(Lista);
close(OUT);
}
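The reason for using two substitutions instead of simply deleting the CR is that the BYTE COUNT must not change. A small sketch that shows this on a made-up fragment; unify_eol is a hypothetical name, and the text passed in is assumed to lie outside any stream.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# unify_eol (hypothetical name): the two substitutions used in clpdf.pl,
# applied to text that is known to lie outside any stream.
sub unify_eol {
    my ($text) = @_;
    $text =~ s/\r/\n/g;        # every CR becomes LF, one byte for one byte
    $text =~ s/\n\n/ \n/g;     # a former CR LF pair becomes blank + LF
    return $text;
}

my $dos  = "1 0 obj\r\n<<\r\n>>\r\nendobj\r\n";
my $unix = unify_eol($dos);
printf "%d bytes before, %d bytes after\n", length($dos), length($unix);
```

Note that three line ends in a row leave one double LF behind, since the second substitution works on pairs; that is harmless for the byte count.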
Quick and dirty:
extracting pages from PDF:
acroread -toPostScript -start $1 -end $2 < $3 > tmp.ps; ps2pdf tmp.ps $4;rm tmp.ps
uppdf.pl
This is the program that does all the counting.
If you hack a PDF in a text editor (remove a page, swap
pages, change a bounding box), run this program to
update the xref table and the startxref address at the
end. If there is more than one xref, this program might
fail without warning; however, it attempts to construct
a single xref table out of multiple tables.
Here is some stuff you might find inside a PDF file, and what
it means.
$number1 $number2 obj
.....
endobj
$number1 is the object number, $number2 the generation number (in a fresh
PDF it is usually 0). Together they identify the object that follows, up
to the "endobj" at the end.
The position of the first byte of $number1 is the BYTE COUNT address of the object.
$number1 $number2 R
a reference to the object above, basically: the INSERT obj HERE command.
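Tracing objects and their references by hand gets old quickly, so here is a rough sketch in the direction of the planned oinfo.pl. It is not the real program: list_objects is a name invented for this example, and it assumes unix line ends, "obj" headers at the start of a line, and no binary stream data that happens to look like an object header.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# list_objects (hypothetical name) scans a PDF held in one string and
# returns, per object, [number, generation, byte offset, list of refs].
# A real parser would have to skip the stream bodies.
sub list_objects {
    my ($pdf) = @_;
    my @objects;
    while ($pdf =~ /^(\d+)[ ]+(\d+)[ ]+obj\b(.*?)endobj/msg) {
        my ($num, $gen, $body) = ($1, $2, $3);
        my $offset = $-[1];                  # position of the first digit
        my @refs = $body =~ /(\d+[ ]+\d+[ ]+R)\b/g;
        push @objects, [$num, $gen, $offset, \@refs];
    }
    return @objects;
}

# Demo on a made-up two-object fragment.
my $pdf = "%PDF-1.3\n"
        . "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        . "2 0 obj\n<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>\nendobj\n";
for my $o (list_objects($pdf)) {
    printf "%d %d obj at byte %d refers to: %s\n",
        $o->[0], $o->[1], $o->[2], join(", ", @{ $o->[3] });
}
```

The byte offsets it prints are the numbers that belong in the xref table, provided the file was read with binmode so that characters and bytes are the same thing.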
stream
...................
endstream
some stream data, usually everything useful about the contents (text, images,
fonts) encoded as binary. It is important to know that the keyword
"stream" should always be followed by a \n, possibly as: \r\n.
It appears inside an object like:
$obj_number 0 obj
<< .... >>
stream
.......
endstream
endobj
where "<< ...>>" should contain useful information like the
"/Length" and the type of "/Filter" or "/Encoding".
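Since the stream data may contain anything, including the word "endstream", the safe way to cut it out is to trust "/Length". A sketch under the assumption that /Length is given as a plain number in the same object; an indirect "/Length 12 0 R" is deliberately not handled. The name stream_body is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# stream_body (hypothetical name): return the stream data of one object,
# given the text of the object from "n g obj" up to "endobj".
# Returns undef if there is no direct /Length or no stream keyword.
sub stream_body {
    my ($obj) = @_;
    # a direct number only; "/Length 12 0 R" (indirect) is refused
    return unless $obj =~ /\/Length\s+(\d+)\b(?!\s+\d+\s+R\b)/;
    my $length = $1;
    # the data starts right after "stream" and its \n or \r\n
    return unless $obj =~ /\bstream\r?\n/;
    return substr($obj, $+[0], $length);
}

# Demo on a made-up page description object.
my $data = "q\n100 0 0 100 50 50 cm\n/Im0 Do\nQ\n";
my $obj  = "4 0 obj\n<<\n/Length " . length($data) . "\n>>\n"
         . "stream\n$data" . "endstream\nendobj\n";
print stream_body($obj);
```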
%PDF-1.3
0226 0227 0207 0211
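the header of a PDF file, where the numbers stand for the character codes of
the four bytes on the second line. They lie above 127, so outside ascii proper,
and flag the file as binary.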
<< $X1 $Y1 $X2 $Y2 >> a dictionary, generally a set
of pairs with multiple functions, e.g., NEWCOMMAND: $X1 $Y1 can be a "/newname
argument" pair.
[ x y ... z ] an array, for example for a composite argument, or
a list of widths for fonts.
xref
0000000000 65535 f
$nn $nn
0000025190 00000 n
$nnnnnnnn1 $nnn2 n
$nnnnnnnn3 $nnn4 f
trailer
<<
/Size number_of_objs
/ID bla_bla
/Root obj_number_root obj_gen_number R
....
>>
startxref
reference_point_xref_of_root_BYTE_COUNT
%%EOF
The tail of the file. For a "functioning object" the identifier is "n" at
the end, with a trailing blank! $nnnnnnnn1 is the BYTE COUNT location, $nnn2
the generation number. For an empty object number the identifier is "f";
$nnnnnnnn3 and $nnn4 form a linked list with 0000000000 65535 f as the
top element with fixed format, $nnnnnnnn3 the pointer to the next free object
and $nnn4 the generation number. Sometimes there are $nn $nn numbers present;
these are subsection headers: the first object number and the number of entries
that follow.
The twenty character lines (including
LF (\n)) are all the objects starting with object 0 (empty), object 1, etc.
Note, there might be several tails in the file; amendments can be added
to the end. (MicroSods Word produces such stuff.)
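With the format of the tail known, writing a fresh one is mostly a matter of counting. The following is only a sketch of the idea, under loud assumptions: the old xref and trailer have already been cut off, line ends are in unix format, every "n 0 obj" starts a line, all generation numbers are 0, the objects are numbered 1..N without gaps, and no binary stream happens to contain something that looks like an object header. The name rebuild_tail is invented here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# rebuild_tail (hypothetical name): given the body of a PDF (header and
# objects only) and the object number of the /Root, return the body
# followed by a fresh xref table, trailer and startxref.
sub rebuild_tail {
    my ($body, $root) = @_;
    $body .= "\n" unless $body =~ /\n\z/;
    my %offset;
    while ($body =~ /^(\d+) 0 obj\b/mg) {
        $offset{$1} = $-[1];                 # BYTE COUNT of the first digit
    }
    my $size = keys(%offset) + 1;            # the objects plus entry 0
    my $xref = "xref\n0 $size\n0000000000 65535 f \n";
    for my $n (1 .. $size - 1) {
        die "object $n is missing\n" unless exists $offset{$n};
        # 20 bytes per entry, including the trailing blank and the \n
        $xref .= sprintf("%010d 00000 n \n", $offset{$n});
    }
    my $startxref = length($body);           # where "xref" will begin
    return $body . $xref
         . "trailer\n<<\n/Size $size\n/Root $root 0 R\n>>\n"
         . "startxref\n$startxref\n%%EOF\n";
}

# Demo on a made-up two-object body.
my $body = "%PDF-1.3\n"
         . "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
         . "2 0 obj\n<< /Type /Pages /Kids [ ] /Count 0 >>\nendobj\n";
print rebuild_tail($body, 1);
```

Files with several tails, free objects or nonzero generation numbers need more bookkeeping than this.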
Changing pages is quite simple:
The structure of a PDF starts with a root object, to which
the trailer points: /Root 1 0 R:
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
The object Pages contains references to all the pages:
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R 6 0 R .....@objnumber[$pagenumber]
@gennumber[$pagenumber] R ]
/Count $number
>>
endobj
Swapping two entries $objnumber $gennumber R
will swap the respective pages.
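The swap is easy enough to script. A sketch, assuming the text of the Pages object is at hand as one string and the /Kids array is flat (no nested Pages objects); swap_pages is a hypothetical name and pages are counted from 0. If the original array was not spaced with single blanks the length of the object changes, and then the xref table needs updating.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# swap_pages (hypothetical name): swap entries $i and $j of the /Kids
# array in the text of a Pages object and return the new text.
# The text is returned unchanged if there is no /Kids array or if an
# index is out of range.
sub swap_pages {
    my ($pages, $i, $j) = @_;
    return $pages unless $pages =~ /\/Kids\s*\[([^\]]*)\]/;
    my ($start, $len, $list) = ($-[1], $+[1] - $-[1], $1);
    my @kids = $list =~ /(\d+\s+\d+\s+R)\b/g;
    return $pages if $i > $#kids || $j > $#kids;
    @kids[$i, $j] = @kids[$j, $i];
    substr($pages, $start, $len) = " " . join(" ", @kids) . " ";
    return $pages;
}

# Demo: swap the first and the third page of a made-up Pages object.
my $pages = "2 0 obj\n<<\n/Type /Pages\n"
          . "/Kids [ 3 0 R 6 0 R 9 0 R ]\n/Count 3\n>>\nendobj\n";
print swap_pages($pages, 0, 2);
```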
The simplest jpeg image in PDF form is given by:
%PDF-1.2
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources <<
/Font << /F0 5 0 R>>
/XObject << /Im0 6 0 R >>
/ProcSet [ /PDF /Text /ImageC ] >>
/MediaBox [0 0 612 792] USletter
/CropBox [ x y (x+w) (y+h)] "position"
/Contents 4 0 R
>>
endobj
4 0 obj
<<
/Length 35
>>
stream
q
w 0 0 h x y cm "width + position"
/Im0 Do
Q
endstream
endobj
5 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F0
/BaseFont /Helvetica
/Encoding /MacRomanEncoding
>>
endobj
6 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
/Filter [ /DCTDecode ]
/Width w "image width"
/Height h "image height"
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Length "the length of the jpg file"
>>
stream
... here goes the whole jpg file as a stream ...
endstream
endobj
xref
0 7
0000000000 65535 f
0000000010 00000 n
0000000059 00000 n
0000000118 00000 n
0000000329 00000 n
0000000413 00000 n
0000000521 00000 n
trailer
<<
/Root 1 0 R
/Size 7
>>
startxref
697 + length jpg (+/- a few bytes)
%%EOF
Make sure you update the image size and the xref table.
Even an image requires font resources, it seems. I edited this
PDF from a slightly longer one generated by ImageMagick
(shell# convert image.jpg image.pdf).
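Counting the offsets by hand for every image is no fun, so here is a sketch that assembles the same six objects and does the counting itself. Assumptions to be aware of: jpeg_pdf is an invented name, the pixel width and height have to be passed in (the JPEG header is not parsed), the image is put at the lower left corner at one point per pixel, the CropBox is left out, and the JPEG data must have been read with binmode.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# jpeg_pdf (hypothetical name): wrap raw JPEG bytes of $w x $h pixels in
# a one-page PDF and return the whole file as one string.
sub jpeg_pdf {
    my ($jpeg, $w, $h) = @_;
    my $content = "q\n$w 0 0 $h 0 0 cm\n/Im0 Do\nQ\n";
    my @obj = (
        "<<\n/Type /Catalog\n/Pages 2 0 R\n>>",
        "<<\n/Type /Pages\n/Kids [ 3 0 R ]\n/Count 1\n>>",
        "<<\n/Type /Page\n/Parent 2 0 R\n/Resources <<\n"
          . "/Font << /F0 5 0 R >>\n/XObject << /Im0 6 0 R >>\n"
          . "/ProcSet [ /PDF /Text /ImageC ] >>\n"
          . "/MediaBox [0 0 612 792]\n/Contents 4 0 R\n>>",
        "<<\n/Length " . length($content) . "\n>>\n"
          . "stream\n$content" . "endstream",
        "<<\n/Type /Font\n/Subtype /Type1\n/Name /F0\n"
          . "/BaseFont /Helvetica\n/Encoding /MacRomanEncoding\n>>",
        "<<\n/Type /XObject\n/Subtype /Image\n/Name /Im0\n"
          . "/Filter [ /DCTDecode ]\n/Width $w\n/Height $h\n"
          . "/ColorSpace /DeviceRGB\n/BitsPerComponent 8\n"
          . "/Length " . length($jpeg) . "\n>>\n"
          . "stream\n$jpeg\nendstream",
    );
    my $pdf = "%PDF-1.2\n";
    my @offset;
    for my $i (0 .. $#obj) {
        push @offset, length($pdf);          # BYTE COUNT of this object
        $pdf .= ($i + 1) . " 0 obj\n$obj[$i]\nendobj\n";
    }
    my $startxref = length($pdf);
    my $size = @obj + 1;                     # six objects plus entry 0
    $pdf .= "xref\n0 $size\n0000000000 65535 f \n";
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offset;
    $pdf .= "trailer\n<<\n/Root 1 0 R\n/Size $size\n>>\n"
          . "startxref\n$startxref\n%%EOF\n";
    return $pdf;
}

# Demo with placeholder bytes; this is NOT a real image.
my $fake_jpeg = "\xFF\xD8 not a real image \xFF\xD9";
my $pdf = jpeg_pdf($fake_jpeg, 100, 80);
print length($pdf), " bytes assembled\n";
```

For a real file, read the JPEG in one piece from a handle opened with binmode and print the result to an output handle that is in binmode as well.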