Ross Moore
2016-11-22 02:52:22 UTC
Hi all,
I've recently returned to tackle the task of generating Tagged PDF using pdftex,
in particular for PDF 2.0, PDF/A-1a and PDF/UA format specifications.
This is using ordinary pdftex, not the one with extra primitives specially for the
tagging structures.
So far I've had more success than I'd originally thought possible.
The attached document below is an example that fully conforms to PDF/A-1a.
It also passes all the Accessibility tests in Acrobat Pro DC.
However there are some smallish issues that would be good to have addressed.
Using \pdfliteral for the tagging, to insert material before and after textual content,
the page stream looks like this:
36 0 obj
<<
/Length 2718
stream
1 0 0 1 108.737 686 cm
/T <</MCID 0 >> BDC
1 0 0 1 -108.737 -686 cm
BT
/F15 10.9091 Tf 108.737 686 Td [(Here)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 25.788 0 Td [(is)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 10.97 0 Td [(a)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 9.091 0 Td [(paragraph)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 52.182 0 Td [(b)28(y)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 15.151 0 Td [(itself.)]TJ
ET
1 0 0 1 252.586 686 cm
EMC
…
Note:
1. the use of interword spaces between words;
2. the coordinate space adjustments prior to tags and BT textual content:
1 0 0 1 108.737 686 cm
/T <</MCID 0 >> BDC
1 0 0 1 -108.737 -686 cm
BT
1 0 0 1 -179.576 -10.095 cm
/T <</MCID 4 >> BDC
1 0 0 1 -108.737 -563.655 cm
BT
/F15 10.9091 Tf 108.737 563.655 Td [(And)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 23.94 0 Td [(another)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 40.03 0 Td [(with)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 24.849 0 Td [(some)]TJ
ET
1 0 0 1 224.889 563.655 cm
EMC
/T <</MCID 5 >> BDC
1 0 0 1 -224.889 -563.655 cm
BT
/F16 10.9091 Tf/F17 1 Tf( )Tj/F16 10.9091 Tf 224.889 563.655 Td [(b)-32(old)]TJ/F17 1 Tf( )Tj/F16 10.9091 Tf 28.227 0 Td [(text)]TJ
ET
1 0 0 1 275.245 563.655 cm
EMC
/T <</MCID 6 >> BDC
1 0 0 1 -275.245 -563.655 cm
BT
/F15 10.9091 Tf/F17 1 Tf( )Tj/F15 10.9091 Tf 275.245 563.655 Td [(.)]TJ
ET
1 0 0 1 283.124 563.655 cm
EMC
The length of the output can be reduced (by approx. 15-20%) using \pdfliteral direct ….
But there is a drawback, since BT … ET and BDC … EMC operators
must be correctly nested, else the PDF is malformed.
viz.
36 0 obj
<<
/Length 2232
stream
…
...
BT
/F17 1 Tf 108.737 563.655 Td [( )]TJ
/T <</MCID 4 >> BDC
/F15 10.9091 Tf 0 0 Td [(And)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 23.94 0 Td [(another)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 40.03 0 Td [(with)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 24.849 0 Td [(some)]TJ
EMC
/T <</MCID 5 >> BDC
/F16 10.9091 Tf/F17 1 Tf( )Tj/F16 10.9091 Tf 27.333 0 Td [(b)-32(old)]TJ/F17 1 Tf( )Tj/F16 10.9091 Tf 28.227 0 Td [(text)]TJ
EMC
/T <</MCID 6 >> BDC
/F15 10.9091 Tf/F17 1 Tf( )Tj/F15 10.9091 Tf 22.129 0 Td [(.)]TJ
EMC
ET
Note here that \pdffakespace is used immediately before the first \pdfliteral direct.
/T <</MCID 4 >> BDC
BT
/F15 10.9091 Tf 0 0 Td [(And)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 23.94 0 Td [(another)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 40.03 0 Td [(with)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 24.849 0 Td [(some)]TJ
EMC
/T <</MCID 5 >> BDC
...
However that "fake space" is viewed as content, for Accessibility purposes.
It must therefore be within tags, but that cannot be achieved this way.
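The nesting rule discussed above can be checked mechanically on an uncompressed stream excerpt. The following is only a rough sketch, not part of pdftex: the name check_nesting is made up for illustration, and the tokenizing is deliberately naive (it skips simple string operands but does not handle nested parentheses, hex strings or inline images).

```python
import re

# Each opening operator and the closing operator it must be paired with.
OPENERS = {"BT": "ET", "BDC": "EMC", "BMC": "EMC"}
# Simple (non-nested) string operands such as (And) or ( ).
STRING_RE = re.compile(r"\((?:\\.|[^\\()])*\)")
# The operators of interest, as whole tokens and not as part of a /Name.
OP_RE = re.compile(r"(?<![/\w])(BT|ET|BDC|BMC|EMC)(?!\w)")


def check_nesting(stream):
    """Return a list of nesting problems found in a content-stream excerpt."""
    problems = []
    pending = []  # closing operators still expected, innermost last
    # Blank out string operands so that text like (ET) is not read as an operator.
    stripped = STRING_RE.sub("()", stream)
    for match in OP_RE.finditer(stripped):
        op = match.group(1)
        if op in OPENERS:
            pending.append(OPENERS[op])
        elif not pending:
            problems.append(f"{op} without a matching opener")
        elif pending[-1] == op:
            pending.pop()
        else:
            problems.append(f"{op} found where {pending[-1]} was expected")
            if op in pending:
                # Drop the innermost pending opener of this kind to resynchronise.
                del pending[len(pending) - 1 - pending[::-1].index(op)]
    problems.extend(f"missing {closer} at end of stream" for closer in reversed(pending))
    return problems


good = "BT\n/T <</MCID 4 >> BDC\n/F15 10.9091 Tf 0 0 Td [(And)]TJ\nEMC\nET\n"
bad = "/T <</MCID 4 >> BDC\nBT\n/F15 10.9091 Tf 0 0 Td [(And)]TJ\nEMC\n"
print(check_nesting(good))
print(check_nesting(bad))
```

The first excerpt has BDC … EMC inside BT … ET and is reported clean; the second opens BDC before BT but closes it while the text object is still open, which is the malformed case.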
So here are my requests.
1. Please add a new mode to \pdfliteral
e.g. \pdfliteral text {….}
which checks whether we have pdf_doing_text as true.
If so, just do what \pdfliteral direct does;
otherwise do
pdf_print_ln("BT");
pdf_doing_text := true;
then place the contents literally.
When used correctly, textual content would follow, without needing to change pdf_doing_text
nor include the initial "BT".
Presumably there'll need to be an adjustment to
procedure pdf_begin_text; {begin a text section}
to not do pdf_print_ln("BT"); when pdf_doing_text is already true.
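To make the intended behaviour concrete, here is a toy model of request 1 in Python. It is only a sketch of the proposal, not of pdftex's actual code: the class PageStream and its methods are hypothetical names, with doing_text standing in for pdf_doing_text and begin_text for the adjusted pdf_begin_text.

```python
class PageStream:
    """Toy model of a page content stream with a 'doing text' flag."""

    def __init__(self):
        self.out = []            # lines written to the stream so far
        self.doing_text = False  # stands in for pdf_doing_text

    def begin_text(self):
        # Adjusted pdf_begin_text: only emit BT if no text object is open.
        if not self.doing_text:
            self.out.append("BT")
            self.doing_text = True

    def end_text(self):
        if self.doing_text:
            self.out.append("ET")
            self.doing_text = False

    def literal_text(self, contents):
        # Requested 'text' mode: make sure BT has been written, then
        # place the contents literally, leaving the text object open.
        self.begin_text()
        self.out.append(contents)

    def show(self, operators):
        # Ordinary typeset text: opens a text object only when needed.
        self.begin_text()
        self.out.append(operators)


page = PageStream()
page.literal_text("/T <</MCID 4 >> BDC")
page.show("/F15 10.9091 Tf 108.737 563.655 Td [(And)]TJ")
page.literal_text("EMC")
page.end_text()
print("\n".join(page.out))
```

In this model the tag lands after BT, so BDC … EMC sits inside BT … ET without any fake space being needed to force the text object open.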
2. It would be great to be able to do away with the \pdfinterwordspaceon/off for every word.
That is, generate shorter output with explicit spaces (when the font has it in slot 32 ) such as:
/T <</MCID 4 >> BDC
/F15 10.9091 Tf 0 0 Td [(And )<num>(another )<num>(with )<num>(some )]TJ
EMC
where each <num> is calculated using the width of the space character in the font.
Not only does this reduce the (uncompressed) size considerably, but it would also
allow for the "Reflow" effect in Adobe Reader and Acrobat Pro (and other?) PDF readers.
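One plausible way to compute each <num> is sketched below. It rests on assumptions not stated above: that the width TeX wants for each interword glue is known in pt, that the space glyph's width is given in thousandths of a text-space unit, and that character spacing, word spacing and horizontal scaling are at their defaults. A number in a TJ array is subtracted from the horizontal position in thousandths of a unit scaled by the font size, so a negative value widens the gap. The function names are made up for illustration, and parentheses or backslashes inside words are not escaped.

```python
def tj_adjustment(space_width_units, glue_pt, font_size_pt):
    """Number to place after '(word )' so the gap equals glue_pt.

    The space glyph already advances space_width_units / 1000 * font_size_pt,
    and a TJ number n moves the position back by n / 1000 * font_size_pt.
    """
    return space_width_units - 1000.0 * glue_pt / font_size_pt


def tj_array(words, glues_pt, space_width_units, font_size_pt):
    """Build a TJ operand with explicit spaces; glues_pt[i] follows words[i]."""
    parts = []
    for i, word in enumerate(words):
        parts.append(f"({word} )")
        if i < len(glues_pt):
            num = tj_adjustment(space_width_units, glues_pt[i], font_size_pt)
            parts.append(f"{num:.3f}")
    return "[" + "".join(parts) + "]TJ"


# Hypothetical font: space glyph 500 units wide, set at 10pt, glue of 6pt.
print(tj_array(["And", "another"], [6.0], 500.0, 10.0))
```

With these made-up numbers the space glyph supplies 5pt of the 6pt gap, so the adjustment is -100, which adds the missing 1pt.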
All the best,
Ross
Dr Ross Moore
Mathematics Dept | Level 2, S2.638 AHH
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114
M:+61 407 288 255 | E: ***@mq.edu.au
http://www.maths.mq.edu.au