[pdftex] Please make the CreationDate, ModDate and ID field deterministic
Maria Valentina Marin
2015-07-11 10:33:04 UTC
Hello,

I wanted to expand on the thread started by Nicolas Boulenguez:

http://tug.org/mailman/htdig/pdftex/2015-May/008940.html

That thread explains ways to make pdftex produce reproducible output.

I propose the attached patch which does not change the default behaviour
of pdftex but if the environment variable SOURCE_DATE_EPOCH is set it
causes pdftex to produce reproducible PDF files by modifying the
behaviour of the functions initstarttime() and printID().

The environment variable SOURCE_DATE_EPOCH contains a Unix timestamp
(seconds since the epoch) as an integer [1]. The function printID was
modified to obtain the time as in the patch from Nicolas Boulenguez. In
contrast to that patch, though, the ID still uses the output build
directory as part of its hash. This is because the Debian reproducible
builds team decided not to change the path between builds, which makes
the build path deterministic by default. As far as Debian goes,
stripping the path is not required, though we will not complain if you
do =)
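
As a rough illustration of the behaviour described above (in Python
rather than pdftex's C, and not the patch itself): the sketch takes the
build time from SOURCE_DATE_EPOCH when it is set, falls back to the
current clock otherwise, and hashes the time together with the output
path into an ID. The helper names build_time and trailer_id, and the
exact string fed to MD5, are assumptions made for illustration.

```python
import hashlib
import os
import time


def build_time(environ=os.environ):
    """Return the build time as Unix epoch seconds.

    Uses SOURCE_DATE_EPOCH when it is set, otherwise the current clock,
    so the default behaviour is unchanged.
    """
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is None:
        return int(time.time())
    return int(value)  # raises ValueError if the variable is not an integer


def trailer_id(epoch, directory, jobname):
    """Hypothetical ID: MD5 over a UTC time string plus the output path."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(epoch))
    data = f"{stamp}{directory}/{jobname}".encode("utf-8")
    return hashlib.md5(data).hexdigest().upper()
```

With the variable set and the build path unchanged, two runs yield the
same ID; changing the path changes the ID, which is why a fixed build
path matters in this scheme.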

This environment variable was introduced by the Debian reproducible
builds team but it is meant to be used by any distribution. The package
help2man now supports it, and we are in the process of persuading the
maintainers of txt2man, epydoc, GCC, Doxygen and libxslt to do the same.

In our framework we are successfully using a modified version of pdftex
that includes this patch to build packages and test them for
reproducibility.

Thanks!
akira

P.S. I am starting a new thread because I could not find a way to reply
to the one Nicolas started.

[1] https://wiki.debian.org/ReproducibleBuilds/TimestampsProposal
Karl Berry
2015-07-11 22:02:02 UTC
Hi,

Post by Maria Valentina Marin
I propose the attached patch which does not change the default behaviour
of pdftex but if the environment variable SOURCE_DATE_EPOCH is set it

I'm not sure.

Meanwhile, Thanh has already made changes in the pdftex source
repository, revision 724 at
https://foundry.supelec.fr/scm/viewvc.php/branches/stable/source/?root=pdftex
(project page: https://foundry.supelec.fr/projects/pdftex/)

I append his message about it (sent privately). Will it suffice for
you? --karl

--------------------------------------------
From: The Thanh Han
[...]
2 new commands were added. Usage examples:

\pdfinfoomitdate=1
% tell pdftex not to write the default CreationDate and ModDate;
% however, custom values specified via \pdfinfo{} will always be output

\pdftrailerid{} % tell pdftex to omit trailer id, or
\pdftrailerid{abc} % provide custom value for trailer id

A minimal test file was included in the commit:
https://foundry.supelec.fr/scm/viewvc.php/branches/stable/tests/03-deterministic-output/?root=pdftex

Regards,
Thanh
Maria Valentina Marin
2015-07-12 12:29:24 UTC
Hi,
Post by Karl Berry
Meanwhile, Thanh has already made changes in the pdftex source
repository, revision 724 at
https://foundry.supelec.fr/scm/viewvc.php/branches/stable/source/?root=pdftex
(project page: https://foundry.supelec.fr/projects/pdftex/)
I append his message about it (sent privately). Will it suffice for
you? --karl
It suffices, but it shifts the problem somewhere else. With the
SOURCE_DATE_EPOCH environment variable, distributions like Debian,
openSUSE, and Ubuntu can set a single variable during all their builds
and immediately get reproducible output from every package that
supports it.

If pdftex does not support that environment variable but instead only
offers LaTeX macros to make the timestamps and ID field reproducible,
then we have to fix every tool that makes use of pdflatex. In Debian,
616 source packages build-depend on pdflatex directly or indirectly.

Surely many of them can be fixed by fixing other toolchain packages like
Doxygen-latex, but others will have to be patched manually.

We also often get a response like "Why do I have to patch my software?
Can't that other software be changed instead?". If you decide not to
integrate support for SOURCE_DATE_EPOCH, we would appreciate it if you
could lay out your reasoning for that decision so that we can refer
other package maintainers to that email.

Thank you for your consideration,

Cheers,
akira
Nicolas Boulenguez
2015-07-13 14:08:51 UTC
Hello all.

As the original submitter, I would have liked an opportunity to
discuss the choices, or else a word of explanation about them, or at
least an alert about the commit, as I did mention that I was not
reading the list [1].

I understand that upstream authors of widely used software need to
discuss privately before announcing changes, but now that you are at
the coding stage, please consider answering a few questions about the
selected design.

The CreationDate and ModDate are optional. Why set them, unless
\pdfinfo asks for that?

If you set them by default, and we do not find a work-around like [2],
everyone concerned with reproducible builds will need to insert
\pdfinfo{/CreationDate(D:19900101000000Z00'00')/ModDate(D:19900101000000Z00'00')}
or
\pdfinfo{/CreationDate(D:$DATEZ00'00')/ModDate(D:$DATEZ00'00')}
into every TeX source during each build, or patch pdftex for their own
use. What concrete need does \pdfinfoomitdate answer?
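
To make the per-build burden concrete, here is a minimal sketch of what
a build script would have to generate and inject into each TeX source.
It is written in Python purely for illustration; the function name
pdfinfo_line is hypothetical, and the date layout simply copies the
form quoted above.

```python
import time


def pdfinfo_line(epoch):
    """Build a \\pdfinfo line with both dates fixed to `epoch`, in UTC."""
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(epoch))
    date = f"D:{stamp}Z00'00'"
    return rf"\pdfinfo{{/CreationDate({date})/ModDate({date})}}"
```

Calling pdfinfo_line(631152000) reproduces the first \pdfinfo line
quoted above, since 631152000 is 1990-01-01 00:00:00 UTC.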

The ID field is optional. Why set it, unless \pdftrailerid or similar
asks for that?

If you set it by default, the following suggestion may help doing so
without a new \pdftrailerid macro.

As I understand section 10.3 of the PDF specification [3], an MD5 sum
of the time, directory and base name is only a very vague
implementation suggestion. Any randomly generated string, or a
non-cryptographic hash sum of the PDF file, would satisfy the purpose
better, with the latter being reproducible.

For example, would you consider defining the ID as an 8-character
hexadecimal representation of a 32-bit XOR sum of all contents written
to the output file so far? The performance impact and collision risk
would both be very small.
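
One possible reading of this suggestion, sketched in Python for
illustration only: the bytes written so far are XORed into a 32-bit
accumulator and the result is printed as 8 hexadecimal digits. The
function name xor32_id and the big-endian lane layout are assumptions;
the proposal itself does not fix these details.

```python
def xor32_id(chunks):
    """Fold an iterable of byte strings into a 32-bit XOR sum (8 hex digits).

    Assumption: each byte is XORed into one of four big-endian lanes,
    chosen by its position in the overall output stream.
    """
    acc = 0
    offset = 0  # total bytes seen so far, so chunk boundaries do not matter
    for chunk in chunks:
        for byte in chunk:
            shift = 8 * (3 - offset % 4)
            acc ^= byte << shift
            offset += 1
    return f"{acc:08X}"
```

The result depends only on the bytes written, so identical output files
get identical IDs. Note that a plain XOR cannot detect aligned 4-byte
words being reordered or repeated an even number of times.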

Is there any benefit in keeping the current implementation by default?

In the current implementation, what is the benefit of taking gmtime()
into account for the MD5 sum, instead of CreationDate?

[1] http://tug.org/mailman/htdig/pdftex/2015-May/008940.html
[2] http://tug.org/pipermail/pdftex/2015-July/008954.html
[3] http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
Karl Berry
2015-07-13 22:39:50 UTC
Akira, Nicolas -

Thanh is away for ~3 weeks. He will review both the SOURCE_DATE_EPOCH
patch (which I suspect will be fine) and Nicolas's other comments when
he's back.

The current code is not setting anything in stone. It was a first shot,
so we could try to garner feedback from some TeX people who chimed in on
the conversation (elsewhere).

Best,
Karl
Maria Valentina Marin
2015-08-12 11:08:35 UTC
Hi,
Post by Karl Berry
Thanh is away for ~3 weeks. He will review both the SOURCE_DATE_EPOCH
patch (which I suspect will be fine) and Nicolas's other comments when
he's back.
In addition to my patch to honour $SOURCE_DATE_EPOCH please find
attached an additional patch which uses UTC in the printed timestamps to
also make the timezone reproducible.

I have patched the function makepdftime to use gmtime if
$SOURCE_DATE_EPOCH is set. Otherwise the old behaviour will be kept.
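
The switch described here can be sketched as follows. This is Python
for illustration, not the patched C function; the name make_pdf_time
and the exact output layout are assumptions.

```python
import os
import time


def make_pdf_time(epoch, environ=os.environ):
    """Format `epoch` as a PDF date string.

    UTC is used when SOURCE_DATE_EPOCH is set; otherwise local time
    with its UTC offset is used, as in the old behaviour.
    """
    if "SOURCE_DATE_EPOCH" in environ:
        return time.strftime("D:%Y%m%d%H%M%SZ", time.gmtime(epoch))
    local = time.localtime(epoch)
    offset = local.tm_gmtoff  # seconds east of UTC
    sign = "+" if offset >= 0 else "-"
    hours, minutes = divmod(abs(offset) // 60, 60)
    stamp = time.strftime("D:%Y%m%d%H%M%S", local)
    return f"{stamp}{sign}{hours:02d}'{minutes:02d}'"
```

With the variable set, the string no longer depends on the build
machine's timezone; without it, the local offset still appears in the
output.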

I have tested the patch in our autobuilders against 4 Debian packages
that use pdflatex and these become reproducible.

Cheers,
akira
Maria Valentina Marin
2015-12-01 09:37:16 UTC
Hi,

I was wondering what has happened with this proposal since I did not get
any reply since August.
Post by Maria Valentina Marin
Post by Karl Berry
Thanh is away for ~3 weeks. He will review both the SOURCE_DATE_EPOCH
patch (which I suspect will be fine) and Nicolas's other comments when
he's back.
In addition to my patch to honour $SOURCE_DATE_EPOCH please find
attached an additional patch which uses UTC in the printed timestamps to
also make the timezone reproducible.
I have patched the function makepdftime to use gmtime if
$SOURCE_DATE_EPOCH is set. Otherwise the old behaviour will be kept.
I have tested the patch in our autobuilders against 4 Debian packages
that use pdflatex and these become reproducible.
Thanks!
akira
Akira Kakuto
2015-12-01 22:55:10 UTC
Dear Maria,
Post by Maria Valentina Marin
I was wondering what has happened with this proposal since I did not get
any reply since August.
Thanh applied your SOURCE_DATE_EPOCH patch on 18 August.

Thanks,
Akira Kakuto