Discussion:
[pdftex] Consider removing dependence of PDF ID field on current directory name
Anders Kaseorg
2017-09-02 05:52:43 UTC
For the Debian Reproducible Builds effort, I’ve been debugging the
nondeterministic behavior of pdftex when invoked by dblatex. I think the
last remaining issue involves the PDF ID field, which pdftex generates by
hashing the current time and the full path to the input file (function
printID). Previous discussion
(https://tug.org/pipermail/pdftex/2015-May/008940.html,
https://tug.org/pipermail/pdftex/2015-July/008952.html) has led to support
for the SOURCE_DATE_EPOCH environment variable, which nicely controls the
time nondeterminism. This leaves the output depending on the input path.
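To make the remaining dependence concrete, here is a minimal sketch in C. It is not pdftex's code: pdftex uses MD5, while this sketch substitutes the much simpler 64-bit FNV-1a hash so that it stays self-contained. The helper names (fnv1a_append, sketch_id) and the sample time string are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a (64-bit) over a C string, continuing from state h.
   Stand-in for MD5, chosen only to keep the sketch short. */
static uint64_t fnv1a_append(uint64_t h, const char *s)
{
    for (; *s != '\0'; s++) {
        h ^= (unsigned char) *s;
        h *= UINT64_C(0x100000001b3);   /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hypothetical helper: build an ID from a time string, an optional
   directory, and a file name. Pass dir == NULL to leave the directory
   out of the hash. Writes 16 hex digits plus a terminating NUL. */
static void sketch_id(const char *time_str, const char *dir,
                      const char *file_name, char *out, size_t out_size)
{
    uint64_t h = UINT64_C(0xcbf29ce484222325);  /* FNV-1a offset basis */

    assert(out_size >= 17);
    h = fnv1a_append(h, time_str);
    if (dir != NULL) {
        h = fnv1a_append(h, dir);
        h = fnv1a_append(h, "/");
    }
    h = fnv1a_append(h, file_name);
    snprintf(out, out_size, "%016llX", (unsigned long long) h);
}
```

With the time string held fixed (as SOURCE_DATE_EPOCH would do), calling sketch_id with two different directory names still gives two different IDs, whereas passing NULL for the directory gives the same ID on every run.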

For many packages that’s sufficient, as Debian does not (yet?) require
determinism under build path variation in its definition of
reproducibility. However, dblatex invokes pdflatex on generated input
within a randomly named temporary directory. That makes it hard for
packages using dblatex to build reproducibly, even when the main build
path is fixed, without resorting to per-package patches to remove the ID
field.

Earlier it was mentioned that the algorithm used by pdftex’s printID was
inspired by the section “File Identifiers” in the PDF Reference, which
suggests hashing the time, pathname, file size, and document information
dictionary. However, a note in an appendix makes it clear that the
particular algorithm is unimportant:
“Note that the calculation of the file identifier need not be
reproducible; all that matters is that the identifier is likely to be
unique. For example, two implementations of this algorithm might use
different formats for the current time, causing them to produce different
file identifiers for the same file created at the same time, but the
uniqueness of the identifier is not affected.”
(https://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_1-7.pdf,
Appendix H, implementation note 163)

With that in mind, could printID be changed to avoid depending on the
current directory name, either by default, or if the default won’t be
changed, then perhaps just when a reproducible build has been requested
via the presence of SOURCE_DATE_EPOCH?
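For the second option, the check could be as small as the following sketch. The helper names are hypothetical and nothing like this exists in pdftex as far as this message knows; only getenv() is a real library call. "Presence" is taken literally here, so a variable that is set but empty still counts as a request for a reproducible build; that is an assumption, not established behavior.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: given the value of SOURCE_DATE_EPOCH (or NULL
   if it is unset), return 1 if the directory name should still be
   hashed into the ID, 0 if it should be left out. */
static int should_hash_directory(const char *source_date_epoch)
{
    return source_date_epoch == NULL;
}

/* Hypothetical wrapper that reads the real environment. */
static int should_hash_directory_from_env(void)
{
    return should_hash_directory(getenv("SOURCE_DATE_EPOCH"));
}
```

printID would then skip the getcwd() part whenever should_hash_directory_from_env() returns 0.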

Anders
Karl Berry
2017-09-02 16:43:40 UTC
With that in mind, could printID be changed to avoid depending on the
current directory name, either by default

I think we can change it by default. Patches welcome. Else I will get to
it when I have a chance, which won't be for quite some time. --thanks, karl.
Anders Kaseorg
2017-09-02 20:45:50 UTC
Post by Anders Kaseorg
With that in mind, could printID be changed to avoid depending on the
current directory name, either by default
I think we can change it by default. Patches welcome.
Alright. How does this look?

Anders


diff --git a/source/src/texk/web2c/pdftexdir/ChangeLog b/source/src/texk/web2c/pdftexdir/ChangeLog
index 116541e8..a5ebe6ea 100644
--- a/source/src/texk/web2c/pdftexdir/ChangeLog
+++ b/source/src/texk/web2c/pdftexdir/ChangeLog
@@ -1,3 +1,9 @@
+2017-09-02 Anders Kaseorg <***@mit.edu>
+
+ * utils.c (printID): Do not hash the current directory name into
+ the PDF ID field, since any randomness in it would lead to
+ non-reproducible builds.
+
2017-03-16 Pali Roh\'ar <***@gmail.com>

Allow .enc files for bitmap fonts, following thread at
diff --git a/source/src/texk/web2c/pdftexdir/utils.c b/source/src/texk/web2c/pdftexdir/utils.c
index 67ff8e9d..fda97666 100644
--- a/source/src/texk/web2c/pdftexdir/utils.c
+++ b/source/src/texk/web2c/pdftexdir/utils.c
@@ -697,9 +697,10 @@ void unescapehex(poolpointer in)
</blockquote>
This stipulates only that the two IDs must be identical when the file is
created and that they should be reasonably unique. Since it's difficult
- to get the file size at this point in the execution of pdfTeX and
- scanning the info dict is also difficult, we start with a simpler
- implementation using just the first two items.
+ to get the file size at this point in the execution of pdfTeX, scanning
+ the info dict is also difficult, and any randomness in the current
+ directory name would lead to non-reproducible builds, we start with a
+ simpler implementation using just the current time and the file name.
*/
void printID(strnumber filename)
{
@@ -707,29 +708,13 @@ void printID(strnumber filename)
     md5_byte_t digest[16];
     char id[64];
     char *file_name;
-    char pwd[4096];
     /* start md5 */
     md5_init(&state);
     /* get the time */
     initstarttime();
     md5_append(&state, (const md5_byte_t *) start_time_str, strlen(start_time_str));
     /* get the file name */
-    if (getcwd(pwd, sizeof(pwd)) == NULL)
-        pdftex_fail("getcwd() failed (%s), path too long?", strerror(errno));
-#ifdef WIN32
-    {
-        char *p;
-        for (p = pwd; *p; p++) {
-            if (*p == '\\')
-                *p = '/';
-            else if (IS_KANJI(p))
-                p++;
-        }
-    }
-#endif
     file_name = makecstring(filename);
-    md5_append(&state, (const md5_byte_t *) pwd, strlen(pwd));
-    md5_append(&state, (const md5_byte_t *) "/", 1);
     md5_append(&state, (const md5_byte_t *) file_name, strlen(file_name));
     /* finish md5 */
     md5_finish(&state, digest);
Paul Vojta
2017-09-05 19:39:44 UTC
Post by Anders Kaseorg
Post by Anders Kaseorg
With that in mind, could printID be changed to avoid depending on the
current directory name, either by default
I think we can change it by default. Patches welcome.
Alright. How does this look?
Anders
I would suggest s/randomness/variability/ , since people often build in different
directories but don't often do so in directories whose names are chosen using
randomness per se (e.g., mktemp(1)).

--Paul Vojta, vojta at math dot berkeley dot edu
