Discussion:
[pdftex] Re: [Cjk] Conflict between pinyin and pifont?(About CJKbookmarks)
Edward G.J. Lee
2006-02-13 11:50:04 UTC
[Cc'd to the pdfTeX list as well. Feel free to change the email Subject if needed.]
Under the CJK UTF8 environment, when I use the CJKbookmarks and unicode
options of hyperref, it leaves UTF-8 encoded characters alone.
No, you cannot use option unicode then. If there are mixed
bookmarks, you can use \hypersetup to switch the behaviour
of hyperref. However option unicode must also be added to
\usepackage[unicode]{hyperref}
\hypersetup{unicode=false,CJKbookmarks=true}
In this situation, the PDF outline will lose the 0xFEFF BOM, and it will
be UTF-8 hexadecimal if I use \texorpdfstring.

http://edt1023.sayya.org/tex/tmp/utf8bks0.tar.gz

I must set `unicode=true,CJKbookmarks=true' to preserve the octal
UTF-16BE and have 0xFEFF (octal \376\377) inserted automatically.

http://edt1023.sayya.org/tex/tmp/utf8bks1.tar.gz

If we don't use \texorpdfstring, then the UTF-8 characters will end up
in the PDF outline as UTF-8 hexadecimal, which is of course wrong.
But is it possible to change the PDF outlines' encoding to UTF-16BE
via hyperref or CJK itself?
I am not a CJK expert, something for Werner.
The problem will be the recodings, I don't think someone wants to
implement something like
&Encode::from_to($char, "Big5", "UCS-2");
at TeX macro level.
hyperref offers two hooks where the outline strings can be
* \pdfstringdefPreHook: This hook is used before the
string is expanded and is mainly used for redefinitions;
I recommend to use the following wrapper to add something
\pdfstringdefDisableCommands{%
\def\nastyMacro{nice contents}%
}%
Thanks for the hint.

But I don't think I can write the TeX macro to convert the encoding
to UTF16BE [yet]. :)
* \pdfstringdefPostHook#1: #1 contains the macro with the
expanded bookmark string. Thus the bookmark string
can be postprocessed.
Also you can make feature requests for encoding conversions
to the projects pdfTeX and/or ExTeX.
Actually, pdfTeX should handle copy, search, and paste of CJK characters
in PDFs (just like dvipdfmx does) and [maybe] CJK PDF outlines (I'm not
sure whether the encoding conversions should be built into pdfTeX).

I also used pdflatex to compile the same document:

http://edt1023.sayya.org/tex/tmp/utf8bks2.tar.gz

As you can see, copy, search, and paste do not work on CJK characters,
even if you use the Asian version of acroread. pdflatex also embeds
Type 1 rather than compact Type 1 fonts, so the file is larger than what
dvipdfm[x] or dvips/ps2pdf produce, which is significant for CJK documents.



Edward
Werner LEMBERG
2006-02-14 20:37:27 UTC
Post by Edward G.J. Lee
In this situation, the PDF outline will lose the 0xFEFF BOM, and it
will be UTF-8 hexadecimal if I use \texorpdfstring.
I'll fix that. That is, I'll redefine my UTF-8 macros to emit
UTF-16BE (complete with surrogates). \pdfstringdefPreHook then writes
out the BOM and switches to the redefined macros. In combination with
my CJKutf8 package this will become quite powerful, I think.
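
For readers unfamiliar with surrogates, the arithmetic involved can be sketched as follows. This is a Python illustration of the UTF-16 rule only, not the TeX macros in CJKutf8.sty; the function name `utf16_units` is made up for this sketch, and input in the surrogate range U+D800..U+DFFF is assumed not to occur.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units (as integers) for one code point."""
    if code_point < 0x10000:
        # Characters in the Basic Multilingual Plane are one 16-bit unit.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


# U+672C is a BMP ideograph; U+20000 lies outside the BMP.
print([hex(u) for u in utf16_units(0x672C)])
print([hex(u) for u in utf16_units(0x20000)])
```

In a UTF-16BE bookmark string each unit is then written high byte first, after the BOM bytes 0xFE 0xFF.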


Werner
Werner LEMBERG
2006-02-16 07:59:22 UTC
Post by Edward G.J. Lee
In this situation, the PDF outline will lose the 0xFEFF BOM, and it will
be UTF-8 hexadecimal if I use \texorpdfstring.
http://edt1023.sayya.org/tex/tmp/utf8bks0.tar.gz
Here is my solution to make hyperref directly emit UTF-16BE bookmarks.
Please note that the surrogate stuff isn't tested yet (there might be
errors which should be easy to fix nevertheless).

Attached is a simple input file CJKutf8-test.tex (using a traditional
Chinese font and T1) which can be directly processed with TeXLive
2005. The attached CJKutf8.sty should be in the same directory as
CJKutf8-test.tex so that the old version (from TeXLive) is overridden.
Finally, the attached PDF shows the result. As can be seen, the
handling of UTF8 is not restricted to the CJK ranges -- it properly
maps LaTeX macros (as defined in the .dfu files) also.


Werner
-------------- next part --------------
\documentclass[12pt]{article}

\usepackage{CJKutf8}
\usepackage[T1]{fontenc}

\usepackage[unicode]{hyperref}


\begin{document}

\begin{CJK}{UTF8}{bsmi}
\section{f\"o\v{c} ? b\`a\v{r}}

?
\end{CJK}

\end{document}

%%% Local Variables:
%%% coding: utf-8
%%% mode: latex
%%% TeX-master: t
%%% End:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CJKutf8.sty.bz2
Type: application/octet-stream
Size: 2049 bytes
Desc: not available
Url : http://tug.org/pipermail/pdftex/attachments/20060216/ba83463e/CJKutf8.sty-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CJKutf8-test.pdf
Type: application/pdf
Size: 20778 bytes
Desc: not available
Url : http://tug.org/pipermail/pdftex/attachments/20060216/ba83463e/CJKutf8-test-0001.pdf
Edward G.J. Lee
2006-02-16 17:51:53 UTC
Post by Werner LEMBERG
Here is my solution to make hyperref directly emit UTF-16BE bookmarks.
Please note that the surrogate stuff isn't tested yet (there might be
errors which should be easy to fix nevertheless).
Attached is a simple input file CJKutf8-test.tex (using a traditional
Chinese font and T1) which can be directly processed with TeXLive
2005. The attached CJKutf8.sty should be in the same directory as
CJKutf8-test.tex so that the old version (from TeXLive) is overridden.
Finally, the attached PDF shows the result. As can be seen, the
handling of UTF8 is not restricted to the CJK ranges -- it properly
maps LaTeX macros (as defined in the .dfu files) also.
Thanks for your quick response.

But I still have trouble (see attachment).



Edward
Werner LEMBERG
2006-02-16 20:43:33 UTC
Post by Edward G.J. Lee
Thanks for your quick response.
But I still have trouble (see attachment).
Aah, the input file has the wrong encoding, sorry. Here it is again,
uuencoded.


Werner

======================================================================


begin 666 CJKutf8-test.tex
M7&1O8W5M96YT8VQA<W-;,3)P=%U[87)T:6-L97T*"EQU<V5P86-K86=E>T-*
M2W5T9CA]"EQU<V5P86-K86=E6U0Q77MF;VYT96YC?0H*7'5S97!A8VMA9V5;
M=6YI8V]D95U[:'EP97)R969]"@H*7&)E9VEN>V1O8W5M96YT?0H*7&)E9VEN
M>T-*2WU[551&.'U[8G-M:7T*("!<<V5C=&EO;GMF7")OQ(***@YIRL(&+#H%QV
M>W)]?0H*("#FG*P*7&5N9'M#2DM]"@I<96YD>V1O8W5M96YT?0H*)24E($QO
M8V%L(%9A<FEA8FQE<SH*)24E(&-O9&EN9SH@=71F+3@*)24E(&UO9&4Z(&QA
?=&5X"B4E)2!***@M;6%S=&5R.B!T"B4E)2!%;F0Z"@``
`
end
Edward G.J. Lee
2006-02-16 23:33:05 UTC
Werner,
Post by Werner LEMBERG
Post by Edward G.J. Lee
Thanks for your quick response.
But I still have trouble (see attachment).
Aah, the input file has the wrong encoding, sorry. Here it is again,
uuencoded.
It still fails (see attachment).

I'm using TeXLive-20051102.


Edward
Edward G.J. Lee
2006-02-17 00:23:08 UTC
Post by Edward G.J. Lee
Werner,
Post by Werner LEMBERG
Aah, the input file has the wrong encoding, sorry. Here it is again,
uuencoded.
It still fails (see attachment).
I'm using TeXLive-20051102.
Sorry, my mistake. I forgot to use the new CJKutf8.sty.

It works very well; I will do some deeper testing. Thanks.


Edward
gnwiii at gmail.com ()
2006-02-17 00:23:16 UTC
Post by Edward G.J. Lee
Werner,
Post by Werner LEMBERG
Post by Edward G.J. Lee
Thanks for your quick response.
But I still have trouble (see attachment).
Aah, the input file has the wrong encoding, sorry. Here it is again,
uuencoded.
It still fails (see attachment).
I'm using TeXLive-20051102.
I also have TeXLive-20051102 (but with pdftex updated to
pdfeTeXk, Version 3.141592-1.30.6-2.2 (Web2C 7.5.5)). The uuencoded
attachment works for me.

--
George N. White III <***@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia
Werner LEMBERG
2006-02-17 00:25:49 UTC
Post by Edward G.J. Lee
Post by Werner LEMBERG
Aah, the input file has the wrong encoding, sorry. Here it is
again, uuencoded.
It still fails (see attachment).
I'm using TeXLive-20051102.
Me too! Very strange. Please insert

\tracingall
\tracingonline0

right before the \section line and send me the output (compressed).


Werner
Edward G.J. Lee
2006-02-18 14:32:26 UTC
Post by Werner LEMBERG
Post by Edward G.J. Lee
Thanks for your quick response.
But I still have trouble (see attachment).
Aah, the input file has the wrong encoding, sorry. Here it is again,
uuencoded.
Werner, how can I use \tableofcontents and \pdfauthor?

Thanks.


Edward
Werner LEMBERG
2006-02-19 05:42:43 UTC
Post by Edward G.J. Lee
Werner, how can I use \tableofcontents
As usual. Just make sure that you have \newpage before you close the
CJK environment so that all delayed write commands are handled
properly (this is the same as with footnotes containing CJK
characters).
Post by Edward G.J. Lee
and \pdfauthor?
I don't know a command called \pdfauthor. You probably mean the
`pdfauthor' option of hyperref, right? In my simple tests, using
\hypersetup{pdfauthor={...}} within the CJK environment works just
fine.


Werner
Werner LEMBERG
2006-03-24 16:40:52 UTC
Post by Werner LEMBERG
Here is my solution to make hyperref directly emit UTF-16BE
bookmarks. Please note that the surrogate stuff isn't tested yet
(there might be errors which should be easy to fix nevertheless).
This change to CJKutf8.sty is now in the CVS. Heiko, I'm using
\pdfstringdefPreHook to register my changed macros. Is this OK?


Werner
Heiko Oberdiek
2006-03-25 00:39:38 UTC
Post by Werner LEMBERG
Post by Werner LEMBERG
Here is my solution to make hyperref directly emit UTF-16BE
bookmarks. Please note that the surrogate stuff isn't tested yet
(there might be errors which should be easy to fix nevertheless).
This change to CJKutf8.sty is now in the CVS. Heiko, I'm using
\pdfstringdefPreHook to register my changed macros. Is this OK?
I don't know, because I cannot see here how it is used.
It is a hook, so it should not overwrite previous contents.
New additions should be appended or prepended. Also, if
it is undefined, then just define it; hyperref doesn't overwrite
the hook when it is loaded.

Yours sincerely
Heiko <***@uni-freiburg.de>
Werner LEMBERG
2006-03-25 11:07:04 UTC
I don't know, because I cannot see here how it is used. It is a
hook, so it should not overwrite previous contents. New
additions should be appended or prepended. Also, if it is undefined,
then just define it; hyperref doesn't overwrite the hook when it is
loaded.
Aah, I missed that. Now I'm doing

\ifx\pdfstringdefPreHook \undefined
\def\pdfstringdefPreHook{}
\fi
\g@addto@macro\pdfstringdefPreHook{%
\let\***@XX \***@XXpdf
\let\***@XXX \***@XXXpdf
\let\***@XXXX \***@XXXXpdf}

which is what you suggest, I think.


Werner
Heiko Oberdiek
2006-03-25 12:16:38 UTC
Permalink
Post by Werner LEMBERG
I don't know, because I cannot see here how it is used. It is a
hook, so it should not overwrite previous contents. New
additions should be appended or prepended. Also, if it is undefined,
then just define it; hyperref doesn't overwrite the hook when it is
loaded.
Aah, I missed that. Now I'm doing
Or a variation:

\@ifundefined{pdfstringdefPreHook}\def\***@addto@macro{%
\let\***@XX \***@XXpdf
\let\***@XXX \***@XXXpdf
\let\***@XXXX \***@XXXXpdf
}
Post by Werner LEMBERG
which is what you suggest, I think.
Yes.

Yours sincerely
Heiko <***@uni-freiburg.de>

Ross Moore
2006-02-23 08:23:22 UTC
Hi all,

I want to revisit this thread, but in connection with placing
mathematical symbols into bookmarks.
Post by Edward G.J. Lee
But is it possible to change the PDF outlines' encoding to UTF-16BE
via hyperref or CJK itself?
I am not a CJK expert, something for Werner.
The problem will be the recodings, I don't think someone wants to
implement something like
&Encode::from_to($char, "Big5", "UCS-2");
at TeX macro level.
hyperref offers two hooks where the outline strings can be
* \pdfstringdefPreHook: This hook is used before the
string is expanded and is mainly used for redefinitions;
Thanks for the hint.
But I don't think I can write the TeX macro to convert the encoding
to UTF16BE [yet]. :)
* \pdfstringdefPostHook#1: #1 contains the macro with the
expanded bookmark string. Thus the bookmark string
can be postprocessed.
The string resulting from this gets written to the .out file.
It's not until the next run of pdfTeX that the correct bookmarks
appear.

This means that it should also be possible to preprocess the .out
file before the next run; e.g. using

\AtEndDocument{\immediate\write18{%
<some system command or script> \jobname.out
}}


So if I want to replace the strings 'lambda', 'alpha', 'omega', etc.
by appropriate unicode representations,

a. what needs to go into the .out file ?

b. what else needs to be done ?
e.g. options to hyperref, or \hypersetup

c. what version of pdfTeX is needed ?

d. what actual font will be used in the PDF browser ?
Do I need to supply font subsets inside the .pdf file ?

Also,
Is it possible to use different typefaces ?
Can super/sub-scripts be supported in bookmarks ?
If so, how ?
Post by Edward G.J. Lee
Edward
Thanks for any advice on this kind of thing.

Ross

------------------------------------------------------------------------
Ross Moore ***@maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 +2 9850 8955
Sydney, Australia 2109 fax: +61 +2 9850 8114
------------------------------------------------------------------------
Reinhard Kotucha
2006-02-23 10:34:53 UTC
Post by Ross Moore
d. what actual font will be used in the PDF browser ?
I recently used strace to find out why it takes so much time to launch
acroread and found out that it examines the X11 font directory.

The only explanation I have for this behavior is that it tries to use
fonts provided by the operating system for bookmarks if they are not
provided by acroread itself.

However, acroread provides a reasonable number of fonts itself, and CJK
fonts can be downloaded from Adobe if needed. I assume that Adobe
provides all fonts which are needed for the menus for all languages
supported by acroread. But bookmarks can be in arbitrary languages.
Post by Ross Moore
Do I need to supply font subsets inside the .pdf file ?
Don't know whether this is possible. It would be great because it
would make documents more portable.

Regards,
Reinhard
--
----------------------------------------------------------------------------
Reinhard Kotucha Phone: +49-511-4592165
Marschnerstr. 25
D-30167 Hannover mailto:***@web.de
----------------------------------------------------------------------------
Microsoft isn't the answer. Microsoft is the question, and the answer is NO.
----------------------------------------------------------------------------
Edward G.J. Lee
2006-02-23 15:51:03 UTC
Post by Ross Moore
So if I want to replace the strings 'lambda', 'alpha', 'omega', etc.
by appropriate unicode representations,
a. what needs to go into the .out file ?
The UTF-16BE octal of 'lambda', 'alpha', 'omega'?

For capital 'Omega' (U+03A9) it would be '\003\251'. That is,

\Omega => \003\251

I'm not sure which tool (pdfTeX, hyperref, or something else)
should do this job.
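
The octal form above can be reproduced with a few lines of Python. This is only a sketch of the byte-level encoding; the helper name `pdf_octal_utf16be` is hypothetical, and it escapes every byte in octal, whereas a tool such as hyperref may write printable bytes differently.

```python
BOM = b"\xfe\xff"  # U+FEFF in UTF-16BE, i.e. octal \376\377


def pdf_octal_utf16be(text):
    """Return text as BOM + UTF-16BE bytes, each written as \\ooo."""
    data = BOM + text.encode("utf-16-be")
    return "".join("\\%03o" % byte for byte in data)


# Capital Omega, U+03A9
print(pdf_octal_utf16be("\u03a9"))
# Lowercase omega, U+03C9
print(pdf_octal_utf16be("\u03c9"))
```

The first call prints \376\377\003\251; lowercase omega (U+03C9) gives \003\311 after the BOM instead.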
Post by Ross Moore
d. what actual font will be used in the PDF browser ?
Do I need to supply font subsets inside the .pdf file ?
AFAIK, PDF browsers use the system's fonts to display PDF text
strings in the PDF outline, text annotations, and document
information.

I don't know whether a PDF browser can use a font embedded in
the PDF file when rendering the text strings in the PDF outline.

The problem is that the PDF spec allows only two encodings for
PDF text strings:

1. PDFDocEncoding (a superset of Latin-1).
2. Unicode character encoding (UTF-16BE).

So using a font embedded in the PDF file is a problem, I think.
Post by Ross Moore
Also,
Is it possible to use different typefaces ?
Can super/sub-scripts be supported in bookmarks ?
If so, how ?
If your system has a font with full Unicode support, maybe it
can do some partial 'translation', I think.
[Note] I don't mean that AR/Kpdf or pdfTeX/hyperref have this
function so far.

But maybe I'm wrong. :)


Edward
Heiko Oberdiek
2006-02-23 16:55:07 UTC
Post by Ross Moore
I want to revisit this thread, but in connection with placing
mathematical symbols into bookmarks.
\texorpdfstring, \pdfstringdefDisableCommands
Post by Ross Moore
* \pdfstringdefPostHook#1: #1 contains the macro with the
expanded bookmark string. Thus the bookmark string
can be postprocessed.
The string resulting from this gets written to the .out file.
It's not until the next run of pdfTeX that the correct bookmarks
appear.
This means that it should also be possible to preprocess the .out
file before the next run; e.g. using
\AtEndDocument{\immediate\write18{%
<some system command or script> \jobname.out
}}
So if I want to replace the strings 'lambda', 'alpha', 'omega', etc.
by appropriate unicode representations,
a. what needs to go into the .out file ?
b. what else needs to be done ?
e.g. options to hyperref, or \hypersetup
Many Greek letters are already supported, given as \text... macros.

\usepackage[unicode]{hyperref}

\pdfstringdefDisableCommands{%
\let\lambda\textlambda
\let\alpha\textalpha
\let\omega\textomega
% etc.
}

But this is not the problem with math.
Bookmarks are not typeset areas, they are just text strings.
Post by Ross Moore
c. what version of pdfTeX is needed ?
I don't see a dependency on the pdfTeX version.
Post by Ross Moore
d. what actual font will be used in the PDF browser ?
Do I need to supply font subsets inside the .pdf file ?
No, the fonts are not taken from the .pdf file but from the
system, where the pdf browser is installed.
Post by Ross Moore
Also,
Is it possible to use different typefaces ?
AFAIK you can use color or bold/italic for the whole string.
Post by Ross Moore
Can super/sub-scripts be supported in bookmarks ?
Except for a few letters (twosuperior, ...) no.

It is possible that some PDF browsers support methods of their own.
xpdf seems to use "pango" for the bookmarks, whatever this means.

Yours sincerely
Heiko <***@uni-freiburg.de>
Ross Moore
2006-03-02 10:28:57 UTC
Hi Heiko,
Post by Heiko Oberdiek
Post by Ross Moore
I want to revisit this thread, but in connection with placing
mathematical symbols into bookmarks.
\texorpdfstring, \pdfstringdefDisableCommands
Thanks for these.
My problem was not so much *where* to make alternate definitions,
but *what* these should be for Unicode strings.

Indeed, for my application it might be useful to have a macro:

\texorpdforXMLorHTMLorliteral

in which appropriate \catcode changes were made with each
variant of the macro-expansion. :-)
Post by Heiko Oberdiek
Post by Ross Moore
So if I want to replace the strings 'lambda', 'alpha', 'omega', etc.
by appropriate unicode representations,
a. what needs to go into the .out file ?
b. what else needs to be done ?
e.g. options to hyperref, or \hypersetup
Many Greek letters are already supported, given as \text... macros.
\usepackage[unicode]{hyperref}
\pdfstringdefDisableCommands{%
\let\lambda\textlambda
\let\alpha\textalpha
\let\omega\textomega
% etc.
}
OK. It's the double-octal notation used for Unicode strings
that I'd not encountered before. Thanks for the heads-up.


This works (so far) in my setting, with the following provisos:

a. the .out file more than doubles in size, which
increase occurs also in the PDF.
But this is only ~5kb increase, so no big deal really.

Presumably this could be reduced by using Unicode only for
those bookmarks that really need it.


b. the loading of puenc.def causes a macro-name clash,
with those math-authors who like to define \C
as a shorthand for \mathbb{C} or \mathcal{C}
--- easily fixed, but most annoying.

Presumably these guys never use cyrillics for Russian
or Eastern European names in bibliographies.
Post by Heiko Oberdiek
But this is not the problem with math.
Bookmarks are not typeset areas, they are just text strings.
Post by Ross Moore
c. what version of pdfTeX is needed ?
I don't see a dependency on the pdfTeX version.
OK. That's nice to know.
Post by Heiko Oberdiek
Post by Ross Moore
d. what actual font will be used in the PDF browser ?
Do I need to supply font subsets inside the .pdf file ?
No, the fonts are not taken from the .pdf file but from the
system, where the pdf browser is installed.
Yep; I got that impression from another reply.
It'd be nice to be able to de-reference a stream for this.
But if that's not in the PDF spec, then too bad.
Post by Heiko Oberdiek
Post by Ross Moore
Also,
Is it possible to use different typefaces ?
AFAIK you can use color or bold/italic for the whole string.
And you intend working on providing support for this, right ?
That's something that I could make some use of.

The Adobe document for the PDF 1.6 specs shows what is needed
for colours and faces (italic and/or bold) in bookmarks.

However, the same document actually has a logo-image in each
of its own bookmarks! How did they do that ?
Post by Heiko Oberdiek
Post by Ross Moore
Can super/sub-scripts be supported in bookmarks ?
Except for a few letters (twosuperior, ...) no.
Understandable, if a single string is all that's allowed.

However, there are raised and lowered letters in the
"Phonetic Extensions" area, and elsewhere.

I've now made use of these, to produce raised superscripts in
mathematics used for titles, etc. when it contains only:

a. letters, excluding fqzCFQSVXYZ
or b. digits 0-9
or c. symbols + - = ( )
or d. punctuation , . (i.e. comma or stop).

Similarly for subscripts, using just the characters in b. and/or c.
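
The digit and symbol cases (b. and c. above) map onto dedicated Unicode code points, which can be sketched as a lookup table. This is a Python illustration, not the TeX patches described here; the names `SUPERSCRIPTS`, `SUBSCRIPTS` and `shift_script` are invented for the sketch, and the raised letters from "Phonetic Extensions" are deliberately left out.

```python
PLAIN = "0123456789+-=()"

# Superscript digits are scattered: 1, 2, 3 live in Latin-1,
# the rest in the U+2070 block.
SUPERSCRIPTS = dict(zip(
    PLAIN,
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
    "\u207a\u207b\u207c\u207d\u207e"))

# Subscripts are contiguous: U+2080..U+2089 for digits,
# U+208A..U+208E for + - = ( ).
SUBSCRIPTS = dict(zip(PLAIN, (chr(0x2080 + i) for i in range(15))))


def shift_script(text, table):
    """Map text through table; return None if any character is unmapped."""
    if not all(ch in table for ch in text):
        return None
    return "".join(table[ch] for ch in text)
```

For example, `shift_script("2", SUPERSCRIPTS)` gives U+00B2, while `shift_script("n+1", SUPERSCRIPTS)` returns None so that the caller can fall back to plain text.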

The TeX coding to achieve this makes slight patches to some
hyperref methods:

\HyPsd@@RemoveBraces to retain markers of bracings
\HyPsd@CatcodeWarning to retain ^ and _
\HyPsd@ConvertToUnicode to allow some extra post-processing
before converting to Unicode

as well as adding new post-processing methods prior to
using \HyPsd@ConvertToUnicode :

\***@BracedSupscripts handles ^{...}
\***@falseBracePairs removes any left-over brace markers

and methods added via the \pdfstringdefPostHook :

\replaceSupAst ^* becomes just *
\replaceSupscript handles non-braced ^
\replaceSubscript handles non-braced _

as well as many macro re-definitions, via
\pdfstringdefDisableCommands .


Some examples can be seen in the attached PNG snapshot images
--- if they make it through the list-server.
These show superscripts, subscripts and some exotic math-symbols.
(The PDF browser is Apple's 'Preview'.)

Post by Heiko Oberdiek
It is possible that some PDF browsers support methods of their own.
xpdf seems to use "pango" for the bookmarks, whatever this means.
No idea.
Post by Heiko Oberdiek
Yours sincerely
Thanks, as always, for your help

Ross

------------------------------------------------------------------------
Ross Moore ***@maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 +2 9850 8955
Sydney, Australia 2109 fax: +61 +2 9850 8114
------------------------------------------------------------------------
Heiko Oberdiek
2006-03-02 15:42:49 UTC