Hi David,
> I've also been thinking further about this. It seems to me the issue is
>
>                                 ------------> output for XML
> input data                     |
>  -------------> STORAGE -------|
>                 FORMAT         |
>                                 ------------> output for BibTeX/LaTeX
>
> - input and store all data as unicode
> - during output perform needed translations
This may be a necessary tradeoff. So far, the LaTeX diehards were able
to use many LaTeX constructs in e.g. the author names or the titles
(think of italics, superscripts, or subscripts which are not uncommon
in e.g. physics or chemistry papers). This may be a pain in the neck
to search afterwards, but at least you could do it. If we follow the
simplified scheme you outline above, you can no longer use these LaTeX
hacks. I'd like to hear from the LaTeX users (I'm not one of them
currently) how important it is to include LaTeX markup in the data.
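Just to make this concrete, a made-up title field of the kind I have
in mind:

    title = {Relaxation of $^{13}$C nuclei in {\it E. coli} membranes}

A search for "13C" won't match such a field unless the markup is
stripped or translated first.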
> 1. BibTeX/LaTeX
>    # $ % & ~ _ ^ \ { }
> 2. XML
>    & < > ' "
I'm afraid there's more to it. We have to remove lots of commands like
the above-mentioned boldface, italics, superscript, subscript and
such. These commands do not make any sense in the context of
SGML/XML. We also have to translate foreign characters (\"{a}, {\ss},
and similar constructs). Part of this translation can be achieved
through tex2mail, although it does not seem to create UTF-8 (but see
below).
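The translation table itself would not be hard to write, just long.
A couple of entries to show the idea (a hypothetical excerpt in C,
with the UTF-8 replacements given as byte sequences):

    /* TeX input sequence -> UTF-8 replacement; a real table
       would need a few hundred entries */
    static const struct {
        const char *tex;
        const char *utf8;
    } textab[] = {
        { "\\\"{a}", "\xc3\xa4" },  /* a umlaut */
        { "\\\"{o}", "\xc3\xb6" },  /* o umlaut */
        { "{\\ss}",  "\xc3\x9f" },  /* sharp s  */
        { "\\'{e}",  "\xc3\xa9" },  /* e acute  */
    };

The annoying part is that people are not consistent about braces, so
the scanner would have to accept \"a, \"{a}, and {\"a} alike.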
I was under the impression that &, <, and > always have to
be replaced as these are part of the XML markup. Why and to what would
you like to convert ' and "?
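For the record, what I'd expect for XML output is just the five
predefined entities; a minimal sketch in C (not actual refdbd code):

    #include <stdio.h>

    /* Write a string with the five predefined XML entities escaped.
       & and < must always be escaped in character data (> as well in
       the "]]>" case); ' and " only matter inside attribute values. */
    static void xml_escape(const char *s, FILE *out)
    {
        for (; *s; s++) {
            switch (*s) {
            case '&':  fputs("&amp;", out);  break;
            case '<':  fputs("&lt;", out);   break;
            case '>':  fputs("&gt;", out);   break;
            case '\'': fputs("&apos;", out); break;
            case '"':  fputs("&quot;", out); break;
            default:   fputc(*s, out);       break;
            }
        }
    }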
> The question then arises as to whether any other translation is
> necessary and/or desirable. In theory no other translation is
> necessary. LaTeX can process "raw" unicode using the 'ucs' package.
This is good news. My only LaTeX book dates back to 1999, and Unicode
does not seem to be mentioned. The transformations would be so much
simpler if we didn't have to create LaTeX commands to represent foreign
or special characters.
> The XML standard states, "Legal characters are tab, carriage return,
> line feed, and the legal graphic characters of Unicode and ISO/IEC 10646".
This is pretty much what RefDB currently outputs.
> Having said that, it may be desirable to translate non-ascii characters
> into decimal numeric character references (e.g., &#226; for 'â') for
> XML or, for LaTeX, appropriate escape sequences. Perhaps this could be
> optional?
I think it is common to leave the non-ascii characters in the XML file
and use the proper charset declaration (UTF-8 by default). IMHO
character entities do not have any advantage over UTF-8. I'm not sure
about LaTeX output. How hard would it be to make the use of the ucs
package mandatory for RefDB users? Once it is installed, it is as
simple as inserting one line at the top of your document, isn't it?
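(Two lines, actually, if I read the ucs documentation correctly:

    \usepackage{ucs}
    \usepackage[utf8]{inputenc}

After that, UTF-8 encoded characters in the document body should be
processed directly. But I haven't tested this myself, so corrections
welcome.)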
> One interesting consequence of this is that author names may contain
> non-ascii characters. If, when new references are added to refdb, there
> is no citation key specified, the citekey is constructed by mangling
> primary author surname and year. If citekey is restricted to ascii
> characters then non-ascii author surname characters would have to be
> stripped or converted (e.g., ä -> a, ß -> ss).
Currently non-ascii characters are simply stripped. You always have
the option to specify a citation key explicitly when adding a
reference, using any reasonable translation of the foreign characters
to ascii.
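If we ever want conversion instead of stripping, a small lookup table
should cover the common cases. A sketch (a hypothetical helper, not
current refdbd code, and assuming Latin-1 input for simplicity; UTF-8
would have to be decoded first):

    /* ASCII replacement for a Latin-1 byte in a citation key.
       NULL means the character has no mapping and gets stripped
       as before. Only a few sample entries. */
    static const char *citekey_translit(unsigned char c)
    {
        switch (c) {
        case 0xe4: return "a";  /* a umlaut */
        case 0xf6: return "o";  /* o umlaut */
        case 0xfc: return "u";  /* u umlaut */
        case 0xdf: return "ss"; /* sharp s  */
        case 0xe9: return "e";  /* e acute  */
        default:   return NULL;
        }
    }

The surname+year mangling would then run each non-ascii byte through
this table instead of dropping it outright.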
> escapechars     converts non-ASCII (UTF-8, Latin-1 etc.) files to
>                 ASCII with XML or TeX escape sequences
> latex2utf8txt   converts LaTeX files to UTF-8 text, removes line
>                 breaks from paragraphs
Thanks for the pointers. I've downloaded these scripts and will give
them a try. If the latter works as advertised, it could be used as a
post-processing filter after bib2ris (or, if I'll ever end up having
too much time on my hands, I could reimplement bib2ris in Perl and
integrate the conversion code). The former is a bit trickier as the
conversion should run in refdbd. However, the script looks simple
enough that I might be able to recode the algorithm in C.
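The core of it is presumably little more than UTF-8 decoding plus a
formatting step; roughly this (a sketch of the idea, not the script's
actual algorithm):

    #include <stdio.h>

    /* Copy stdin to stdout, replacing multibyte UTF-8 sequences
       with decimal numeric character references (&#NNNN;).
       Handles 2- and 3-byte sequences only; a robust version
       would validate the input and cover 4-byte sequences too. */
    int main(void)
    {
        int c;
        while ((c = getchar()) != EOF) {
            if (c < 0x80) {                  /* plain ASCII */
                putchar(c);
            } else if ((c & 0xe0) == 0xc0) { /* 2-byte sequence */
                int c2 = getchar();
                printf("&#%d;", ((c & 0x1f) << 6) | (c2 & 0x3f));
            } else if ((c & 0xf0) == 0xe0) { /* 3-byte sequence */
                int c2 = getchar(), c3 = getchar();
                printf("&#%d;", ((c & 0x0f) << 12)
                                | ((c2 & 0x3f) << 6) | (c3 & 0x3f));
            }
        }
        return 0;
    }

For TeX output, the printf would be replaced by a table lookup like
the one I sketched above.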
regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de