Discussion:
[Refdb-devel] latex bibliographies with multiple databases
Markus Hoenicka
2006-07-06 21:57:25 UTC
Permalink
Post by David Nebauer
When I specify no database in the document -- with all references from
one database -- and specify that database as a runbib parameter, there
is no problem. If, however, I use the second method of specifying
'\cite{<database>-<reference>}' the process fails.
I'll set up a test case and see what happens. I recall it might be
necessary to specify a default database anyway (i.e. use the -d switch
of runbib) even if you specify a database in each citation. Does the
problem persist if you set a default database?

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-06 22:42:07 UTC
Permalink
Post by Markus Hoenicka
I'll set up a test case and see what happens. I recall it might be
necessary to specify a default database anyway (i.e. use the -d switch
of runbib) even if you specify a database in each citation. Does the
problem persist if you set a default database?
Upon checking the code I noticed that I had to disable support for
using more than one database for some reason. It seems to be related
to the fact that I wanted to use the citation key as reference without
an ID prefix or something. However, this way I can't safely separate a
database part from the citation key proper. I'll have to figure out
how I can make this work again.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-07 00:02:32 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
Upon checking the code I noticed that I had to disable support for
using more than one database for some reason.
I'll have to figure out something how I can make this work again.
This would be a *very* nice feature to have. For what it's worth,
however, forcing the use of a single database per document would simply
bring LaTeX support in line with DocBook document support.

Regards,
David.
Markus Hoenicka
2006-07-07 07:00:50 UTC
Permalink
Post by David Nebauer
This would be a *very* nice feature to have. For what it's worth,
however, forcing the use of a single database per document would simply
bring LaTeX support in line with DocBook document support.
I haven't tried lately, but the reverse might be true. Once upon a time the
DocBook/TEI code also allowed using more than one database. I'll check.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-07 23:11:55 UTC
Permalink
Post by Markus Hoenicka
I didn't try lately, but the reverse might be true. Once upon a time the
DocBook/TEI code also allowed using more than one database. I'll check.
I've checked the situation with DocBook and TEI. Both support database
names in citations using the full format. refdbxp does not support
database names, presumably because the short citation format has no
means to encode a database name.

I've fixed refdbd to support multiple databases in bibtex too.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-08 05:27:12 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
I've fixed refdbd to support multiple databases in bibtex too.
Now it is not working for me at all.

runbib is retrieving no information. Here is an annotated transcript:

------------------------------------------------------------------------------------------
>> Here are the tex-related files <<
$ ls
-rw-r--r-- 1 david david 155 2006-07-08 14:13 test.aux
-rw-r--r-- 1 david david 275 2006-07-08 14:06 test.bbl
-rw-r--r-- 1 david david 1 2006-07-08 14:22 test.bib
-rw-r--r-- 1 david david 914 2006-07-08 14:06 test.blg
-rw-r--r-- 1 david david 696 2006-07-08 14:13 test.dvi
-rw-r--r-- 1 david david 3017 2006-07-08 14:13 test.log
-rw-r--r-- 1 david david 480 2006-07-08 14:06 test.tex
-rw-r--r-- 1 david david 178 2006-07-06 15:36 test.tex~
>> Here are the references in the tex document <<
$ cat test.tex
% File: test.tex
% Created: Thu Jul 06 03:00 PM 2006 C
% Last Change: Thu Jul 06 03:00 PM 2006 C
%
\documentclass[a4paper]{article}
\usepackage{natbib}
\author{David Nebauer}
\title{Test Document}
\begin{document}
\maketitle

\section{Introduction}

This is a test of the RefDB application used in conjunction with
vim-latexsuite. Here is a reference \cite{Agnew0}. Here is another
\cite{Weckert0}.

\bibliographystyle{plainnat}
\bibliography{test}

\end{document}
>> Let me prove the references exist <<
$ refdbc -C getref -d refs_computing :CK:=Agnew0
ID*:17 (2000)
Key: Agnew0
Agnew,Grace
Government Access to Encryption Keys


999:1 retrieved:0 failed
$ refdbc -C getref -d refs_computing :CK:=Weckert0
ID*:23 (1997)
Key: Weckert0
Weckert,J.
Intellectual Property Rights and Computer Software
Business Ethics: A European Review 6(2):102-109

999:1 retrieved:0 failed
>> Let me show the test.aux file includes those references <<
$ cat test.aux
\relax
\citation{Agnew0}
\citation{Weckert0}
\bibstyle{plainnat}
\bibdata{test}
>> Here is the runbib command that returns no results <<
$ runbib -d refs_computing -S bibtex-full -t bibtex test
999:0 retrieved:0 failed
>> The corresponding refdbib command also returns nothing <<
$ refdbib -d refs_computing -S bibtex-full -t bibtex test.aux > test.bib
999:0 retrieved:0 failed
>> The test.bib file remains empty! <<
$ cat test.bib

$
------------------------------------------------------------------------------------------

I ran refdbd standalone at log setting 7. Here is the feedback
generated when running the runbib or refdbib command as above:

------------------------------------------------------------------------------------------
adding client 127.0.0.1 on fd 5
server waiting n_max_fd=5
try to read from client
serving client on fd 5 with protocol version 4
012-58-51-27
send pseudo-random string to client
parent removing client on fd 5
server waiting n_max_fd=4
gettexbib -u david -w xxxxxxxxxxxxxxxxxxxxxxxxxxx -d refs_computing -s
bibtex-full 19
dbi is up
localhost
david
daviduser
refs_computing

sqlite
/var/lib/refdb/db

refdb
connected to database server using database:
refdb
Main database looks ok:
refdb
localhost
david
daviduser
refs_computing

sqlite
/var/lib/refdb/db

refs_computing
SELECT meta_app,meta_type,meta_dbversion from t_meta
connected to database server using database:
refs_computing
command processing done, finish dialog now
child finished client on fd 5
child exited with code 0
server waiting n_max_fd=4
------------------------------------------------------------------------------------------

I'm not familiar with the 'gettexbib' command. The '19' initially
looked a little strange but I had a quick dive into refdbib.c and it
looks like that is a legitimate parameter -- the command buffer string
length.

Adding the database name to each reference as per the manual makes no
difference.

Regards,
David.

P.S. The refdb-users list is rejecting my posts sporadically with the
claim my ISP is not providing a postmaster address, so I'm copying all
my posts to your personal email address.
Markus Hoenicka
2006-07-08 09:28:20 UTC
Permalink
Post by David Nebauer
This is a test of the RefDB application used in conjunction with
vim-latexsuite. Here is a reference \cite{Agnew0}. Here is another
\cite{Weckert0}.
Before my latest patch, these citations probably did work. I implemented this
version a while ago to move to a citation syntax familiar to LaTeX users, i.e.
use the citation key in curly brackets. However, this does not allow a safe
distinction of citation keys with and without a database part. In order to
support multiple databases I had to revert this to the original format where
the citation key is prefixed with "ID" or "dbname-ID". The following is
supposed to work:

This is a test of the RefDB application used in conjunction with
vim-latexsuite. Here is a reference \cite{IDAgnew0}. Here is another
\cite{otherdb-IDWeckert0}.

I'm open for suggestions if you know a better way to safely distinguish database
names from the citation key proper.

While testing the code I came across a problem with bibliography entries which
contain ampersands. The ampersand seems to be a control character in
LaTeX/bibtex and needs to be escaped in the bibtex output. I'll look into this
shortly.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-08 11:26:48 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
In order to
support multiple databases I had to revert this to the original format where
the citation key is prefixed with "ID" or "dbname-ID".
I'm open for suggestions if you know a better way to safely distinguish database
names from the citation key proper.
I'm afraid the current scheme is unusable. It is possible for citation
keys to contain hyphens -- in fact, the default key for a hyphenated
author surname contains a hyphen. Try using a hyphenated citation key
and watch the fun. Even better, combine a database name with the
hyphenated citation key -- that introduces another hyphen. Even more fun.

refdb allows you to create databases whose names contain hyphens. I
haven't tried including a hyphenated database with a hyphenated citation
key in the one citation -- I'm too scared. Interestingly, while I can
create a database with a hyphenated name I can't delete it (at least
with an sqlite backend) -- the deletedb operation fails.

IIRC, it is illegal to include colons in citation keys. I seem to
recall they are automatically stripped out. It is currently possible to
include hyphens in database names. But if you made it illegal to
include a hyphen in a database name, that would give you a ready-made
delimiter to use in citations.

Regards,
David.
Markus Hoenicka
2006-07-08 12:14:01 UTC
Permalink
Post by David Nebauer
I'm afraid the current scheme is unusable. It is possible for citation
keys to contains hyphens -- in fact, the default key for a hyphenated
author surname contains a hyphen. Try using a hyphenated citation key
and watch the fun. Even better, combine a database name with the
hyphenated citation key -- that introduces another hyphen. Even more fun.
That's why the code does not rely on the hyphen as a separator, but on the
sequences "ID" and "-ID", which are checked for in this particular order from
left to right. Unless a citation is malformed, you can have as many hyphens or
even "-ID" sequences in your citation keys as you like:

\cite{IDMILLER-IDRUM-2005}
      **
\cite{dbname-IDMILLER-IDRUM-2005}
            ***

The '*' mark the database name prefix separator in both cases. Unless I'm dense
this is foolproof as far as citation keys are concerned. Trouble may arise when
you use database names like "IDBASE" or "DATA-IDBASE". RefDB would have to
reject these names in order to avoid trouble.
Post by David Nebauer
refdb allows you to create databases whose names contain hyphens. I
haven't tried including a hyphenated database with a hyphenated citation
key in the one citation -- I'm too scared. Interestingly, while I can
create a database with a hyphenated name I can't delete it (at least
with an sqlite backend) -- the deletedb operation fails.
I'll have to investigate this. SQLite databases are deleted on the filesystem
level by using an unlink() system call - I can't imagine why that would fail
with a hyphen in the filename.
Post by David Nebauer
IIRC, it is illegal to include colons in citation keys. I seem to
recall they are automatically stripped out. It is currently possible to
include hyphens in database names. But, if you made it illegal to
include a hyphen in a database name that gives you a ready-made
delimiter to use in citations.
Yes, colons are not allowed in IDREF attributes (xref linkend). And yes, refdbd
indeed strips out colons in citation keys to avoid creating invalid output.

The remainder of your suggestion is less clear to me. If I understand correctly,
you suggest using the hyphen under the assumption that database names never
have hyphens (this could indeed be enforced). But how do you distinguish
between a citation key prefixed with a database name and a sole citation key
containing a hyphen? As in:

dbname-citekey
cite-key

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-08 14:21:46 UTC
Permalink
Post by David Nebauer
IIRC, it is illegal to include colons in citation keys. I seem to
recall they are automatically stripped out. It is currently possible
to include colons in database names. But, if you made it illegal to
include a colon in a database name that gives you a ready-made
delimiter to use in citations.
I think that suggestion makes more sense.
Well, if I understand *that* correctly, you suggest using these forms in LaTeX
documents:

\cite{citekey}
\cite{dbname:citekey}

This is more compact than the "-ID" kludge that I currently use. The only
downside, if at all, is that this citation syntax is different from the one
used in the full style in SGML/XML documents (where, as noted previously, we
must not use a colon). However, this might only confuse those who work with
both LaTeX and SGML/XML. Unless someone else has objections, I'll implement
your suggestion shortly.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-08 22:22:35 UTC
Permalink
Post by Markus Hoenicka
\cite{citekey}
\cite{dbname:citekey}
Just to let y'all know that the current Subversion version supports
the above mentioned citation format in LaTeX documents.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-12 07:58:00 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
Post by Markus Hoenicka
\cite{citekey}
\cite{dbname:citekey}
Just to let y'all know that the current Subversion version supports
the above mentioned citation format in LaTeX documents.
I'm experiencing all kinds of difficulty using the latest svn refdb
build with LaTeX/BibTeX. 'runbib' will not extract records in BibTeX
format unless citations are in the previous '\cite{[dbname-]IDcitekey}'
format. Using the new citation format on my system results in '999:0
retrieved:0 failed'. Would you mind checking on your system? If the
new format works for you it must be a problem at my end and I'll work up
a test case.

Regards,
David.
Markus Hoenicka
2006-07-12 12:13:06 UTC
Permalink
Hi David,
Post by David Nebauer
I'm experiencing all kinds of difficulty using the latest svn refdb
build with LaTeX/BibTeX. 'runbib' will not extract records in BibTeX
format unless citations are in the previous '\cite{[dbname-]IDcitekey}'
format. Using the new citation format on my system results in '999:0
retrieved:0 failed'. Would you mind checking on your system? If the
new format works for you it must be a problem at my end and I'll work up
a test case.
I hardly dare to ask, but are you sure you installed the svn version and
restarted refdbd? I just checked the svn code, and all changes are in place.
The current svn code certainly does not look for "-ID" but for ":" as a
database separator. I'm sure that the new format works on my FreeBSD box.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-12 14:20:02 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
Post by David Nebauer
I'm experiencing all kinds of difficulty using the latest svn refdb
build with LaTeX/BibTeX. 'runbib' will not extract records in BibTeX
format unless citations are in the previous '\cite{[dbname-]IDcitekey}'
format.
I hardly dare to ask, but are you sure you installed the svn version and
restarted refdbd?
I install from custom deb packages so refdbd is stopped and started as
part of the debian package upgrade process. My source tree is at
revision 81.

I checked everything, including the source code changes you made at
revision 72 to alter the citation format. Finally I remembered some
advice you gave recently about checking for multiple running instances
of refdbd. Sure enough, I had an extra instance running, probably from
a debugging exercise where I was running refdbd in standalone mode.
Once it was stopped, the old behaviour went away.

Problem solved (he says sheepishly).

FWIW, I can confirm the new citation format is working correctly.

Regards,
David.
Markus Hoenicka
2006-07-08 22:19:31 UTC
Permalink
Post by David Nebauer
There's something else. You may recall some time ago all the trouble
taken to ensure entities such as &mdash; and &amp; are preserved in
database reference entries and subsequently then preserved throughout
DocBook processing. Many of my references include entities in document
titles. Well, those entities are now appearing in the bibtex entries
created by runbib. As you noted, the raw ampersands choke LaTeX. Is
there any way of converting those xml-safe entities to LaTeX equivalents
as runbib exports them? In the case of '&mdash;' that would be '---'.
I thought about this a bit more. I'm afraid this is going to get far
more complex than I thought in the first place. We need to:

- replace XML entities that stem from risx documents or which were
deliberately used in RIS data. E.g. '&mdash;' -> '---'

- backslash-escape LaTeX command characters unless, and that's the
catch, they are used as LaTeX commands. A LaTeX-only user may
rightfully expect e.g. author names like 'H\"{a}{\ss}ler' (as
imported from a bibtex file) to be processed correctly, or
e.g. '{\bf emphasized}' words in titles. refdbd would have to
acquire a thorough knowledge of LaTeX commands to cope with this.

- translate foreign letters and letters with diacritics to their TeX
equivalents from, and that's the catch here, any supported character
encoding. The same TeX representation of such a letter may be
encoded as a variety of one to three-byte sequences in different
uni- or multibyte character sets.

Is anyone aware of a library or a tool that implements these
transformations? I'm only aware of tex2mail which does the reverse.
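As a very rough illustration of the first two points, a naive post-processing pass over the generated .bib file might look like this (a minimal sketch only; it assumes the entities appear literally in the bibtex output, the output file name is arbitrary, and it deliberately ignores the "already valid LaTeX" problem described above, which a real implementation in refdbd would have to solve):

$ sed -e 's/&mdash;/---/g' -e 's/&amp;/\&/g' -e 's/[&#$%_]/\\&/g' test.bib > test-safe.bib

The first two expressions translate the entities, the last one simply backslash-escapes the remaining command characters (and would of course clobber any intentional LaTeX markup).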

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-10 21:31:09 UTC
Permalink
Hi David,
Post by David Nebauer
I've also been thinking further about this. It seems to me the issue is:

                                           ------------> output for XML
                                          |
input data -------------> STORAGE --------|
                           FORMAT         |
                                           ------------> output for BibTeX/LaTeX

- input and store all data as unicode
- during output perform needed translations
This may be a necessary tradeoff. So far, the LaTeX diehards were able
to use many LaTeX constructs in e.g. the author names or the titles
(think of italics, superscripts, or subscripts which are not uncommon
in e.g. physics or chemistry papers). This may be a pain in the neck
to search afterwards, but at least you could do it. If we follow the
simplified scheme you outline above, you can no longer use these LaTeX
hacks. I'd like to hear from the LaTeX users (I'm not one of them
currently) how important it is to include LaTeX markup into the data.
Post by David Nebauer
1. BibTeX/LaTeX
# $ % & ~ _ ^ \ { }
I'm afraid there's more to it. We have to remove lots of commands like
the above mentioned boldface, italics, superscript, subscript and
such. These commands do not make any sense in the context of
SGML/XML. We also have to translate foreign characters (\"{a}, {\ss}, and
similar constructs). Part of this translation can be achieved through
tex2mail, although it does not seem to create UTF-8 (but see below).
Post by David Nebauer
2. XML/SGML
& <
' "
I was under the impression that &, <, and > always have to be replaced
(with &amp;, &lt;, and &gt;) as these characters are part of the XML markup.
Why and to what would you like to convert ' and "?
The question then arises as to whether any other translation is
necessary and/or desirable. In theory no other translation is
necessary. LaTeX can process "raw" unicode using the 'ucs' package.
This is good news. My only LaTeX book dates back to 1999, and Unicode
does not seem to be mentioned. The transformations would be so much
simpler if we didn't have to create LaTeX commands to represent foreign
or special characters.
The XML standard states, "Legal characters are tab, carriage return,
line feed, and the legal graphic characters of Unicode and ISO/IEC 10646".
This is pretty much what RefDB currently outputs.
Having said that, it may be desirable to translate non-ascii characters
into decimal numeric character references (e.g., '&#226;') for XML or,
for LaTeX, appropriate escape sequences. Perhaps this could be optional?
I think it is common to leave the non-ascii characters in the xml file
and use the proper charset declaration (UTF-8 by default). IMHO
character entities do not have any advantage over UTF-8. I'm not sure
about LaTeX output. How hard is it to make the use of the ucs package
mandatory for RefDB users? Once it is installed, it is as simple as
inserting one line at the top of your document, isn't it?
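For reference, the usual incantation appears to be two preamble lines rather than one (a sketch of the documented ucs usage; the exact inputenc option name may differ between package versions):

\usepackage{ucs}
\usepackage[utf8x]{inputenc}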
One interesting consequence of this is that author names may contain
non-ascii characters. If, when new references are added to refdb, there
is no citation key specified, the citekey is constructed by mangling
primary author surname and year. If citekey is restricted to ascii
characters then non-ascii author surname characters would have to be
stripped or converted (e.g., ä -> a, ß -> ss).
Currently non-ascii characters are simply stripped. You always have
the option to specify a citation key explicitly when adding a
reference, using any reasonable translation of the foreign characters
to ascii.
Post by David Nebauer
escapechars     converts non-ASCII (UTF-8, Latin-1 etc.) files to ASCII
                with XML or TeX escape sequences
latex2utf8txt   converts LaTeX files to UTF-8 text, removes line breaks
                from paragraphs
Thanks for the pointers. I've downloaded these scripts and will give
them a try. If the latter works as advertised, it could be used as a
post-processing filter after bib2ris (or, if I ever end up having
too much time on my hands, I could reimplement bib2ris in Perl and
integrate the conversion code). The former is a bit trickier as the
conversion should run in refdbd. However, the script looks simple
enough that I might be able to recode the algorithm in C.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-11 15:53:55 UTC
Permalink
Hi David,
If you instead go with my idea to store as unicode you don't need to
know anything about the eventual output format when you store the
reference. Indeed, the user doesn't have to know at that time. The
same references can be used for either DocBook or LaTeX. You can easily
add in other output formats later and all you have to do is write
another output filter.
I'm all with you here. I just wanted to get opinions from real-world LaTeX users
whether or not it makes sense to preserve the markup.
Your point is true but I say it is a small loss. Using LaTeX formatting
codes means your references can never be used for any other format
without hacking in some kind of conversion. RefDB is designed to be a
long-term reference database enabling the contained references to be
used all kinds of interesting ways. Use of format-specific markup
limits your future choices. As a minor example it prevents their use in
DocBook documents.
True, but I assumed that only those who use RefDB solely for LaTeX might
want to keep the markup.
Another issue is the ability of library and indexing systems to handle
such formatting complexities as superscripting, subscripting and font
changes. You know far more about such things than I, but I would guess
even the most complex article title is reduced to canonical ascii for
storage in many cataloguing systems. I presume the algorithms for such
simplification are fairly predictable. Anyone searching for the journal
article by title would be easily able to predict the stored character
sequence. I would endeavour to suggest the simplified form of title
would be entirely acceptable in any kind of bibliography.
In any event, how would such a complex title be stored in plain ascii?
Or Unicode? Or even XML (imagine the attempt to use MathML in a title
string!)?
The database which I use mostly (www.pubmed.org) indeed "ascii-izes" the titles.
The tagged format uses plain ASCII with a pretty crude transliteration, whereas
the XML format uses Unicode.
As mentioned above, I am unconvinced about the utility of keeping
boldface, italics, superscript and subscript-type markup. As for
foreign characters, almost any foreign character can be represented in Unicode.
I'm afraid I didn't express my thoughts very well here. What I was talking about
is that a reference imported from bibtex may contain markup like

"Title with an {\bf emphasized} word"

It is not sufficient to escape characters but we have to remove the "{\bf " and
the "}" sequences before we import the reference. This is what one of the
scripts that you pointed me to as well as tex2mail do.
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as
"&apos;", and the double-quote character (") as "&quot;".
The relevant portion states, "The right angle bracket (>) *may* be
represented using '&gt;', and *must*, for compatibility, be escaped when
it appears in the string ']]>' in content." (emphases mine)
The last paragraph in the quote refers to straight single and double
quotation mark entities.
But it appears to talk about attribute values. XML output from RefDB never puts
quotes into attribute values, so we're left with &,<,>.
It worked for me "out of the box". I installed the 'ucs' package
(apt-get install latex-ucs), added those two lines to the preamble, ran
'latex test' and, presto, gloriously rendered unicode.
This is great news indeed. I will have to mention this in the manual.

I take from this discussion:

1) Use a bib2ris post-processing script (or rewrite bib2ris to contain such
code) which strips markup like boldface, superscript etc. and translates
foreign characters entered as LaTeX constructs to their Unicode equivalents.

2) Modify the code to prevent XML entities from showing up in LaTeX output.

3) Add code to escape the LaTeX command characters in the LaTeX output.

The second point is a bit tricky. References imported from RIS usually do not
contain entities, but references imported from risx are likely to. Either I
convert these entities during import, or I remove them during LaTeX export. The
former seems cleaner to me, and I think this is what you had in mind.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-11 16:37:54 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
1) Use a bib2ris post-processing script (or rewrite bib2ris to contain such
code) which strips markup like boldface, superscript etc. and translates
foreign characters entered as LaTeX constructs to their Unicode equivalents.
2) Modify the code to prevent XML entities to show up in LaTeX output.
3) Add code to escape the LaTeX command characters in the LaTeX output.
The second point is a bit tricky. References imported from RIS usually do not
contain entities, but references imported from risx are likely to do. Either I
convert these entities during import, or I remove them during LaTeX export. The
former seems cleaner to me, and I think this is what you had in mind.
Yes, in my view the storage format is Unicode without markup:


BibTeX -------                    ---------> DocBook
              |                  |
              |                  |
RIS ----------+--> STORAGE ------|
              |    (Unicode)     |
              |                  |
RISX ---------                    ---------> LaTeX



Whatever the input format, all references end up in the same storage
format (Unicode sans markup). This would require stripping out XML
entities and LaTeX markup. With luck you can use existing tools to do
this. The stored references can then be output in either DocBook- or
LaTeX-compatible format. This seems to me to be an elegant way of
dealing with the mishmash of input and output formats.

Regards,
David.
David Nebauer
2006-07-11 19:52:26 UTC
Permalink
Hi Damien,
Post by Damien Jade Duff
I'd like to see the following kind of latex markup reproduced (not
stripped entirely):
TITLE = {Why {AM} and {EURISKO} appear to work},
Alternately, you could save the title in plain Unicode with that
capitalisation and refdb's BibTeX output filter would wrap any
abnormally capitalised words in braces. That way your reference can be
used in either DocBook XML/SGML or LaTeX output.

Regards,
David.
Markus Hoenicka
2006-07-11 20:52:17 UTC
Permalink
Post by David Nebauer
TITLE = {Why {AM} and {EURISKO} appear to work},
Alternately, you could save the title in plain Unicode with that
capitalisation and refdb's BibTeX output filter would wrap any
abnormally capitalised words in braces. That way your reference can be
used in either DocBook XML/SGML or LaTeX output.
If it is a matter of wrapping uppercased words in curly brackets, this
is certainly doable. Is it likely to have uppercased words in bibtex data
which are *not* supposed to be rendered in all-caps?
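A minimal sketch of such a wrapping pass, assuming GNU sed (it naively wraps every run of two or more capital letters, which is exactly why the question above matters):

$ echo 'Why AM and EURISKO appear to work' | sed -E 's/\b[A-Z]{2,}\b/{&}/g'
Why {AM} and {EURISKO} appear to work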

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Damien Jade Duff
2006-07-12 11:27:54 UTC
Permalink
Post by Markus Hoenicka
Post by David Nebauer
TITLE = {Why {AM} and {EURISKO} appear to work},
Alternately, you could save the title in plain Unicode with that
capitalisation and refdb's BibTeX output filter would wrap any
abnormally capitalised words in braces. That way your reference can be
used in either DocBook XML/SGML or LaTeX output.
If it is a matter of wrapping uppercased words in curly brackets, this
certainly doable. Is it likely to have uppercased words in bibtex data
which are *not* supposed to be rendered in all-caps?
regards,
Markus
Gidday

I imagine only where the style demands it (e.g. APA). Some journals have
titles with pretty much everything (both proper and non-proper nouns and
subitles and acronyms) capitalised. On the other hand, APA requires
nouns that aren't proper to go lower case when citing - proper nouns and
subitles can be uppercase.

I don't know how this is managed in Docbook styles or Endnote etc, but
presumably titles are used verbatim because I can't imagine any logic
funky enough to figure out whether a word is a proper noun or not
without some extra markup.

If we're going to export all capitals to bibtex verbatim sans markup,
then I think we might as well, as a first approximation, just set the
whole title as verbatim rather than trying to second-guess bibtex - e.g.

TITLE = {{Why AM and EURISKO appear to work}},

Folks who use latex would probably have to turn non-proper nouns that
are capitalised into lowercase before putting them into RefDB and hope
to seldom encounter a citation style that requires these things to be
reproduced verbatim from the original article (I don't think I have yet).

The same may possibly apply to BOOKTITLE, JOURNAL, SERIES, SCHOOL,
PUBLISHER, INSTITUTION etc.

My 2nd pennies worth.

Regards
Damien
Markus Hoenicka
2006-07-12 12:27:54 UTC
Permalink
Post by Damien Jade Duff
trying to second guess bibtex, just set the whole title as verbatim - e.g.
TITLE = {{Why AM and EURISKO appear to work}},
Folks who use latex would probably have to turn non-proper nouns that
are capitalised into lowercase before putting them in to RefDB and hope
to seldom encounter a citation style that requires these things to be
reproduced verbatim from the original article (I don't think I have yet).
RefDB uses bibliography styles even for generating LaTeX bibliographies. These
styles currently only take care of the capitalization of titles. IIRC you get
the best results if you add your data using mixed-case and then pick the proper
style for output. If we combine that with the curly brackets as shown above, you
should be able to generate at least lowercased (except the first char),
all-caps, and mixed case output.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Damien Jade Duff
2006-07-12 13:10:17 UTC
Permalink
Post by Markus Hoenicka
RefDB uses bibliography styles even for generating LaTeX bibliographies.
Oh. Right!
Post by Markus Hoenicka
should be able to generate at least lowercased (except the first char),
I don't think I'd be able to use that because presumably it would
lowercase proper nouns, acronyms and subtitle headings.
Post by Markus Hoenicka
all-caps, and mixed case output.
The latter is fine except, as before, when your article title
capitalises verbs and improper nouns. I'm happy to do it this way
though, I'll just make sure no capitalised verbs and improper nouns get
into my database and hope not to encounter citation styles that require
them.

Cheers
Damien
David Nebauer
2006-07-30 05:52:10 UTC
Permalink
Post by Damien Jade Duff
Post by Markus Hoenicka
all-caps, and mixed case output.
The latter is fine except, as before, when your article title
capitalises verbs and improper nouns. I'm happy to do it this way
though, I'll just make sure no capitalised verbs and improper nouns get
into my database and hope not to encounter citation styles that require
them.
This has been bothering me. If the default model for encoding is:

XML LaTeX
| |
| |
INPUT unicode unicode
\ /
\ /
\ /
\ /
v
STORAGE unicode
/ \
/ \
/ \
/ \
OUTPUT convert convert
| |
| |
v v
XML LaTeX


There will undoubtedly be users like Damien who make the choice to
include markup in their bibliographic data. Although it effectively
traps them into one output format, the trade-off is greater control over
how that data is eventually displayed.

Would it not be simple, given the above model, to have a command line
switch for runbib and refdbib that skips the conversion step? That seems
to me an easy way to accommodate the wishes of everybody.
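Something along these lines, say (the switch name below is made up purely to illustrate the proposal; no such option exists yet):

$ runbib -d refs_computing -S bibtex-full -t bibtex --no-convert test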

Regards,
David.
David Nebauer
2006-07-30 21:31:11 UTC
Permalink
Post by David Nebauer
XML LaTeX
| |
| |
INPUT unicode unicode
\ /
\ /
\ /
\ /
v
STORAGE unicode
/ \
/ \
/ \
/ \
OUTPUT convert convert
| |
| |
v v
XML LaTeX
Spaces were stripped. That should look like:

...........XML......LaTeX
............|.........|
............|.........|
INPUT.....unicode.unicode
.............\......./
..............\...../
...............\.../
................\./
.................v
STORAGE.......unicode
................/.\
.............../...\
............../.....\
............./.......\
OUTPUT....convert.convert
............|.........|
............|.........|
............v.........v
...........XML......LaTeX

Regards,
David.
Damien Jade Duff
2006-08-16 15:49:37 UTC
Permalink
Gidday gidday

As for entering unicode via jedit, there is a graphical plugin for it:
http://plugins.jedit.org/plugins/?CharacterMap
But I haven't been able to get it to work with unicode and it seems that
the current implementation is a tad broken:
http://community.jedit.org/?q=node/view/1628&pollresults%5B1252%5D=1
We have to wait for a fix to be uploaded.

The good thing about using unicode as storage is that it's a blank slate.
That means that if the community decides that some kind of markup is in
fact useful, the maintainer(s!) can choose to support it at a later point
via whatever method they like, and in theory they(he!) know(s) exactly
what kind of data is in the database at any given time: i.e. more control
over what is coming in and out. I wouldn't be surprised if the
required-capitalisation markup was asked for at some time in the future,
when a user wants to use latex+docbook but exploit the latex markup, but
it can be done from a unicode starting point anyway. My 8.2 cents worth.

Incidentally, how do docbook users deal with this capitalisation issue?
i.e. capitalisation of acronyms and proper nouns versus subtitles and
improper nouns. Anyone?

Peace
Damien
Post by David Nebauer
Post by Damien Jade Duff
Post by Markus Hoenicka
all-caps, and mixed case output.
The latter is fine except, as before, when your article title
capitalises verbs and improper nouns. I'm happy to do it this way
though, I'll just make sure no capitalised verbs and improper nouns get
into my database and hope not to encounter citation styles that require
them.
...........XML......LaTeX
............|.........|
............|.........|
INPUT.....unicode.unicode
.............\......./
..............\...../
...............\.../
................\./
.................v
STORAGE.......unicode
................/.\
.............../...\
............../.....\
............./.......\
OUTPUT....convert.convert
............|.........|
............|.........|
............v.........v
There will undoubtedly be users like Damien who make the choice to
include markup in their bibliographic data. Although it effectively
traps them into one output format the trade-off is greater control over
how that data is eventually displayed.
Would not it be simple, given the above model, to have a command line
switch for runbib and refdbib that skips the conversion step? That seems
to me an easy way to accommodate the wishes of everybody.
Regards,
David.
David Nebauer
2006-08-17 08:54:27 UTC
Permalink
Hi Damien,
Post by Damien Jade Duff
Incidentally, how do docbook users deal with this capitalisation
issue? i.e. capitalisation of acronyms and proper nouns versus of
subtitles and improper nouns. Anyone?
The bibliography style options for TITLE are:
  case:  ASIS ICAPS LOWER UPPER
  style: BOLD BOLDITALIC BOLDITULINE BOLDULINE ITALIC ITULINE NONE SUB SUPER ULINE

As you can see from the case options your only choice for preserving
mixed case is to use ASIS, which of course means "as is". In that
situation it is up to the person inputting the reference in the first
place to get the case right.

Does that answer your question?

Regards,
David.
Damien Jade Duff
2006-08-17 14:22:31 UTC
Permalink
Post by David Nebauer
Post by Damien Jade Duff
Incidentally, how do docbook users deal with this capitalisation
issue? i.e. capitalisation of acronyms and proper nouns versus of
subtitles and improper nouns. Anyone?
The bibliography style options for TITLE are
case: ASIS ICAPS LOWER UPPER
style: BOLD BOLDITALIC BOLDITULINE BOLDULINE ITALIC ITULINE NONE SUB
SUPER ULINE
As you can see from the case options your only choice for preserving
mixed case is to use ASIS, which of course means "as is". In that
situation it is up to the person inputting the reference in the first
place to get the case right.
Does that answer your question?
Yes, thank you. Since Docbook users get by without this extra markup I
think we latex users should be able to get by without it too. I have no
complaints. Though it seems logically possible, I doubt I will ever
actually find an instance where anything more complicated than the above
options is required.
Peace
Damien

Damien Jade Duff
2006-07-11 17:57:28 UTC
Permalink
Gidday

As a novice latex user it seems okay; I can't envisage any difficulties.

Probably jumping the gun a bit here, but I'd like to see the following
kind of latex markup reproduced (not stripped entirely):

BIBTEX:
TITLE = {Why {AM} and {EURISKO} appear to work},

Current RIS:
TI - Why {AM} and {EURISKO} appear to work

Current RISX:
<title type="full">Why {AM} and {EURISKO} appear to work</title>

The reason is that bibtex will capitalise your bibliography according to
the current bibliography style (generally with a single leading
capital), and you need to inform bibtex that certain passages are to be
formatted verbatim.

Other than that, I can't envisage the need for keeping latex markup -
I've been manually stripping everything else. I enter most of my
references using risx (I use bibtex and RIS where provided but usually
edit them manually before submitting them to RefDB), and I don't appear
to have any entites coming back (probably because I don't know how to
use entities).
Except I think entities are used in some of the URLs - I don't know if
they're necessary or not, they're just there.

Peace
Damien
Post by David Nebauer
Hi Markus,
Post by Markus Hoenicka
1) Use a bib2ris post-processing script (or rewrite bib2ris to contain such
code) which strips markup like boldface, superscript etc. and translates
foreign characters entered as LaTeX constructs to their Unicode equivalents.
2) Modify the code to prevent XML entities to show up in LaTeX output.
3) Add code to escape the LaTeX command characters in the LaTeX output.
The second point is a bit tricky. References imported from RIS usually do not
contain entities, but references imported from risx are likely to do. Either I
convert these entities during import, or I remove them during LaTeX export. The
former seems cleaner to me, and I think this is what you had in mind.
BibTeX -------                    ---------> DocBook
              |                  |
              |                  |
RIS ----------+--> STORAGE ------|
              |    (Unicode)     |
              |                  |
RISX ---------                    ---------> LaTeX
Whatever the input format, all references end up in the same storage
format (Unicode sans markup). This would require stripping out XML
entities and LaTeX markup. With luck you can use existing tools to do
this. The stored references can then be output in either DocBook- or
LaTeX-compatible format. This seems to me to be an elegant way of
dealing with the mishmash of input and output formats.
Regards,
David.
Markus Hoenicka
2006-07-19 20:35:59 UTC
Permalink
Hi David,
Post by David Nebauer
BibTeX -------                    ---------> DocBook
              |                  |
              |                  |
RIS ----------+--> STORAGE ------|
              |    (Unicode)     |
              |                  |
RISX ---------                    ---------> LaTeX
I've done a little source code reading and testing in order to find
out how RefDB mangles these kinds of input and output data. My results
are as follows:

1) BibTeX input
bib2ris appears to work ok with UTF-8 encoded bibtex data. You can
import the resulting RIS data as long as the input encoding is set to
UTF-8 (the current default is ISO-8859-1, but it certainly makes sense
to change that). If your bibtex data is plain ASCII with foreign and
special characters encoded as LaTeX commands, the bib2ris output
should be sent through the new refdb_latex2utf8txt script. I don't
know whether it really has 100% coverage of the character-related
LaTeX commands, but it is easy to extend if the need arises. With this in
mind we can import bibtex data as plain Unicode.

2) RIS input
We'd have to educate users to author their RIS datasets in UTF-8, and
to run RIS data from web sources (like Pubmed) through iconv before
adding them to RefDB. All it takes is to set the default input
encoding of refdbd for RIS data to UTF-8 (see above). Currently there
are no provisions to translate entities or LaTeX commands, but if used
correctly there should be no need to use such hacks. The result is, as
above, plain Unicode.

3) risx input
I've rediscovered a nice feature of expat (which refdbd
uses to parse all incoming XML data). The output data of expat are
always UTF-8, with all entities expanded to their Unicode
equivalents. Thus no extra conversion step is required to get rid of
entities and to store plain Unicode.

4) SGML/XML output (bibliographies, db31/tei/html backends)
"<>&" are replaced with their corresponding entities. In addition, the
current code contains replacements for &mdash; &lsquo; and &rsquo;. I
know that I was asked to add these, but I can't remember the
context. I wonder whether it would make more sense to keep these
characters as Unicode.

5) LaTeX output
There are currently no attempts to escape LaTeX command
characters. I'm about to add this code.

6) other output (RIS, screen)
No replacements. If you retrieve data as UTF-8, you'll get what you
want.


As always, I might have missed some border cases. If you experience a
different behaviour, please let me know.

One thing that should be discussed is how easy it is for RefDB users
to author UTF-8 data, be it RIS, bibtex, or XML. You can always insert
the numeric form into XML data (e.g. &#x00B1;) but I'm afraid this
won't work for the other data formats. As an Emacs user I've got Norm
Walsh's xmlunicode.el (http://nwalsh.com/emacs/xmlchars/) which allows
you to select characters from a pop-up list or from the minibuffer with
entity-name completion, and which also defines an input mode which
offers on-the-fly replacement of entities. Is there similar support
available for other editors (vim, jedit) which should be mentioned in
the manual?

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-19 20:41:50 UTC
Permalink
Post by Markus Hoenicka
2) RIS input
We'd have to educate users to author their RIS datasets in UTF-8, and
to run RIS data from web sources (like Pubmed) through iconv before
adding them to RefDB. All it takes is to set the default input
encoding of refdbd for RIS data to UTF-8 (see above). Currently there
are no provisions to translate entities or LaTeX commands, but if used
correctly there should be no need to use such hacks. The result is, as
above, plain Unicode.
Actually we can still use ISO-8859-1 or whatever as the RIS input
format as refdbd internally converts it to UTF-8 if the database uses
this encoding. Forcing UTF-8 for RIS data actually only makes sense if
people use both bibtex and RIS data.
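For those who do want to normalise RIS data fetched from the web before adding it (as suggested for Pubmed data earlier in the thread), a plain iconv call is enough; a minimal sketch with made-up file names, assuming the downloaded file is Latin-1:

$ iconv -f ISO-8859-1 -t UTF-8 pubmed.ris > pubmed-utf8.ris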

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-24 10:40:22 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
Actually we can still use ISO-8859-1 or whatever as the RIS input
format as refdbd internally converts it to UTF-8 if the database uses
this encoding. Forcing UTF-8 for RIS data actually makes only sense if
people use both bibtex and RIS data.
The advantage of making UTF-8 the default encoding at each step in the
life cycle is you don't have to remember when to specify UTF-8 encoding
and when not to. Still, the most important thing is making sure to
document clearly what the default encoding is at each step so the user
knows what is happening.

Regards,
David.
Markus Hoenicka
2006-07-24 10:56:55 UTC
Permalink
Post by David Nebauer
Hi Markus,
Post by Markus Hoenicka
Actually we can still use ISO-8859-1 or whatever as the RIS input
format as refdbd internally converts it to UTF-8 if the database uses
this encoding. Forcing UTF-8 for RIS data actually makes only sense if
people use both bibtex and RIS data.
The advantage of making UTF-8 the default encoding at each step in the
life cycle is you don't have to remember when to specify UTF-8 encoding
and when not to. Still, the most important thing is making sure to
document clearly what the default encoding is at each step so the user
knows what is happening.
This is approximately what I intended to say, but I guess I was too tired to be
as clear as I should have been. I plan to set all defaults to UTF-8, but users are still
free to configure their systems differently if they have a good reason.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-24 10:36:30 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
1) BibTeX input
[W]e can import bibtex data as plain Unicode.
2) RIS input
All it takes is to set the default input
encoding of refdbd for RIS data to UTF-8.
3) risx input
[N]o extra conversion step is required to get rid of
entities and to store plain Unicode.
All looks too easy so far.
Post by Markus Hoenicka
4) SGML/XML output (bibliographies, db31/tei/html backends)
"<>&" are replaced with their corresponding entities. In addition, the
current code contains replacements for &mdash; &lsquo; and &rsquo;. I
know that I was asked to add these, but I can't remember the
context. I wonder whether it would make more sense to keep these
characters as Unicode.
I fear I may have been partly the cause. I was using refdb purely for
docbook and since I didn't use UTF-8 encoding for my references the only
way to preserve characters like em dash was to protect them as entities
throughout the reference's life cycle. They were not only protected
during output, as you mention above, but were protected at input also.
In moving to a more sensible unicode-based system, however, it no longer
makes any sense to replace those characters with entities.
Post by Markus Hoenicka
5) LaTeX output
There are currently no attempts to escape LaTeX command
characters. I'm about to add this code.
I see this code arrived today.
Post by Markus Hoenicka
6) other output (RIS, screen)
No replacements. If you retrieve data as UTF-8, you'll get what you want.
One thing that should be discussed is how easy it is for RefDB users
to author UTF-8 data, be it RIS, bibtex, or XML. Is there [Unicode input] support
available for other editors (vim, jedit) which should be mentioned in
the manual?
Many unicode characters (and certainly all the commonly used ones) are
entered by means of digraphs (using two or more keystrokes to specify
one character). The mnemonics for these are fairly intuitive, like 'a:'
for a-umlaut. Any unicode character can be entered with 'Ctrl-v uxxxx'
where 'xxxx' is the character code.
Regards,
David.
Markus Hoenicka
2006-07-24 11:04:15 UTC
Permalink
Post by David Nebauer
Post by Markus Hoenicka
4) SGML/XML output (bibliographies, db31/tei/html backends)
"<>&" are replaced with their corresponding entities. In addition, the
current code contains replacements for &mdash; &lsquo; and &rsquo;. I
know that I was asked to add these, but I can't remember the
context. I wonder whether it would make more sense to keep these
characters as Unicode.
I fear I may have partly the cause. I was using refdb purely for
docbook and since I didn't use UTF-8 encoding for my references the only
way to preserve characters like em dash was to protect them as entities
throughout the reference's life cycle. They were not only protected
during output, as you mention above, but were protected at input also.
In moving to a more sensible unicode-based system, however, it no longer
makes any sense to replace those characters with entities.
I see. Then I'll remove these entities again.
Post by David Nebauer
Post by Markus Hoenicka
5) LaTeX output
There are currently no attempts to escape LaTeX command
characters. I'm about to add this code.
I see this code arrived today.
Yes. I didn't get round to announcing it, but please give it a real-world test to
see whether it works ok.
Post by David Nebauer
Many unicode characters (and certainly all the commonly used ones) are
entered by means of digraphs (using two or more keystrokes to specify
one character). The mnemonics for these are fairly intuitive, like 'a:'
for a-umlaut. Any unicode character can be entered with 'Ctrl-v uxxxx'
where 'xxxx' is the character code.
Is there a link to some doc that explains this, by any chance? I thought about
adding something like a tip box to the docs that briefly explains how to deal
with Unicode characters for the most popular editors. I'd like to add URLs for
further information.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-24 15:38:59 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
Post by David Nebauer
Many unicode characters (and certainly all the commonly used ones) are
entered by means of digraphs (using two or more keystrokes to specify
one character). The mnemonics for these are fairly intuitive, like 'a:'
for a-umlaut. Any unicode character can be entered with 'Ctrl-v uxxxx'
where 'xxxx' is the character code.
Is there a link to some doc that explains this, by any chance? I thought about
adding something like a tip box to the docs that briefly explains how to deal
with Unicode characters for the most popular editors. I'd like to add URLs for
further information.
Vim documentation is the ultimate triumph of substance over style. In
aggregate it contains every fact you could or would ever want to know
about Vim. The problem is it's almost impossible to find the
information you want. On the rare occasion you do it is so dry and
technical as to be a foreign language altogether.

There actually appear to be three general methods of entering unicode
characters:

1. Digraphs

I mentioned these in my previous post. This is the easiest method to
learn and remember. Documentation is here:
<http://vimdoc.sourceforge.net/htmldoc/digraph.html>. Same
documentation is available within Vim by typing ':h digraphs'. Type
':digraphs' for a list of digraphs.

2. Keymaps

Frankly, I've skimmed this topic a few times and can't make head nor
tail of it. It claims unicode characters can be entered as combinations
of other characters (sounds somewhat like digraphs but apparently is
different). Documentation is here:
<http://vimdoc.sourceforge.net/htmldoc/mbyte.html>. Same documentation
is available within Vim by typing ':h multibyte'.

3. Direct entry

This is done by 'Ctrl-v u xxxx' where 'xxxx' is the hex number of a
unicode character. Documentation is included in multi-byte help at
<http://vimdoc.sourceforge.net/htmldoc/mbyte.html#utf-8-typing> or
within Vim by typing ':h utf-8-typing'.

Regards,
David.
David Nebauer
2006-07-12 09:45:12 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
One interesting consequence of this is that author names may contain
non-ascii characters. If, when new references are added to refdb, there
is no citation key specified, the citekey is constructed by mangling
primary author surname and year. If citekey is restricted to ascii
characters then non-ascii author surname characters would have to be
stripped or converted (e.g., ä -> a, ß -> ss).
Currently non-ascii characters are simply stripped. You always have
the option to specify a citation key explicitly when adding a
reference, using any reasonable translation of the foreign characters
to ascii.
I'd like to focus on this point again. I personally allow refdb to
generate the citekey for me, mainly because it will automatically append
'a', 'b', etc. if there is danger of duplication. Automatically
stripping non-ascii characters from authors with foreign characters will
lead to some unusual results. A recent publication from our old
workhorse 'Häßler' might produce the citekey 'Hler2006'.

There are tools around which attempt to convert sensibly from unicode to
ascii. Here is an example using the tool 'konwert':
---------------------------------------------------------------------------------------
$ cat name
Häßler, Günter
$ cat name | konwert UTF8-ascii
Hassler, Gunter
$
---------------------------------------------------------------------------------------

Use of this (or a similar) tool would result in the much more
satisfactory, and easy to remember, default citekey of 'Hassler2006'.
It should be fairly simple to add this additional conversion step.

Regards,
David.
Markus Hoenicka
2006-07-12 12:20:02 UTC
Permalink
Hi David,
Post by David Nebauer
Use of this (or a similar) tool would result in the much more
satisfactory, and easy to remember, default citekey of 'Hassler2006'.
It should be a fairly simple to add this additional conversion step.
It is. Unfortunately the konwert sources are a bit hard on the eyes because
they're in Polish but I'll try to steal from that anyway.

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-12 14:25:18 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
It is. Unfortuntately the konwert sources are a bit hard on the eyes because
they're in Polish but I'll try to steal from that anyway.
A lot of the heavy lifting is done by the (executable) filters. On my
system they live in '/usr/share/konwert/filters/'.

UTF8-ascii is a bash script:
--------------------------------------------------------------------------------------
#!/bin/bash -

VARIANT_bg='
Щ SHT
щ sht
' VARIANT_de='
Ä AE
Ö OE
Ü UE
ä ae
ö oe
ü ue
' VARIANT_hr='
Đ DJ
đ dj
' VARIANT_vi='
À A`
Á A'\''
 A^
à A~
È E`
É E'\''
Ê E^
Ì I`
Í I'\''
Ò O`
Ó O'\''
Ô O^
Õ O~
Ù U`
Ú U'\''
Ý Y'\''
à a`
á a'\''
â a^
ã a~
è e`
é e'\''
ê e^
ì i`
í i'\''
ò o`
ó o'\''
ô o^
õ o~
ù u`
ú u'\''
ý y'\''
Ă A(
ă a(
Đ DD
đ dd
Ĩ I~
ĩ i~
Ũ U~
ũ u~
' VARIANT1_bg='
Ъ Y
ъ y
' VARIANT1_ua='
И Y
и y
' REPLACE='?' MIME=us-ascii

if [ "$FILTERM" = out ]
then
NPOJED=
else
NPOJED=1
fi
FORMAT=
HTMLCHAR=
POPRAWKI=
for A in $ARG
do
case "$A" in
(1) NPOJED=;;
(html) FORMAT=html;;
(htmldec|htmlhex) FORMAT=html; HTMLCHAR=${A#html};;
(tex) FORMAT=tex;;
(*)
if [ -x "${0%/*}/../aux/argcharset/$A" ]
then
POPRAWKI=${POPRAWKI:+$POPRAWKI | }${0%/*}/../aux/argcharset/$A
fi
VARIANT=VARIANT_$A; APPROX="${!VARIANT} $APPROX"
VARIANT=VARIANT1_$A; APPROX1="${!VARIANT} $APPROX1"
;;
esac
done

if [ "$POPRAWKI" ]
then
"$SHELL" -c "$POPRAWKI"
else
cat
fi |
case "$FORMAT" in
(html)
"${0%/*}/../aux/fixmeta" us-ascii |
if [ "$HTMLCHAR" ]
then
"${0%/*}/UTF8-html$HTMLCHAR"
else
trs -e '\}\[@&<>\] @' \
${NPOJED:+-e} ${NPOJED:+"$APPROX"} \
-e "$APPROX1" \
${NPOJED:+-f} ${NPOJED:+"${0%/*}/../aux/UTF8-ascii"} \
-f "${0%/*}/../aux/UTF8-ascii1" \
-e "\300\-\377 ${REPLACE:-?} \200\-\277 \!" |
trs -e '@@ @ @& & @< < @> > & &amp; < &lt; > &gt;'
fi
;;
(tex)
trs -e '\}\[@\#$%&\\^_{|}~\] @' \
-f "${0%/*}/../aux/UTF8-tex" \
-e "$APPROX" \
-e "$APPROX1" \
-f "${0%/*}/../aux/UTF8-ascii" \
-f "${0%/*}/../aux/UTF8-ascii1" \
-e "\300\-\377 ${REPLACE:-?} \200\-\277 \!" |
trs -e '@@ @ @\# \# @$ $ @% % @& & @\\ \\ @^ ^ @_ _ @{ { @| | @} } @~ ~
\# \\\# $ \\$ % \\% & \\& \\ $\\backslash$ ^ \\^{} _ \\_ { \\{ | $|$ }
\\} ~ \\~{}'
;;
(*)
trs ${NPOJED:+-e} ${NPOJED:+"$APPROX"} \
-e "$APPROX1" \
${NPOJED:+-f} ${NPOJED:+"${0%/*}/../aux/UTF8-ascii"} \
-f "${0%/*}/../aux/UTF8-ascii1" \
-e "\300\-\377 ${REPLACE:-?} \200\-\277 \!"
;;
esac
--------------------------------------------------------------------------------------

There's bash wizardry in there I can't even begin to fathom.

Regards,
David.
Markus Hoenicka
2006-08-16 21:42:34 UTC
Permalink
Hi David,
Post by David Nebauer
I'd like to focus on this point again. I personally allow refdb to
generate the citekey for me, mainly because it will automatically append
'a', 'b', etc. if there is danger of duplication. Automatically
stripping non-ascii characters from authors with foreign characters will
lead to some unusual results. A recent publication from our old
workhorse 'Häßler' might produce the citekey 'Hler2006'.
I've tried to resolve this problem by running the citekeys through an
iconv conversion (from UTF-8 to ASCII with transliteration switched
on). Only then are invalid characters stripped from the strings. I
hope this will improve the automatically created citation keys.

iconv uses a latex-style transliteration of umlauts and other
non-ASCII characters. E.g. our beloved 'Häßler' is converted to
'H"assler'. RefDB has to strip the '"' from the latter as it must not
appear in XML attribute values, hence you'll end up with 'Hassler2006'
instead of the abovementioned 'Hler2006'. It is still not the correct
German transliteration, which would call for 'Haessler2006', but I
think we're close enough without having to hand-code a boatload of
special cases.
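The underlying conversion can be tried on the command line; a minimal sketch (the exact transliteration depends on the iconv implementation and locale; the output shown is the one described above):

$ echo 'Häßler' | iconv -f UTF-8 -t ASCII//TRANSLIT
H"assler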


regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
Markus Hoenicka
2006-07-12 23:30:55 UTC
Permalink
Hi David,
Post by David Nebauer
latex2utf8txt   converts LaTeX files to UTF-8 text, removes line breaks
                from paragraphs
I used parts of this script to hack a tex2mail replacement. It isn't
more than a collection of regular expression substitutions which is to
be used as a post-processing filter of bib2ris output. I've added it
to the subversion repository, but it is not yet included into the
build system. Please test it against your data and modify it as
needed or let me know what else should be covered.

Either update your svn sources, or visit the svn web interface:

http://svn.sourceforge.net/viewcvs.cgi/refdb/refdb/trunk/scripts/refdb_latex2utf8txt?view=log

The script does the following:

- replace foreign characters encoded as {\..} constructs with their
UTF-8 counterparts

- remove {\xy ...} commands, leaving only the enclosed text

- remove non-escaped curly brackets

- unescape escaped command characters: # $ % & ~ _ ^ \ { }

- convert the LaTeX dashes '--' and '---' to '-'
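A typical use would be to slot it into the import pipeline right after bib2ris; a minimal sketch with a made-up file name (it is assumed here that bib2ris accepts the file as an argument; otherwise feed it on stdin):

$ bib2ris refs.bib | refdb_latex2utf8txt > refs.ris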

regards,
Markus
--
Markus Hoenicka
***@cats.de
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
David Nebauer
2006-07-06 23:58:17 UTC
Permalink
Hi Markus,
Post by Markus Hoenicka
I'll set up a test case and see what happens. I recall it might be
necessary to specify a default database anyway (i.e. use the -d switch
of runbib) even if you specify a database in each citation. Does the
problem persist if you set a default database?
Yes. I tried it with and without specifying a default database for
'runbib'.

Regards,
David.