Re: Recode
by Bill Poser - Jan 13th 2006 09:42:49
Recode and Uni2ascii are complementary. Briefly put, Recode converts from
one encoding to another (where the expectation is that the target
character set will be the same as, or a superset of, the source character
set), whereas Uni2ascii converts between UTF-8 Unicode and ASCII
representations of Unicode. In practical terms, Uni2ascii will not
convert between, say, ASCII and EBCDIC,
which Recode will, whereas Recode will not convert between Unicode and the
\x{00E9} format, which Uni2ascii will. (I should say that Recode lists but
does not explain the encodings that it knows so it is not always easy to
figure out what it handles. It is possible that it can handle things that
I am not aware of. But at least as far as I can tell, it does not handle
the textual representations of Unicode characters that Uni2ascii
handles.)
Thus, if you've got a text in, say, TIS-620 (the Thai national standard)
and you want to get it into Unicode, you would use Recode. If you want to
include that Thai text in a blog posting using Movable Type, which is not
8-bit safe, you would use Uni2ascii to convert your Unicode version of the
Thai text to HTML numeric character references. Similarly, if you wanted to
include that Thai text as a string in a program in Java, Python, Scheme, or
Tcl, you would use Uni2ascii to convert the Unicode to the \uxxxx format.
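
The two steps in the Thai example can be imitated in a few lines of Python. This is only a sketch of the idea, not how Recode or Uni2ascii work internally: it assumes Python's built-in `tis_620` codec and `xmlcharrefreplace` error handler, and the helper names `to_html_ncr` and `to_u_escape` are made up for illustration. The upper-case hex digits and the refusal to handle characters above U+FFFF are also choices of this sketch.

```python
def to_html_ncr(text: str) -> str:
    """Replace each non-ASCII character with a decimal HTML numeric character reference."""
    # The "xmlcharrefreplace" error handler emits &#NNNN; for anything ASCII cannot hold.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_u_escape(text: str) -> str:
    """Replace each non-ASCII character with a \\uXXXX escape (BMP only in this sketch)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            # Characters beyond the BMP need surrogate pairs or \UXXXXXXXX,
            # depending on the target language; this sketch does not guess.
            raise ValueError(f"U+{cp:X} is outside the BMP")
    return "".join(out)


# Step 1 (encoding conversion): TIS-620 bytes to Unicode.
tis_bytes = b"\xa1\xd2"              # KO KAI followed by SARA AA in TIS-620
thai = tis_bytes.decode("tis_620")   # U+0E01 U+0E32

# Step 2 (ASCII representation of Unicode): escape for an ASCII-only channel.
print(to_html_ncr(thai))
print(to_u_escape(thai))
```

The first step changes the encoding; the second leaves the characters alone and only changes how they are spelled in ASCII.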
My conception of the difference is this. When you have the same character
set but different associations between the characters and the integers,
conversion between the two is pure encoding conversion. ASCII and EBCDIC
are different encodings of the same character set; converting between them
is a matter of encoding conversion.
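
As a rough illustration of pure encoding conversion, the Python sketch below maps the same characters to different integers. It assumes `cp037`, one of several EBCDIC code pages that ship with Python, as a stand-in for "EBCDIC"; the helper name `recode_bytes` is hypothetical and has nothing to do with the Recode program itself.

```python
def recode_bytes(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by way of Unicode."""
    return data.decode(source).encode(target)


text = "Hello"
ascii_bytes = text.encode("ascii")
ebcdic_bytes = recode_bytes(ascii_bytes, "ascii", "cp037")

# Same characters, different integers.
for ch, a, e in zip(text, ascii_bytes, ebcdic_bytes):
    print(f"{ch!r}: ASCII 0x{a:02X}  EBCDIC (cp037) 0x{e:02X}")

# Converting back recovers the original bytes, so nothing was lost.
round_trip = recode_bytes(ebcdic_bytes, "cp037", "ascii")
```

For text like this, which both encodings can represent, the conversion is lossless in both directions.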
On the other hand, when you have radically different character sets,
conversion from one to the other is a matter of transliteration.
Transliteration may be perfect, or nearly so, if both writing systems have
been adapted for the same language (e.g. the Roman and Cyrillic writing
systems for Serbo-Croatian), or quite imperfect (e.g. when Vietnamese is
written using only the English alphabet).
A third situation is when you use escape sequences to represent the
characters of one character set in another.
That's what we're doing when we use the sequence of ASCII characters
\x{00E9} to represent the Unicode character U+00E9 "Latin small letter e
with acute".
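
The escape-sequence case can be sketched in Python as a pair of functions that go to and from the \x{00E9} style. The function names, the four-digit minimum width, and the upper-case hex digits are assumptions of this sketch rather than a description of Uni2ascii's actual output.

```python
import re


def escape_x_braces(text: str) -> str:
    """Spell every non-ASCII character as \\x{HHHH}; ASCII passes through unchanged."""
    return "".join(
        ch if ord(ch) < 0x80 else f"\\x{{{ord(ch):04X}}}"
        for ch in text
    )


_X_BRACES = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def unescape_x_braces(text: str) -> str:
    """Turn each \\x{HHHH} sequence back into the character it names."""
    return _X_BRACES.sub(lambda m: chr(int(m.group(1), 16)), text)


escaped = escape_x_braces("caf\u00e9")
print(escaped)
restored = unescape_x_braces(escaped)
```

Note that this sketch does not protect an input that already contains a literal backslash-x sequence, so the round trip is only safe when the original text has none.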
Recode is basically intended to handle encoding conversion. Uni2ascii, on
the other hand, is aimed at the third case, the representation of Unicode
characters by ASCII escape sequences. Other programs (e.g. my own Xlit)
deal with transliteration.
Of course, the division I've made here, while I think it is the one that
people usually make, is not quite so clean, since what are generally
thought of as different encodings of the same character set may in fact
use somewhat different character sets. For example, decomposed Unicode
uses sequences of two or more Unicode characters to represent what in
other encodings are single characters: e with acute accent is a single
character in ISO-8859-1 (0xE9) but a two-character sequence (0x0065
0x0301) in decomposed Unicode, where it is treated as plain e followed by
a combining acute accent. Encoding conversion programs like Recode are
therefore, in the strict sense, doing more than pure encoding conversion.
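
The composed and decomposed forms are easy to inspect with Python's standard `unicodedata` module. This is a sketch of the general point only; it says nothing about how Recode itself treats decomposed input, and the helper `code_points` is made up for the example.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


composed = "\u00e9"                                    # single character
decomposed = unicodedata.normalize("NFD", composed)    # e + combining acute

print(code_points(composed), code_points(decomposed))

# The composed form is a single byte in ISO-8859-1.
latin1 = composed.encode("latin-1")

# The decomposed form cannot be encoded directly, because ISO-8859-1 has no
# combining acute accent; it has to be recomposed (NFC) first.
try:
    decomposed.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

recomposed = unicodedata.normalize("NFC", decomposed).encode("latin-1")
```

The recomposition step is the part that goes beyond a character-by-character mapping between encodings.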
At one level, all of these conversions are the same since they can all be
treated as mappings of one set of byte strings to another. However, there
is a conceptual difference among them that, with some fuzzy edges, seems
to correspond to the functionality of the software designed to handle
them.
Returning to practicalities, Uni2ascii and Recode also provide different
approaches to and degrees of control over disparities between character
sets, e.g. what to do with characters with diacritics when converting to
ASCII.