[UEB Linguistics] Many-to-one Unicode mappings

Jack Maartman ueblinguistics@nbp.org
Mon, 13 Jun 2005 11:22:28 -0700


Hi George:

Robert:  Please do not hesitate to correct or amend my explanation.

Unicode was constructed in hexadecimal groupings, as an extension of the
ASCII code.  The characters decimal 32 to 126 (hex 20 to 7E) are the Roman
alphabet, punctuation, and numbers familiar to all of us.  The upper half of
an 8-bit character set, often referred to as extended ASCII, was given
assignments for European Latin alphabets, among others; this family of
character sets has been designated ISO-8859, and Unicode took over its
Latin-1 member as codepoints 0 to 255.  An 8-bit character set is limited to
256 characters.  Unicode has adopted a number of encodings, among them
UTF-16, which uses 16-bit units (one per character for most scripts, two for
characters in the higher blocks), greatly increasing the number of
assignable codepoints for characters used throughout the world's scripts.
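
For anyone who wants to check these ranges on their own machine, here is a
minimal sketch, assuming Python 3 and its standard codecs.  The helper names
(utf16_units, latin1_codepoint) are made up for illustration.

```python
# Printable ASCII runs from decimal 32 (hex 20) to decimal 126 (hex 7E).
printable = [chr(cp) for cp in range(0x20, 0x7F)]


def utf16_units(ch):
    """Return how many 16-bit units UTF-16 needs for a single character."""
    return len(ch.encode("utf-16-be")) // 2


def latin1_codepoint(byte_value):
    """Decode one ISO-8859-1 byte and return the Unicode codepoint it maps to."""
    return ord(bytes([byte_value]).decode("iso-8859-1"))


print(len(printable), "printable ASCII characters")
for ch in ("A", "\u0259", "\u2801", "\U0001D11E"):
    print(f"U+{ord(ch):04X} takes {utf16_units(ch)} UTF-16 unit(s)")
# An ISO-8859-1 byte keeps the same number as a Unicode codepoint.
print(hex(latin1_codepoint(0xE9)))
```

The last character in the loop (U+1D11E, from the Musical Symbols block)
lies above FFFF, which is why it needs two units rather than one.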

What the Unicode Consortium did to make things a little easier for everybody
was to start each alphabet, or what the standard calls a block of
characters, on a round hexadecimal number (a multiple of 16).  Thus, braille
is encoded in the standard at U+2800 through U+28FF, and the Cyrillic
alphabet at U+0400 through U+04FF.  My choice of braille as an example is an
unfortunate one, because practically no one would ever write in it, as
braille is in practice mapped onto the basic ASCII character set.
Unfortunately, throughout the English-speaking world there is no conformity
as to the braille characters apart from upper case or lower case letters.
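
As a rough illustration of how the braille block is laid out: in the Braille
Patterns block, dot n of a cell corresponds to bit n-1 of the offset from
U+2800.  The sketch below assumes Python 3; braille_char is a hypothetical
helper name, not part of any braille software.

```python
BRAILLE_BASE = 0x2800  # first codepoint of the Braille Patterns block


def braille_char(dots):
    """Return the Unicode braille pattern for a collection of dot numbers (1-8)."""
    mask = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        mask |= 1 << (dot - 1)  # dot n sets bit n-1 of the offset
    return chr(BRAILLE_BASE + mask)


# The six-dot cells are the first 64 patterns, U+2800 through U+283F.
six_dot_cells = [chr(BRAILLE_BASE + m) for m in range(64)]

print(f"dots 1-2     -> U+{ord(braille_char([1, 2])):04X}")
print(f"dots 1-6     -> U+{ord(braille_char(range(1, 7))):04X}")
print(f"six-dot cells: {len(six_dot_cells)}")
```

Note that this only names the dot pattern; it says nothing about which print
character a given braille code assigns to that pattern.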

My not mentioning that there are many look-alike glyphs, especially when
Jean was talking about the encoding of symbols used in Nigerian languages,
is due in large part to my not knowing from what blocks they were drawn.
Robert, discriminating linguist that he is, pointed this out, and I hope
that clarifies some of the confusion.
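
The three look-alike schwas Robert cited (U+0259, U+01DD, U+04D9) can be
inspected with a short sketch like the following, assuming Python 3 and its
bundled unicodedata module.  It prints each character's official name and
its capital form, which is one place where the look-alikes part company.

```python
import unicodedata

# The three codepoints from Robert's message that share one print glyph.
SCHWAS = (0x0259, 0x01DD, 0x04D9)


def describe(cp):
    """Return (official name, codepoint of the uppercase form) for a codepoint."""
    ch = chr(cp)
    return unicodedata.name(ch), ord(ch.upper())


for cp in SCHWAS:
    name, upper = describe(cp)
    print(f"U+{cp:04X}  {name}  (capital: U+{upper:04X})")
```

Although the lower-case glyphs look the same, each codepoint has its own
name and its own capital letter, so software treats them as three distinct
characters.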

UTF-8, which can encode every character, has to rely on a longer series of
bytes (up to four) for the higher blocks.  Separately, there is what is
termed decomposition, a technique I still don't fully understand, in which
an accented letter can be represented as a base letter followed by combining
marks.  Unicode, regardless of the encoding, needs a suitable font to
display its characters.  The consortium assigns characters and leaves the
way they are displayed to third-party font developers.  Thus there is as yet
no single font to display every character, and the consortium in its last
release of the standard had to use a great deal of ingenuity to make its
many examples display properly.  The standard is on the web in unlocked PDF
files, and the many examples unfortunately don't render well in HTML or
text.  I am including a list of the "blocks", which I hope will illustrate
my point.  Both Robert and George have pointed out that dictionaries for the
representation of Unicode in braille or synthetic speech have been built
into the latest release of JAWS.  This makes life infinitely easier, since a
text saved as "Unicode text" used to be unreadable by any assistive
technology and of course unembossable.
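
To make the difference between UTF-8 byte sequences and decomposition
concrete, here is a small sketch, assuming Python 3.  The helper names are
invented for illustration, and the e-acute is just a convenient example
character, not one taken from the discussion.

```python
import unicodedata


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


def decompose(text):
    """Return the codepoints of text after canonical decomposition (NFD)."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]


# Byte counts grow with the codepoint value: 1, 2, 3, then 4 bytes.
for ch in ("A", "\u0259", "\u2801", "\U0001D11E"):
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) in UTF-8")

# Decomposition is independent of the encoding: e-acute becomes e + accent.
print(decompose("\u00e9"))
```

The byte count is a property of the encoding, whereas the decomposed form is
a property of the character itself and comes out the same in UTF-8 or
UTF-16.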

UEB is predicated on the use of Unicode assignments for every character.
Thus, even if we use an escape mechanism to represent the IPA and its
congeners, as I believe we should, it is still nicely encodable, and once
the variations between British and North American practice (BANA 1977) have
been cleared up, Duxbury will be able to encode these characters into a
translation table.  I think this issue should be dealt with by users of the
system.  The only possible obstacle to the completion of the task is
sorting out what I hope are only minor variants in use on both sides of the
Atlantic.

Anybody who wishes to use a "borrowed" character from another language might
well use an already existing braille representation.  Braille assignments
were never based on Unicode until the inception of UEB, but a "best guess"
should suffice.  This has no impact on the IPA, which is encoded in its own
blocks, namely 0250-02AF and 1D00-1D7F.  There may be other symbols in other
blocks.
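
One way to find out which block a given symbol is drawn from is to read the
"start..end; name" lines of the attached block list.  The following is only
a sketch, assuming Python 3; SAMPLE is a short excerpt of that list, and
parse_blocks and block_of are hypothetical helper names.

```python
import re

# A few lines in the same format as the attached blocks.txt.
SAMPLE = """\
# Start Code..End Code; Block Name
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
0400..04FF; Cyrillic
1D00..1D7F; Phonetic Extensions
2800..28FF; Braille Patterns
"""

LINE = re.compile(r"^([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+);\s*(.+)$")


def parse_blocks(text):
    """Parse 'start..end; name' lines into (start, end, name) tuples."""
    blocks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = LINE.match(line)
        if match:
            start, end, name = match.groups()
            blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks


def block_of(ch, blocks):
    """Return the block name for ch, or None if it is not in the table given."""
    cp = ord(ch)
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return None


blocks = parse_blocks(SAMPLE)
for cp in (0x0259, 0x01DD, 0x04D9):
    print(f"U+{cp:04X} is in: {block_of(chr(cp), blocks)}")
```

Fed the complete list instead of the excerpt, the same two functions would
cover every block; with the excerpt, anything outside the five ranges simply
comes back as None.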

Robert's dichotomy is very relevant, but frankly the confusion surrounding
this list's mandate is understandable.  Linguistics itself is rife with
competing theories and methodologies, resulting in sometimes hotly debated
ideological perspectives.  To add orthography as part of our mandate is
appropriate, especially if we consider that braille as an orthography is
unique, employing in its six-dot form sixty-four invariable characters, with
virtually no single designation for a given character.

I think the use of Robert's rules, as presented in a document found on the
ICEB web page, is germane.

Finally, even if an escape mechanism is used to write phonemics/phonetics,
which I believe is best practice, it still falls under the aegis of a
unified braille code.  In a discussion I had with Joe, he considered this a
serious and highly viable solution.

Jean mentioned a while back that there were two conflicting braille symbol
sets in Nigeria, which played havoc with matriculating students.  My best
guess is that the second competing code is the one put forth by BANA in its
textbook code published in 1997, which, as Robert has amply demonstrated,
has absolutely no validity.  Someone probably needs to determine how much
this code is actually in use by transcribers, and for what audience.  I will
do my best to find out.

Although I am repeating myself here, the code(s) used for the transcription
of phonetics have been with us for a long time, and if it ain't broke, there
is no point in substantially altering it.  Isolating this code from
"literary" braille assignments, e.g. foreign languages in English context,
should simplify Bill's original guidelines, which we have to assume
elucidated the original mandate for our committee.

I would greatly welcome any feedback, and hope there is not too much
redundancy in this message.

My best to all

Jack

----- Original Message ----- 
From: "George Bell" <george@techno-vision.co.uk>
To: <ueblinguistics@nbp.org>
Sent: Monday, June 13, 2005 2:25 AM
Subject: RE: [UEB Linguistics] Many-to-one Unicode mappings


> Hi Robert,
>
> You have chosen an interesting example, which may be a good
> case to take as an example for study and some explanation.
>
> U+01dd is described as - "Latin small letter turned e"
> U+04d9 is described as - "Cyrillic small letter schwa"
>
> Visually, to the untrained eye, the Cyrillic version appears
> to simply be a bold version of the Latin.
>
> So clearly there is an intentional difference of some kind,
> aside from the description.
>
> Can anyone explain please?
>
> George.
>
> > -----Original Message-----
> > From: ueblinguistics-admin@nbp.org
> > [mailto:ueblinguistics-admin@nbp.org] On Behalf Of Robert Englebretson
> > Sent: 13 June 2005 00:11
> > To: ueblinguistics@nbp.org
> > Subject: [UEB Linguistics] Many-to-one Unicode mappings
> >
> > Dear All,
> >
> > Anticipating one of the possible objections to the proposal in my
> > last message, I want to draw your attention to a fact about Unicode
> > that I haven't noticed anyone on the list discussing.  Namely, there
> > are plenty of cases of many-to-one mapping in the Unicode set.  In
> > other words, the same (one) print glyph is represented by several
> > different (many) Unicode values.
> >
> > For instance, the schwa I gave as an example in the table in my last
> > message.  As a phonetic symbol, its Unicode value is U+0259.  This
> > same print glyph has additional Unicode values associated with the
> > orthographies of various languages.  When it's used in Nigerian
> > orthographies it's U+01DD, when it's used in Cyrillic it's U+04D9.
> > In other words, it is the exact same print glyph, but it is
> > represented by three different Unicode codepoints, depending on what
> > it's being used for.
> >
> > The reason this is relevant for purposes of the current discussion
> > is:  having phonetics being considered a secondary code will not
> > directly impact the orthographic decisions made about non-English
> > letters within UEBC proper.  I.e. the Braille phonetic symbol for
> > schwa (U+0259) can easily be a different symbol than whatever is
> > chosen for UEBC transcription of Nigerian schwa (U+01DD).  In other
> > words, we won't have a situation where phonetic Unicode codepoints
> > are competing with UEBC orthographic Unicode codepoints for symbols,
> > since they're actually different codepoints in the first place.
> >
> > --Robert
> >
> > _______________________________________________
> > UEBlinguistics mailing list
> > UEBlinguistics@nbp.org
> > http://nbp.org/mailman/listinfo/ueblinguistics

Attachment: blocks.txt

# Blocks-4.0.0.txt
# Correlated with Unicode 4.0
# Note: The casing of block names is not normative.
#       For example, "Basic Latin" and "BASIC LATIN" are equivalent.
#
# Code points not explicitly listed in this file are given the value No_Block.
#
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplementary
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1700..171F; Tagalog
1720..173F; Hanunoo
1740..175F; Buhid
1760..177F; Tagbanwa
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
19E0..19FF; Khmer Symbols
1D00..1D7F; Phonetic Extensions
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Diacritical Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
27C0..27EF; Miscellaneous Mathematical Symbols-A
27F0..27FF; Supplemental Arrows-A
2800..28FF; Braille Patterns
2900..297F; Supplemental Arrows-B
2980..29FF; Miscellaneous Mathematical Symbols-B
2A00..2AFF; Supplemental Mathematical Operators
2B00..2BFF; Miscellaneous Symbols and Arrows
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use Area
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE00..FE0F; Variation Selectors
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFF; Arabic Presentation Forms-B
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFF; Specials
10000..1007F; Linear B Syllabary
10080..100FF; Linear B Ideograms
10100..1013F; Aegean Numbers
10300..1032F; Old Italic
10330..1034F; Gothic
10380..1039F; Ugaritic
10400..1044F; Deseret
10450..1047F; Shavian
10480..104AF; Osmanya
10800..1083F; Cypriot Syllabary
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D300..1D35F; Tai Xuan Jing Symbols
1D400..1D7FF; Mathematical Alphanumeric Symbols
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
E0100..E01EF; Variation Selectors Supplement
F0000..FFFFF; Supplementary Private Use Area-A
100000..10FFFF; Supplementary Private Use Area-B
