

Unicode



In computing, Unicode is the international standard whose goal is to provide the means to encode the text of every document people want to store in computers. This includes all writing systems still in active use today, many scripts known only to scholars, and symbols that do not strictly represent scripts, such as those used in mathematics, linguistics and the APL programming language. The creation of Unicode is an ambitious project to replace existing character encodings, many of which are limited in size and problematic in multilingual environments. Despite technical problems and limitations and criticism of its process, Unicode is today considered the most complete character set and one of the largest, and it has become the dominant encoding scheme in the internationalization of computer software and in multilingual environments. Many recent standards such as XML, as well as system software such as operating systems, have adopted Unicode as the underlying scheme for representing text.

== Origin and development ==

It is the explicit aim of Unicode to transcend the limitations of traditional character encodings such as those defined by the ISO 8859 standard, which are used in various countries of the world but are largely incompatible with each other. One problem with traditional character encodings is that they allow for bilingual computer processing (usually Roman characters and the local language), but not for multilingual computer processing (computer processing of arbitrary languages mixed with each other).

Unicode is intended to encode the underlying characters, not variant glyphs for those characters. In the case of Chinese characters, this sometimes leads to controversies over what is the underlying character and what is the variant glyph (see Han unification). Unicode's role in text processing is to provide a unique code point, not a glyph, for each character.
In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape or style) to another program, such as a web browser or word processor. This simple aim is greatly complicated by another aim, which is to provide lossless conversion among different existing encodings in order to ease the transition.

The Unicode standard also includes a number of related items, such as character properties, text normalisation forms, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

In 1997 Michael Everson made a proposal to encode the characters of the Klingon language in Plane 1 of ISO 10646. The proposal was rejected in 2001 as "inappropriate for encoding", not because it was technically faulty, but because users of Klingon normally read, write and exchange data in Latin transliteration. The Elvish scripts Tengwar and Cirth from J. R. R. Tolkien's Middle-earth setting were proposed for inclusion in Plane 1 in 1993. The draft was withdrawn to incorporate changes suggested by Tolkienists, and as of 2004 it is still under consideration.

== Mapping and encodings ==

=== Standard ===

The Unicode Consortium, based in California, is the organization that develops the Unicode standard. It is open to any company or individual willing to pay the membership dues. Members include virtually all of the main computer software and hardware companies with any interest in text processing standards, such as Apple Computer, Microsoft, International Business Machines, Xerox, Hewlett-Packard, Adobe Systems and many others.

The Consortium first published ''The Unicode Standard'' (ISBN 0321185781) in 1991, and continues to develop standards based on that original work. Unicode was developed in conjunction with the ISO and it shares its character repertoire with ISO/IEC 10646.
Unicode and ISO/IEC 10646 are equivalent as character encodings, but ''The Unicode Standard'' contains much more information for implementers, covering in depth topics such as bitwise encoding, collation, and rendering, and enumerating a multitude of character properties, including those needed for BiDi support. The two standards also use slightly different terminology.

==== Unicode revision history ====

* 1991 Unicode 1.0
* 1993 Unicode 1.1
* 1996 Unicode 2.0
* 1998 Unicode 2.1
* 1999 Unicode 3.0
* 2001 Unicode 3.1
* 2002 Unicode 3.2
* 2003 Unicode 4.0
* 2005 Unicode 4.1

=== Storage, transfer and processing ===

So far, it has only been said that Unicode is a means to assign a unique number to every character used by humans in written language. How these numbers are stored in text processing is another matter; problems result from the fact that much software in the West has so far been written to deal with 8-bit character encodings only, and Unicode support has been added only slowly in recent years. Similarly, in the East the double-byte character encodings cannot even in principle encode more than 65,536 characters, and in practice the limit imposed by the architectures chosen is much lower. This is not enough for the needs of scholars of the Chinese language alone.

The internal logic of much 8-bit legacy software typically permits only 8 bits for each character, making it impossible to use more than 256 code points without special processing, and 16-bit software is limited to some tens of thousands of characters, while Unicode already has more than 90,000 encoded characters. Several mechanisms have therefore been suggested to implement Unicode; which one is chosen depends on available storage space, source code compatibility, and interoperability with other systems.

The mapping methods are called the UTF (Unicode Transformation Format) and UCS (Universal Character Set) encodings. Among them are UTF-32, UCS-4, UTF-16, UCS-2, UTF-8, UTF-EBCDIC and UTF-7.
The numbers indicate the number of bits in one unit, for UTF encodings, or bytes, for UCS encodings. In UTF-32 or UCS-4, one unit is enough for any character. UCS-2 also uses a single unit per character but can represent only code points up to U+FFFF; in the other cases, a variable number of units is used for each character. UTF-8 is the de facto standard encoding for interchange of Unicode text, with UTF-16 and UTF-32 being used mainly for internal processing.

The Unicode Byte Order Mark (BOM) is specified for use at the beginning of text files in the UCS-2 and UTF-16 encodings. It has been adopted by some software developers for other encodings, including UTF-8, which does not need an indication of byte order. In this case it is an attempt to mark the file as containing Unicode text. The BOM is code point U+FEFF, which has the important property of being unambiguously interpretable regardless of which Unicode encoding is used. The bytes FE and FF never appear in UTF-8, U+FFFE (the result of byte-swapping U+FEFF) is not a legal character, and U+FEFF is the Zero-Width No-Break Space (a character with no appearance and no effect other than preventing the formation of ligatures). The same character converted to UTF-8 becomes the byte sequence EF BB BF.

See also: Mapping of Unicode characters

=== Ready-made vs. composite characters ===

Unicode includes a mechanism for modifying character shape and so greatly extending the supported glyph repertoire: the use of combining diacritical marks. They are inserted after the main character (it is possible to stack several combining diacritics over the same character). However, for reasons of compatibility, Unicode also includes a large quantity of precomposed characters, so in many cases there is more than one way of encoding the same character. To deal with this, Unicode provides the mechanism of canonical equivalence.

The situation with Hangul is similar. Unicode provides the mechanism for composing Hangul syllables from Hangul Jamo. However, the precomposed Hangul syllables (11,172 of them) are also provided.
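The figure of 11,172 precomposed Hangul syllables follows from simple arithmetic over the conjoining Jamo: 19 leading consonants, 21 vowels, and 28 trailing positions (27 consonants plus "none"). The following Python sketch illustrates that arithmetic and is not a full implementation; the helper name `compose_hangul` and the sample syllable are chosen here purely for illustration, while the constants are those of the Hangul syllable composition algorithm in the Unicode Standard.

```python
import unicodedata

# Constants of the Hangul syllable arithmetic in the Unicode Standard.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant (choseong)
V_BASE = 0x1161   # first vowel (jungseong)
T_BASE = 0x11A7   # one below the first trailing consonant; index 0 means "none"
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
SYLLABLE_COUNT = L_COUNT * V_COUNT * T_COUNT


def compose_hangul(lead, vowel, trail=""):
    """Hypothetical helper: map conjoining jamo to one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    in_range = (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
                and 0 <= t_index < T_COUNT and (not trail or t_index >= 1))
    if not in_range:
        raise ValueError("not a composable sequence of conjoining jamo")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


jamo = "\u1112\u1161\u11AB"          # HIEUH, A, NIEUN
syllable = compose_hangul(*jamo)

print(SYLLABLE_COUNT)                # 11172
print(f"U+{ord(syllable):04X}")      # U+D55C
# The standard library's normalizer performs the same mapping in both directions.
print(unicodedata.normalize("NFC", jamo) == syllable)
print(unicodedata.normalize("NFD", syllable) == jamo)
```

In practice one would call `unicodedata.normalize` directly; the explicit formula is shown only to make the count of 11,172 visible.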
The CJK ideographs are currently encoded only in their precomposed form. Still, most of those ideographs are evidently made up of simpler elements, so in principle it would be possible to decompose them just as is done with Hangul. This would greatly reduce the number of required code points, while allowing the display of virtually every conceivable ideograph (and so doing away with all the problems of Han unification). A similar idea is used by some input methods, such as the Cangjie method and the Wubi method. However, attempts to do this for character encoding have stumbled over the fact that ideographs do not decompose as simply or as regularly as it seems.

Combining marks, like the complex script shaping required to properly render Arabic text and many other scripts, usually depend on complex font technologies, such as OpenType (by Adobe and Microsoft), Graphite (by SIL International), and Apple Advanced Typography (by Apple Computer), by which a font designer includes instructions in a font telling software how to properly output different character sequences. Another method sometimes employed in fixed-width fonts is to place the combining mark's glyph before its own left sidebearing; this method, however, works only for some diacritics, and stacking will not occur properly.

As of 2004, most software still cannot reliably handle many features not supported by older font formats, so combining characters generally will not work correctly. In principle, U+1E17 (precomposed e with macron and acute above) and the sequence U+0065 U+0304 U+0301 (e followed by the combining macron above and the combining acute above) are identical in appearance, both giving an e with macron and acute accent, but in practice their appearance can vary greatly across software applications. Also, underdots, as needed in Indic romanization, will often be placed incorrectly or worse. Of course, this is not a weakness in Unicode itself; it only uncovers gaps in rendering technology and fonts.
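Canonical equivalence, as opposed to visual rendering, can be checked in software. The short Python sketch below uses the standard `unicodedata` module to compare the precomposed e with macron and acute against the same letter built from combining marks; the helper name `canonically_equivalent` is an illustrative choice, not part of any standard API.

```python
import unicodedata

precomposed = "\u1E17"            # LATIN SMALL LETTER E WITH MACRON AND ACUTE
decomposed = "e\u0304\u0301"      # e + COMBINING MACRON + COMBINING ACUTE ACCENT


def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# As raw code point sequences the two strings differ ...
print(precomposed == decomposed)                          # False
print(len(precomposed), len(decomposed))                  # 1 3
# ... but they are canonically equivalent.
print(canonically_equivalent(precomposed, decomposed))    # True
# NFC prefers the precomposed form, NFD the fully decomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Software that compares or searches text would typically normalize both sides to one form first; whether the two forms also look the same on screen remains a matter for the font and the rendering engine.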
=== Issues ===

Some people, mostly in Japan, oppose Unicode in general, claiming technical limitations and political problems in its process; people working on the Unicode standard respond that these claims are simply misunderstandings of the standard and of the process by which it was created. The most common mistake, according to this view, is confusion between abstract characters and their highly variable visual forms (glyphs). On the other hand, whereas Chinese readers can readily read most types of glyphs used by Japanese or Koreans, Japanese readers often can recognize only a particular variant.

Unicode has been decried as a plot against Asian cultures perpetrated by Westerners with no understanding of the characters as used in Chinese, Korean, and Japanese, in spite of the presence of a majority of experts from all three countries in the Ideographic Rapporteur Group (IRG). The IRG advises the Consortium and ISO on additions to the repertoire and on Han unification, the identification of forms in the three languages which will be treated as stylistic variations of the same historical character. This unification is one of the most controversial aspects of Unicode.

Unicode is criticized for failing to allow for older and alternate forms of kanji, which, it is said, complicates the processing of ancient Japanese and uncommon Japanese names, although it follows the recommendations of Japanese scholars of the language and of the Japanese government. There have been several attempts to create an alternative to Unicode [http://www-106.ibm.com/developerworks/unicode/library/u-secret.html]. Among them are TRON (although it is not widely adopted in Japan, some, particularly those who need to handle historical Japanese text, favor it), UTF-2000 and Giga Character Set (GCS).
It is true that many older forms were not included in early versions of the Unicode standard, but Unicode 4.0 contains more than 70,000 Han characters, more than most dictionaries and any other standard, and work continues on adding characters from the early literature of China, Korea, and Japan.

Thai language support has been criticized for its illogical ordering of Thai characters. This complication is due to Unicode inheriting the ordering of the Thai standard TIS-620, which worked in the same way. This ordering problem complicates the Unicode collation process [http://www-106.ibm.com/developerworks/unicode/library/u-secret.html].

Opponents of Unicode sometimes claim even now that it cannot handle more than 65,536 characters, a limitation that was removed in Unicode 2.0.

== Unicode in use ==

=== Operating systems ===

Despite technical problems and limitations and criticism of its process, Unicode has emerged as the dominant encoding scheme. Windows NT and its descendants Windows 2000 and Windows XP make extensive use of UTF-16 as an internal representation of text. UNIX-like operating systems such as GNU/Linux, BSD and Mac OS X have adopted UTF-8 as the basis of representation of multilingual text.

=== E-mail ===

MIME defines two different mechanisms for encoding non-ASCII characters in e-mail, depending on whether the characters are in e-mail headers such as the "Subject:" or in the text body of the message. In both cases, the original character set is identified as well as a transfer encoding. For e-mail transmission of Unicode the UTF-8 character set and the Base64 transfer encoding are recommended. The details of the two different mechanisms are specified in the MIME standards and are generally hidden from users of e-mail software.

The adoption of Unicode in e-mail has been very slow. Most East Asian text is still encoded in a local encoding such as Shift-JIS, and many commonly used e-mail programs still cannot handle Unicode data correctly, if they have any support at all.
This situation is not expected to change in the foreseeable future.

=== Web ===

Recent web browsers display web pages using Unicode if an appropriate typeface is installed (see Unicode and HTML). Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4.0 and XML 1.0 documents are, by definition, composed of characters from the entire range of Unicode code points, minus only a handful of disallowed control characters, the permanently unassigned code points D800-DFFF, any code point ending in FFFE or FFFF, and any code point above 10FFFF. These characters appear either directly as bytes according to the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point, as long as the document's encoding supports the digits and symbols required to write the references (all encodings approved for use on the Internet do). For example, the references &#916; &#1049; &#1511; &#1605; &#3671; &#12354; &#21494; &#33865; &#45307; (or the same numeric values expressed in hexadecimal, with &#x as the prefix) display in your browser as Δ, Й, ק, م, ๗, あ, 叶, 葉 and 냻. If you have the proper fonts, these symbols look like the Greek capital letter "Delta", the Cyrillic capital letter "Short I", the Hebrew letter "Qof", the Arabic letter "Meem", the Thai numeral 7, the Japanese Hiragana "A", the simplified Chinese "Leaf", the traditional Chinese "Leaf", and the Korean Hangul syllable "Nyaelh", respectively.

=== Fonts ===

Free and retail fonts based on Unicode are common, since first TrueType and now OpenType support Unicode. These font formats map Unicode code points to glyphs. There are thousands of fonts on the market, but fewer than a dozen attempt to support the majority of Unicode's character repertoire; these fonts are sometimes described as pan-Unicode.
Instead, Unicode-based fonts typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. There are several reasons for this: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to be demanding of resources in computing environments; and operating systems and applications are becoming increasingly intelligent in obtaining glyph information from separate font files as needed. Furthermore, it is a monumental task to design a consistent set of rendering instructions for tens of thousands of glyphs; such a venture passes the point of diminishing returns for most typefaces.

Unicode characters that cannot be rendered are most often displayed as an open rectangle, to indicate the position of the unrecognized character. Some attempts have been made to provide more information about these characters. The Apple ''LastResort'' font displays a substitute glyph indicating the Unicode range of the character, and the SIL International Unicode fallback font displays a box showing the hexadecimal scalar value of the character.

=== Multilingual Text Rendering Engines ===

* Uniscribe - Microsoft Windows
* Apple Type Services for Unicode Imaging - new engine for Apple Macintosh
* WorldScript - old engine for Apple Macintosh
* Pango - open source
* International Components for Unicode - open source
* Graphite - open source renderer from SIL International

=== Input methods ===

On Windows XP, any Unicode character can be input by holding down Alt and typing the decimal digits of the character's code point one after the other on the numeric keypad. For example, holding Alt and typing 9, 6 and 0 yields π (Greek lowercase letter Pi). For values less than 256, precede the digits with a 0 to avoid code page translation (see Extended ASCII); for example, Alt with 0, 1, 6, 5 yields ¥.
Word 2003 also allows Unicode characters to be entered by typing the hexadecimal code first, e.g. 014B for the 'ng' symbol (ŋ), and then pressing Alt+X to replace the string to the left with its Unicode character.

Macintosh users have a similar feature in an input method called 'Unicode Hex Input', available in Mac OS X and in Mac OS 8.5 and later: hold down the Option key and type the four-hex-digit Unicode code point. Code points above 0xFFFF are handled by entering a UTF-16 surrogate pair, which is converted into a single character automatically. Mac OS X (version 10.2 and newer) also has a 'Character Palette', which allows users to visually select any Unicode character from a table organized numerically, by Unicode block, or by a selected font's available characters.

GNOME follows ISO 14755: hold down Ctrl and Shift and enter the hexadecimal Unicode value.

The Opera web browser, in version 7.5 and later, allows users to enter any Unicode character directly into a text field by typing its hexadecimal code, selecting it, and pressing Alt+X.
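Entering a code point above 0xFFFF as UTF-16 means typing two 16-bit units rather than one. The following Python sketch shows how such a pair of units can be derived from a code point and how the same character looks in UTF-8, UTF-16 and UTF-32. The function names are illustrative, and U+1D11E (MUSICAL SYMBOL G CLEF) is simply a convenient sample character from outside the first 65,536 code points.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need two units")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


clef = 0x1D11E
high, low = to_surrogate_pair(clef)
print(f"{high:04X} {low:04X}")                        # D834 DD1E

# The same character in three encodings (big-endian, no byte order mark).
text = chr(clef)
print(text.encode("utf-8").hex(" "))                  # f0 9d 84 9e
print(text.encode("utf-16-be").hex(" "))              # d8 34 dd 1e
print(text.encode("utf-32-be").hex(" "))              # 00 01 d1 1e

# An ASCII letter needs 1, 2 and 4 bytes respectively.
print([len("A".encode(e)) for e in ("utf-8", "utf-16-be", "utf-32-be")])
```

With an input method such as the one described above, a user would type the four hex digits of the first unit and then the four hex digits of the second.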
== See also == * Table of Unicode characters, 128 to 999 * Free software Unicode fonts == External links == * [http://www.unicode.org The Unicode Consortium] ** Unicode versions: [http://www.unicode.org/unicode/reports/tr27/ 3.1], [http://www.unicode.org/unicode/reports/tr28/ 3.2], [http://www.unicode.org/versions/Unicode4.0.0 4.0], [http://www.unicode.org/versions/Unicode4.0.1/ 4.0.1], [http://www.unicode.org/versions/Unicode4.1.0/ 4.1] ** [http://www.unicode.org/alloc/Pipeline.html new characters], [http://www.unicode.org/pending/pending.html scripts] and [http://www.unicode.org/alloc/investigation.html characters and scripts under investigation] ** [http://www.unicode.org/charts/ Code Charts] (portable document format) * [http://unicode.coeurlumiere.com/ Table of Unicode characters from 1 to 65535] * [http://www.macchiato.com/unicode/charts.html UTF-8, UTF-16, UTF-32 Code Charts] and a [http://www-atm.physics.ox.ac.uk/user/iwi/charmap.html character map] (JavaScript) * [http://www.eki.ee/letter/ The Letter Database] Uses forms to present groups in list or grid format by hexadecimal. * [http://www.decodeunicode.org/ DecodeUnicode - Unicode WIKI, Typographic project to explain every unicode character with 50.000 gifs in three sizes] *[http://www.cl.cam.ac.uk/~mgk25/ucs/examples/ Example text files using Unicode] *[http://www.lazytools.com/unicode-ascii/ Unicode special character map] is similar to the Windows version. Click a symbol to obtain either the named or numeric code for HTML. * [http://www.evertype.com/standards/csur/ ConScript Unicode Registry] a project to standardize part of the Private Use Area for use with artificial scripts and artificial languages. An explanation of how to propose character names in Unicode is available here. * [http://www-106.ibm.com/developerworks/unicode/library/u-secret.html The secret life of Unicode] "A peek at Unicode's soft underbelly" Describes problems requiring resolution. Includes links to Unicode resources. 
*Tim Bray's [http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF Characters vs Bytes] explains how the different encodings work. * [http://www.alanwood.net/unicode/ Alan Wood's Unicode Resources] Contains lists of word processors with Unicode capability; characters are grouped by type; characters are presented in lists, not grids. *[http://www.hastingsresearch.com/net/04-unicode-limitations.shtml The strongest denunciation of Unicode], and a [http://slashdot.org/features/01/06/06/0132203.shtml response to it] *Fonts and tools: **[http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html Unicode fonts and tools] for the X Window System ** Unicode TTF fonts: Arial Unicode MS, Code2000: [http://home.att.net/~jameskass/ license info and download link], Junicode: [http://www.engl.virginia.edu/OE/junicode/junicode.html license info and download link], Titus Cyberbit Basic: [http://titus.uni-frankfurt.de/indexe.htm?/unicode/unitest2.htm license info] & [http://titus.fkidg1.uni-frankfurt.de/unicode/tituut.asp download link] ** [http://earthlingsoft.net/UnicodeChecker/ UnicodeChecker], a Unicode character browser for Mac OS X *Software engineering: ** [http://icu.sourceforge.net/ International Components for Unicode (ICU)] An open source set of libraries that provide robust and full-featured Unicode services for your applications on a wide variety of platforms. ** [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)] by Joel Spolsky of JoelonSoftware.com (this is now outdated, but still a reasonable starting point). ** [http://freedesktop.org/wiki/Software_2futf_2d8 Freedesktop.Org's Project UTF-8]'s purpose is to document and promote proper Unicode support in free and Open Source software. 
** [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ Supplementary Characters in the Java Platform] from Sun Microsystems
* Seeing [http://www.ianalbert.com/misc/unichart.php the entirety of Unicode printed out] as a single large poster gives a good feel for the size of the code.

Unicode



This paragraph was added to the start of the article: :Unicode is a standard used in computer software for encoding human readable characters in digital form. The most common encoding is the :ASCII code, which can encode a maximum of 127 characters, which is enough for the English language. As computer use spread to other languages, the shortcomings of ASCII became more and more apparent. There are many other languages with many other characters; Asian languages in particular contain many, many characters. I removed it because it is inaccurate (It overplays Unicode-as-a-standard rather than Unicode as a consortium that produces lots of standards), confusing (its mention of ASCII is not clearly historical), and adds no information that isn't already in the article. I assume, however, that it was added because someone thought the existing first paragraph was unclear, so I'm open to suggestions about how to improve it. --User:Lee Daniel Crocker ==Downloads== I am not a techie! Nevertheless I can see the usefulness of much of the material available in Unicode. Neither am I the sort of anti-techie that complains that anything in other than plain-Jane unaccented English alphabetical characters must be thrown out of Wikipedia, or that articles should not be displaying meaningless question marks. I was visiting the chess page, and someone there has made a valiant effort to produce diagrams of how the pieces move by using only ordinary keyboard characters. I'm sure that he would not take it as a sign of disrespect when I say that it looks like shit. :I see no such chart there. User:Evertype 15:43, 2004 Jun 20 (UTC) ::That's because they've been removed. They were there when the comment was posted in March 2002. --User:Zundark 16:11, 20 Jun 2004 (UTC) I'm sure that most of us would like to see the special symbols, letters, or chinese characters at the appropriate time and place. 
At the same time I understand that for many Wikipedians there are technical reasons which prevent their hardware from dealing with this material (eg. limited memory). Then there are others for whom only the appropriate software is missing. Even some of the people with hardware restrictions may be able to handle Greek or Russian, though probably not Chinese. In cases where I've tried to find the code, I've ended up wading through reams of technical discussions. These discussions may be very interesting, but they don't provide a solution to my immediate problem. The practical suggestion may be a notice at the head of any article containing symbols not in ISO 8859-1 saying in effect. "This article contains non-standard characters. You may download these characters by activating this LINK". user:Eclecticology : Just because an HTML document contains characters that are not in the ISO 8859-1 range doesn't mean that the characters are nonstandard. HTML 4.0 allows nearly all of Unicode to be used in a document, and all web browsers make an attempt to handle any character they encounter. The problem is merely that the underlying operating systems upon which the browsers rely to provide character rendering tends to be either not Unicode aware or just does not have a good selection of fonts (character-to-glyph mappings) installed. : There's no reliable way to guess the user's character rendering capabilities, so we really don't know when to tell people when it would be a good idea to download font files, and fonts tend to be OS-specific anyway. I prefer just to acknowledge in the prose that any non-ASCII characters may or may not render as they are *supposed* to. I don't think we should dumb down the HTML and avoid those characters though. - User:Mjb 18:21 Feb 20, 2003 (UTC) ---- ''In cases where I've tried to find the code, ...'' What exactly were you looking for ? Do you have the Unicode value, and you're looking for a typical glyph (like a ASCII ) ? 
Are you looking for the Unicode value? ''These discussions may be very interesting, but they don't provide a solution to my immediate problem.'' What exactly is your "immediate problem"?

----

Is there a reason to use '<code>foo</code> <code>bar</code> <code>baz</code> ...' instead of '<code>foo bar baz ...</code>'? -- User:Miciah

==UTF-7==

Isn't there a UTF-7? Or is it an invention of Microsoft (it's in .NET)? User:Cgs 21:54, 16 Sep 2003 (UTC).

:Yes[http://czyborra.com/utf/#UTF-7], but it's virtually never used. --User:Brion VIBBER 23:33, 16 Sep 2003 (UTC)

----

''The oldest of Unicode's encodings is UTF-16, a variable-length encoding that uses either one or two 16-bit words, manifesting on most platforms as 2 or 4 8-bit bytes, for each character. {NB: This can't be true; UCS-2 has to predate UTF-16!}''

User:66.44.102.169 wrote "{NB: This can't be true; UCS-2 has to predate UTF-16!}" in the article. UTF-16 was previously UCS-2, but I'm not sure that makes the statement untrue as such; I reworded it anyway. User:Angela

:In the way the terminology is used today, UCS-2 doesn't have surrogate support, and certainly a 16-bit encoding without surrogate support existed before one with it. I don't think either of these was called UCS-2 or UTF-16 at the time, though. User:Morwen 11:52, 6 Dec 2003 (UTC)

::I wrote the comment that UTF-16 can't be the oldest encoding. Currently text may be encoded in UCS-2 or it may be encoded in UTF-16. Many Windows application designers pay no heed to the difference, but by their assumptions clearly support UCS-2 and not UTF-16. I speak of the MS-Windows world, wherein UCS-2LE holds dominant sway. In fact, Microsoft documents very commonly use the term "Unicode" as a synonym for UCS-2LE. Anyway, I meant that at present we have both UCS-2 encodings and UTF-16 encodings, and I suspect we will all agree that the UCS-2 encodings (by whatever name) predate the UTF-16 encodings.
:)

==A Brain Dropping follows:==

Wasn't Unicode created to encode ''all'' languages - not just 'human' languages? In the future then, why couldn't Unicode conceivably be used to encode extraterrestrial languages as well? (Well, why not? hehehehe) Therefore, shouldn't the 'human' be removed from this page? One possible alternative: Unicode is the international standard whose goal is to specify a code matching every character needed by every known written language to a single unique integer number, called a code point.

:The Universal Character Set, whether in its Unicode Standard or its ISO/IEC 10646 manifestation, was made to encode the ''writing systems'' of the world, not the ''languages'' of the world. User:Evertype 15:43, 2004 Jun 20 (UTC)

:I'm not an expert on this, but I don't believe the Unicode Consortium seeks to encode writing systems whose existence we don't yet know of (and if we ever meet aliens who use more than 2<sup>32</sup> characters, Unicode will have a problem). I believe that Tolkien's Elvish scripts are in there, but other fictional scripts like Klingon are not. So they're not being wholly anthropocentric. User:Adamrice 00:02, 11 Jul 2004 (UTC)

:Tengwar and Cirth are not yet encoded, but are roadmapped for encoding. To answer the brain-dropping: Were we to meet aliens who had an encodable writing system, it is likely that their characters would fit. User:Evertype 11:59, 2004 Jul 11 (UTC)

::I find that statement a little overreaching. The aliens could easily have a writing system with a million symbols, or several Chinese-size writing systems, or a history of writing that dates back millions of years instead of five or six thousand. Or simply be a group of ten different species of aliens with writing histories as complex as ours. The best we can say is that humans, all told, will use about 3 planes, and there are 17 planes of characters.
--User:Prosfilaes 03:49, 12 Jul 2004 (UTC)

: There is currently no Klingon alphabet/writing system that is suitable for encoding. The glyphs shown in Star Trek are merely a nearly 1:1 mapping of the Latin alphabet. The Star Trek folks seem to have a real Klingon alphabet, but have not yet published it. User:JensMueller 10:06, 5 Sep 2004 (UTC)

:: Any alphabet is going to have a nearly 1:1 mapping to the phonemes of the language, and hence to the Latin transcription. Why must a "real" Klingon alphabet be ill-fitted to the Klingon language? --User:Prosfilaes 04:11, 9 Sep 2004 (UTC)

::: No sufficient Klingon alphabet has been published, and nearly all Klingon users online are using a "roman transcription". The Klingon encoding proposal was turned down because the script was hardly used (and also because the sounds were mapped to English sounds, rather than Klingon). If a canonical Klingon alphabet were to appear, I guess a Unicode encoding would be likely. Klingon seems to have 26 sounds, according to the Wiki article and www.kli.org, so it shouldn't be too difficult to find a mapping area for them.

:::: The Klingon piqaD alphabet has a mapping in the Private Use Area of Unicode, and has recently come into occasional use on the Internet. See, for example, [http://www.kli.org/wiki/index.php?Chatting%20in%20pIqaD Chatting in piqaD] and [http://qurgh.blogspot.com/ qurgh's blog]. The requirement for getting Klingon piqaD an assignment of regular Unicode code points is some level of use in data interchange. We can expect that it will qualify at some time in the future. --User:Cherlin

==AAT==

Can someone who knows who AAT are add to the AAT disambiguation page appropriately and also send the link on this page to the right place? Thanks User:EddEdmondson 08:59, 19 Jun 2004 (UTC)

:Done. (AAT is Apple Advanced Typography, but we have no article on it at present.)
--User:Zundark 09:40, 19 Jun 2004 (UTC)
==Perhaps for "Issues?"==
One of classicists' issues with Unicode has been the omission of the LATIN Y WITH MACRON characters. While the omission has been corrected in Unicode 3, most user agents don't know how to render anything for that codepoint. Somewhere in that story is an issue that perhaps might make sense in the article: either the omission of the letter, or the outdated support available from user agents (I don't see Microsoft rushing to update its fonts and package them as an update to Windows or Internet Explorer just to comply with recent standards).
:Definitely not. This is not an "issue" with Unicode, but with implementation. We add support for classicists constantly, and it wasn't Y WITH MACRON alone either. User:Evertype 21:49, 2004 Jul 10 (UTC)
==UTF-8 as the basis for multilingual text?==
''"UNIX-like operating systems such as GNU/Linux, BSD and Mac OS X have adopted Unicode, more specifically UTF-8, as the basis of representation of multilingual text."'' Mac OS X stores a lot of text in UTF-8, but the other UTFs are also supported throughout the system and widely used. I agree that UTF-8 is currently the most widely used Unicode encoding (because it is the most legacy-compatible encoding), and that is important enough to mention in the leading section, but perhaps it should be rephrased so that it doesn't mislead the reader into believing those OSes don't support other kinds of Unicode?
-- User:Chmod007 04:16, 9 Sep 2004 (UTC)
== Section 0 ==
I think section 0 could be improved by changing the current text: ''In computing, Unicode is the international standard whose goal is to specify a code matching every character (computing) needed by every written human language, including many dead languages in small scholarly use, to a single unique integer number, called a code point.'' To: ''In computing, Unicode is the international standard whose goal is to specify a code matching every character (computing) needed by every written human language, including many dead languages in small scholarly use such as foo and bar as well as [some other good example, perhaps a made-up language?], to a single unique integer number, called a code point.'' I think the intro would be better with two examples added there. Furthermore, I ''think'' ''is the international standard'' should be ''is an international standard'', or has it been approved by a major authority as ''the'' standard? --User:Ævar Arnfjörð Bjarmason 17:44, 2004 Oct 5 (UTC)
::The parts of Unicode which are also in ISO 10646 most likely define it as ''the standard'', also given the fact that the maintenance of ISO 8859 has been put into hibernation. User:Pjacobi 20:27, 5 Oct 2004 (UTC)
And, arguing about section 0, what about ''in internationalization of software''? A Thai programmer writing a program with a Thai user interface for Thai customers doesn't fit the definition of internationalization at all. -- User:Pjacobi 20:30, 5 Oct 2004 (UTC)
The interesting thing for most people is that it provides a way to store text in any language in a computer. Starting off by mentioning "unique integer numbers" doesn't make Unicode easier to understand. Even as a computer programmer, I have a bit of trouble reading that sentence and understanding what it means.
And it's not really true as given; the idea of characters in Unicode is a polite fiction. Many characters (Maltese "ie", Lakota p with bar above, many Khmer characters) are more than one character in Unicode-ese. Going to rewrite boldly. --User:Prosfilaes 21:47, 11 Oct 2004 (UTC)
== Largest and most complete ==
This phrase has appeared recently without much discussion. "Unicode is the most complete character set, and one of the largest." Could anyone give justification? -- User:TakuyaMurata 06:14, Oct 12, 2004 (UTC)
:ISO/IEC 10646? Unicode reserves 1,114,112 (2^20 + 2^16) code points, and currently assigns characters to more than 96,000 of those code points. No other encoding even comes close. User:Anárion 09:09, 12 Oct 2004 (UTC)
:As for the completeness, take a look at mapping of Unicode characters for all the scripts encoded. User:Anárion 09:10, 12 Oct 2004 (UTC)
:GB18030 is by definition as large as Unicode, but except for the pre-existing mappings, all GB18030 codepoints and Unicode codepoints, including yet unassigned ones, are algorithmically mapped. So, it is more like a strange encoding form of Unicode.
:For certain scripts, there are character sets with more precomposed glyphs, e.g. VISCII for Vietnamese, TSCII for Tamil, or some scholarly encoding for pointed Hebrew. But they don't count as larger, as they don't support more than one or two scripts, and they don't count as more complete, as the encoded characters are uniquely representable in Unicode as sequences including combining characters.
:So yes, according to all my knowledge and research, Unicode is the most complete and one of the largest character sets for which information is freely available in languages I can read.
:If you have knowledge of implemented character sets (not counting proposals, which are cheap to make) which are more complete than Unicode, please elaborate.
:Otherwise, I'll revert your reversal.
:User:Pjacobi 09:12, 12 Oct 2004 (UTC)
:::Unicode is meant to be a superset of all known character sets, so it is hardly possible that there are character sets not covered by it. (And surely all VISCII and TSCII characters are included in Unicode as they are). -- User:Monedula 11:11, 12 Oct 2004 (UTC)
:::: Unicode is only a superset of the character sets in use when they started out. In practice, they were a superset of most new character sets up to 2000, at which point they stopped encoding new precomposed characters. (So they're still a superset of them, in a practical sense.) One of the Chinese standards that encoded every minor variation on the ideograph seen wasn't added to Unicode, which decided to adopt a more unifying encoding policy, but all the Chinese and Japanese standards in widespread use are subsets. --User:Prosfilaes 21:53, 12 Oct 2004 (UTC)
:::I agree with your first half-sentence, but Unicode has decided not to encode any more precomposed characters. Only for pre-existing national and international standards was there a consensus to include the precomposed characters. Now, new suggestions for precomposed characters are routinely declined, and for good reasons. In fact it is hoped that in some future version (6.0?) all existing precomposed characters will become deprecated. A similar case exists for glyph variants. What got in, is in, but new additions will be declined. See also: http://www.unicode.org/standard/where/
:::So there is no chance (neither is there a necessity) that TSCII codepoint 0xE0 "tU" will be assigned a single codepoint in Unicode. Instead it transcodes as [U+0ba4 U+0bc2] --User:Pjacobi 12:37, 12 Oct 2004 (UTC)
: Ignoring the whole Han ideographs, and ignoring the sets that are basically new encodings of Unicode, what is there? TRON? --User:Prosfilaes 21:53, 12 Oct 2004 (UTC)
:: The idea of switching between several character encodings isn't unique to TRON; it was already included in ISO-2022.
And both have in common that they make implementation difficult and scatter the design process of new script encodings instead of unifying it. I don't think much of it is still in use. Heck, you can't even use both (TRON and full ISO-2022 with escape switching) on the Web or in e-mail. Most of the Unicode criticism on the TRON advocates' pages is just outdated or a result of misunderstandings. --User:Pjacobi 22:32, 12 Oct 2004 (UTC)
I am convinced that Unicode is probably the largest and most complete character set, but can we still ignore criticism of Unicode? What I often hear about Unicode is that it is inadequate in handling old text or text containing outdated characters. Maybe most of the criticisms are pointless or a result of misunderstandings, but I still hear them and I don't think we should make a general statement which not everyone agrees with. Unicode is meant to be the largest and the most complete, but whether it really is so is disputed, even if such a dispute is nonsense in actuality. -- User:TakuyaMurata 22:41, Oct 12, 2004 (UTC)
:Yes, of course the criticisms must be included, but we must try hard to find the right criticisms and the right way to present them. Don't forget the long expertise of Unicode in this field and the large number of field experts contributing to the evolving Unicode effort. We would achieve nothing but spoil the credibility of Wikipedia if we hastily add criticisms of mediocre quality.
:A generic problem with Unicode is the long process it takes to get additions done. This is the downside of centralism. And you need somebody with "weight" to get major additions and changes done: either a national standards body or field experts of value.
:A brainstorming list of criticisms:
:*Unicode got the Hebrew points for biblical texts all wrong (or something like that, I'm no expert)
:*Unicode has unified scripts (and requires different fonts and markup to differentiate), which should not have been unified.
:*Unicode has not unified scripts which should have been unified, as they are only font differences.
:*Unicode has too few presentation forms for complex shaping scripts
:*Unicode has not enough presentation forms for complex shaping scripts
:*Unicode has too few precomposed glyphs
:*Unicode has not enough precomposed glyphs
:As you can see, some criticisms arise out of the fact that decisions must be made in a standard on questions which are viewed differently by different people.
:User:Pjacobi 00:32, 13 Oct 2004 (UTC)
:If you want to say that Unicode is not the largest and most complete, then there must be something that's larger or more complete. If you tell us what it is, we can discuss it.
:Most of the complaints about Unicode don't stem from size or completeness. Most of the scripts and characters that are left are very obscure and almost invariably not used for writing new material. The complaints come from how Unicode treats the existing scripts; often the question is whether two entities should be treated as distinct. Since in all these cases, they are distinct in some ways, and not in others, there's no "right" answer that will satisfy everyone. The Chinese and Japanese encodings that are supposedly more "complete" are in reality more fine-grained, in that they separate characters that Unicode unifies. --User:Prosfilaes 20:01, 13 Oct 2004 (UTC)
== Reorganization ==
I reorganized the sections and continued work on the leading section. I think the new 4 big sections make good sense: origin and development, mapping and encoding, process and issues, and in use. In addition to this, we probably need:
* difference in character and glyph; we should give some example
* difference in mapping and encoding; particularly, what is code point, what is plane?
* short summary of utf; what is utf? and why we want it
* size comparison, particularly what Unicode does not include; perhaps Pjacobi is right that some criticisms are wrong, but it is still true that many people advertise their sets as being larger and more complete. We need some response to them.
If I have some time, I will try to address them but you can also help me. Finally, I'm sorry for the late reply to the question of Unicode as the largest and most complete. I slightly reworded the mention. Please make further edits if you think necessary. -- User:TakuyaMurata 20:42, Oct 17, 2004 (UTC)
:You must give some concrete examples who advertises which character set to be larger in what specific sense. --User:Pjacobi 22:18, 17 Oct 2004 (UTC)
::ok, the press release of chokanji 3 [http://www.chokanji.com/press/ck3/010116ck3press.html] (in Japanese) says it supports 170,000 kanji while Unicode handles 20,000 Chinese characters, 12,000 of which are kanji. -- User:TakuyaMurata 02:15, Oct 18, 2004 (UTC)
::: a) If I'm not mistaken, the press release is dated 2001-01-06. So, nearly four years later, are there any implementations? Can you give the URL of a single webpage in this charset? Is the IANA registration in progress? Does somebody work on GNU iconv support? Does somebody work on IBM ICU support? You can't compare vaporware to a widely implemented standard.
::: b) As of version 4.0, Unicode supports 71,000 Han characters; it is horribly outdated or misinformed to state the number 20,000. And the PRC is busily adding more. It is a political decision of JIS not to propose adding more kanji, either because JIS doesn't see the necessity or for other reasons.
:::User:Pjacobi 06:12, 18 Oct 2004 (UTC)
:We don't compare. You wanted "some concrete examples who advertises which character set to be larger in what specific sense.". So this is the answer. Again and again, I didn't mean that what they are saying is fair.
I won't use their product because there is just so little compatibility and besides, I don't have any practical problem with Unicode. I mean I agree with you, so I am not sure whom you are trying to convince. -- User:TakuyaMurata 13:14, Oct 18, 2004 (UTC)
Thank you for giving the concrete example. Yes, I specifically asked for it. I apologize for replying in flame-war style. --User:Pjacobi 14:41, 18 Oct 2004 (UTC)
''Many documents in non-western languages, for instance, are still represented in other character sets.'' Which languages? Which character sets? In this generality it doesn't help. Please state the languages and character sets used. And remember, GB18030 is now fully harmonized with Unicode and cannot be considered a different character set, but rather a Unicode encoding form standardized by somebody other than ISO or Unicode Org, namely the Guobiao. --User:Pjacobi 22:23, 17 Oct 2004 (UTC)
:Maybe Shift-JIS? I don't think it is the ''only'' character set used beside Unicode. If you know more, that would help. -- User:TakuyaMurata 02:15, Oct 18, 2004 (UTC)
::The largest uses of non-Unicode charsets are still EBCDIC, ASCII and ISO-8859-1, as seen on this Wiki. So this doesn't look like a west vs east problem to me. The difference is that almost universally all other charsets are considered to be subsets of Unicode nowadays. And especially the HTML and XML character model explicitly states that while the physical charset may vary, the logical charset is always Unicode. Also in programming, it is nearly always assumed that everything can be converted (and most things reversibly) to Unicode.
::So if I can judge this correctly, the Unicode character encoding model is only challenged by some users of Japanese, and not much is known of this outside of Japan. As said above, I'm very skeptical about the practical relevance of the Unicode challengers. But the interesting point, why this happens in Japan, seems to be good material for a separate article about Japanese character encoding.
::User:Pjacobi 06:23, 18 Oct 2004 (UTC)
:I had absolutely no intention of making this a west vs east problem. If you think some sentences are problematic, then go ahead and edit. I just wanted to illustrate the adoption of Unicode, and the sentence was never meant to imply that the use of Unicode is problematic or anything. Besides, I am not sure what you are saying. I don't think you believe all non-Unicode character sets have died out completely. We want to show when Unicode is used and when it is not. I mean, what do you want after all? -- User:TakuyaMurata 13:14, Oct 18, 2004 (UTC)
::Sorry for being unclear. And apologies for not contributing to the article itself at the moment. I am of the opinion that some non-trivial additions are dearly needed (on the character model, on character vs glyphs vs graphemes), but I feel unable to do it myself. Perhaps I'll try it next week.
::No, I surely don't want to say non-Unicode character sets have died out completely. What I tried to say is that the character encoding model of Unicode is a nearly universal success, and nowadays other character sets are mostly seen as subsets of Unicode. This wasn't the case ten years ago.
::User:Pjacobi 14:41, 18 Oct 2004 (UTC)
It's fine. I was just puzzled about what upset you so much. As a matter of fact, I am neither a backer of Unicode nor a detractor. I am only interested in making the article informative for those who have questions about Unicode. It's very surprising that many people don't know much about Unicode, even computer programmers. The article could be a help for them. -- User:TakuyaMurata 15:54, Oct 23, 2004 (UTC)
:Fully agree. When supporting charset issues (as I do sometimes for Firebird SQL) it's quite amazing that some programmers at first don't even see a problem in the different mappings between characters and bytes. --User:Pjacobi 17:55, 23 Oct 2004 (UTC)
==Phishing==
In the section that talks about pre-composed characters vs.
composing with several codepoints, how about mentioning that this capability opens up lots of opportunities for phishing once URLs are more universally accepted in UTF-8? For example, once accented characters are common in website addresses, links with a pre-composed "è" and a separate "e" plus an accent will point to different sites, but look identical to the user (in fact the intent is for them to look the same). I don't know if this info belongs here, but it's an interesting tidbit. User:Rlobkovsky 00:06, 6 Dec 2004 (UTC)
:If and when URIs start supporting characters beyond ASCII in a standard way, some decomposing must take place, as according to the principles behind Unicode the precomposed character à is exactly equivalent to ` + a. Any future internet domain funkèynáme.ext will have to point to the same IP(v6?) address for all its possible decomposings. 12:26, 28 Dec 2004 (UTC)
==Sentence==
"To address the short coming, Unicode is being revised periodically with the addition of more characters and increase in the size of characters potentially represented in unicode." It's something of a moot point now, but in case it comes up in the future, the reason I cut that sentence is that it was inaccurate. They don't add more characters to address the shortcoming (one word) that people don't use Unicode; there are probably fewer than a hundred thousand people who would use any of the scripts that are going to be added to Unicode. And for several of the scripts, like Egyptian Hieroglyphics or Hungarian Runic or Tengwar, there's no commercial interest in the script, and there's little to no academic interest in encoding the script (the Egyptologist community has basically told Unicode to go away and come back in a few decades). Hobbyist demand for unencoded scripts isn't a huge shortcoming that Unicode is trying to overcome. What does "increase in the size of characters potentially represented in unicode" mean?
I assume by size, you mean number (since you can increase the size of characters just by using a larger font), but I'm not sure what "potentially" means here. As I read it, it's redundant with "addition of more characters". --User:Prosfilaes 03:38, 11 Dec 2004 (UTC) ''The simplest representation of Unicode (giving every character the same number of bits, rather than a more complicated variable-width encoding) has historically increased from 16 bits to about 20 bits. There is (currently) about 2^20 "potential" characters. I suspect the original author suspected that in the future, *more* than (roughly) 20 bits will be required; and that the consortium is planning to "periodically" increase the number of bits. --User:DavidCary 22:17, 11 Feb 2005 (UTC)'' ::The consortium doesn't plan to increase the number of bits. In 15 years, two planes of characters have almost been filled, out of 15. Just as importantly, those two planes include virtually every character used in a computer; a few people use Tengwar or pIqaD or Cuneiform or Egyptian hieroglyphics, but they're incredibly rare and they amount to a few thousand characters, not the more than a half million it would take to require expansion. And honestly, if it was a matter of expanding for those or ignoring them, their concerns are minor enough and the changes in every piece of Unicode software major enough I suspect they would get ignored. --User:Prosfilaes 00:30, 1 Jun 2005 (UTC) :::it depends on exactly how you define filled. 
:::the [http://www.unicode.org/roadmaps/bmp/ BMP] (plane 0) is basically full, mostly with fully allocated and standardised codepoints
:::the [http://www.unicode.org/roadmaps/smp/ SMP] (plane 1) is mostly stuff in various stages of approval but still has quite a bit of room marked as completely unknown (less than half though)
:::the [http://www.unicode.org/roadmaps/sip/ SIP] (plane 2) is more than half filled by "CJK Unified Ideographs Extension B" and most of the rest is pencilled in for yet more CJK stuff.
:::the [http://www.unicode.org/roadmaps/ssp/ SSP] (plane 14) is mostly empty right now
:::iirc planes 15 and 16 are reserved for private use but i'm not sure.
:::so if you count the areas that are pencilled in for future scripts then a LOT more than 2 planes are in use.
==Revision history has a future date==
please justify this. If it is not justified within a few days i will be reverting User:Plugwash 12:15, 25 Dec 2004 (UTC)
:I've reverted it already. Future dates are never justified for this sort of thing, because schedules can change. --User:Zundark 12:39, 25 Dec 2004 (UTC)
==A little clarification about Tolkien's scripts and Klingon?==
I don't mean to be a spoilsport, but these bits just don't seem to fit in _at_ all. I was reading through it just then and I thought an anonymous user must have added it in for a laugh. I think a rewording's in order, but perhaps it's just me. I definitely don't think it deserves quite as much as has been written about it, though. :-/ Someone care to back me up here? I'm not too sure of myself. Edit: Under the 'Development' Category --User:Techtoucian 10:16, 07 Jan 2005 (UTC)
:I think they fit, if only because they show how the Unicode consortium actually considers scripts which to some seem no more than a 'laughing matter' -- certainly Tengwar and Cirth see more actual use than some of the scripts which are already encoded.
12:41, 7 Jan 2005 (UTC)
==Chinese Punctuation==
: "Unicode also has a number of serious mistakes in the area of CJK punctuation. For example, it mistakenly treats partial punctuation marks in the various CJK encodings as full punctuation marks, for instance treating half of a CJK ellipsis as the same as an English ellipsis, even though the two glyphs are both semantically and visually dissimilar (considering that the CJK ellipsis can be centred between the baseline and ascender, but the English ellipsis must always be placed on the baseline)." --User:Gniw 06:53, 6 Feb 2005 (added to article)
This page should not be a page of everyone's minor complaints about Unicode. I've read the Unicode list for four or five years, I've read the Standard, I've read both pro- and anti-Unicode pages (including all the Tron pages in English, and they include about every general or Japanese-specific Unicode complaint possible) and I've never heard this before. Given that it seems to be one person's complaint, I don't think it's worthy of being added to an encyclopedia article. --User:Prosfilaes 21:48, 6 Feb 2005 (UTC)
:This is not a minor complaint if you do bilingual typesetting or write bilingual (Chinese and English) web pages. The ellipsis misidentification in Unicode results in very ugly mixed English-and-Chinese web pages. But given the sad state of punctuation typesetting taught at art schools these days, and the way English computing has changed Chinese typesetting, I'm not surprised that no one has talked about this. User:Gniw 22:49, 9 Feb 2005 (UTC)
::I stand by my position. This is an encyclopedia, not a list of what's wrong with Unicode. If there are no English pages on the issue, then most of the people who could fix the issue have never heard of it; and if no one has ever seen fit to bring it before them, I hardly see it as a major issue. I wouldn't post bug reports about a program on Wikipedia, so I don't see this as appropriate.
::But please, if someone else has an opinion on this, please chime in.--User:Prosfilaes 03:45, 11 Feb 2005 (UTC)
:::Why isn't this a big issue? The triviality of this is precisely the reason it is important; it shows that Unicode has mistakes that even primary school students should be able to spot, yet here it is in the standard. This just shows how ''sloppy'' Unicode is regarding CJK.
:::Do you really think that if people who are likely to be ''affected'' by the issue have mentioned it, and the discussion happens not to be in English, then it is not an issue?!
:::What you mean is "the use of English is a requirement for an issue to be recognized as an issue" or "no matter whether people have discussed it or not, if it has never been discussed in English then it cannot possibly be an issue". Or, in short, "English is the measure of all things". If this is not Western imperialism I don't know what it is, and you don't understand why the Japanese are opposed to Unicode? Opposition to Unicode is not really so much a technical problem but more a perception of a lack of respect; the fact that my contribution was deleted on New Year tells a lot. User:24.101.156.72 19:18, 11 Feb 2005 (UTC)
:::: If a Chinese encyclopedia wrote an article complaining about some problem in the English Wikipedia, and they never mentioned it to anyone who could fix it, we'd be a little pissed. Bring the issue before us, and if we choose not to fix it, then there's a valid complaint, but we can't fix what we don't know about. If it doesn't matter enough to bring it to the people who can fix it, or the people discussing it don't respect the standard enough to try and fix it, it's not an important issue.
:::: I think it says a lot that you're not discussing the issue; you're complaining about imperialism and that somehow people shouldn't correct articles on holidays. I will repeat again, this is a thirty year old problem ''made by Chinese standards''.
You can't do better using Big5 or any other Chinese standard. Which says a lot to me about the importance of the problem. :::: While we're on the subject of "Western Imperalism", I will note that the US-based SIL International and the Ireland-based Michael Everson have been instrumental in getting new scripts (e.g. several Philippine scripts like Buhid) into the standard, while the Japanese standards body sent a letter to the ISO working group asking for such new standards efforts to cease. Such accusations are insulting and provably inaccurate. --User:Prosfilaes 23:46, 11 Feb 2005 (UTC) :::::Excuse me. Do you know what a "double byte character set" is? Big5 (as well as GB, EUC-KR, EUC-JP, and Shift JIS) is a DBCS, and by the very nature of a DBCS, you can't encode a whole CJK ellipsis. We ''have'' to encode half of the ellipsis. Now when the Unicode committee look at the CJK national character sets and decide that half a CJK ellipsis is equal to a full English ellipsis, that is incredible sloppiness. This is not a "thirty year old problem made by Chinese standards" in the context of Unicode. :::::And how do you want me to discuss the issue? When whatever I write will simply get deleted. User:66.163.1.120 00:05, 12 Feb 2005 (UTC) ::::::It's not incredible sloppiness. It's a unification decision that had some negative side effects. (And we could discuss the incredible sloppiness involved in assuming that every non-ASCII character was double-width, one that still sometimes plagues Russians who get the pleasure of dealing with double-width Cyrillic.) And I want you to discuss it here, on the talk page, instead of making changes on the main page, until some sort of consensus is reached. (And I'd really like a third party to chime in.) --User:Prosfilaes 01:32, 12 Feb 2005 (UTC) :::::::I cannot understand why this is not sloppiness. The two are completely different. 
As I originally wrote, (1) they are different in form (the CJK ellipsis can be set on the baseline, or between the baseline and the ascender; the English ellipsis can only be set on the baseline) and (2) they are different in meaning (two "ideographic three dot leader"s, [http://cl.cocolog-nifty.com/dtp/2004/09/u2026_horizonal.html as some Japanese people think it should be called], are required to make one true ellipsis, the leader itself is meaningless; one "horizontal ellipsis" (U+2026) is meaningful by itself). The two cannot be unified no matter whether they consider unification to be based on form or on meaning. :::::::Ok, you might argue that this only means they are unable to spot the differences. But they go into so much effort into distinguishing between almost-indistinguishable variations in ideogram forms (many are really typographic stylistic variations that unfortunately came to be associated with different countries), not making comparable effort in distinguishing these two glyphs certainly sounds extremely strange. Even if they had checked the punctuation sections of a Chinese or Japanese dictionary they would have realized that the "ideographic three dot leader" is not itself a punctuation mark. And this has the added benefit that dictionaries usually set the ellipsis between the baseline and the ascender, so they would ''simultaneously'' realize that the two are different in form. In short, there is simply no basis for "unification": Yet they got "unified". Aside from "incredible sloppiness" I really cannot explain this. :::::::(I do accept that Unicode unifications are sometimes based on form, though I think this is contrary to the spirit of Unicode unifications. I personally don't like the CJK unification myself, and you won't understand why I feel this way until you try to work on a Unicode font yourself. 
But if you ask for my objections to unification decisions, I'll say the unification of the umlaut and the diaeresis really makes no sense considering they dis-unify a lot of other things (I'm talking about western script, not CJK) that look 100% identical. In the case of the CJK vs English ellipses, form is not even a question, since they are different in form.)
:::::::I do agree with the double-width mess. For us the opposite problem occurs, that all the box-drawing characters become single-width, making Unicode almost useless in terminal emulators if box-drawing characters are to appear anywhere. --User:Gniw 03:45, 12 Feb 2005 (UTC)
::::::::First, I stand by my point: for 15 years, this unification has stood, and no one has complained to Unicode. For probably ten of those years, there would have been no problem disunifying the characters, yet not a single standards body made the request. If they were so completely inappropriately unified, there has been incredible sloppiness and apathy on the part of the users of the affected scripts.
::::::::You make too many assumptions about what I do and don't understand. I believe I understand the reasons why people disagree with CJK unification, and seriously doubt that making a font would make a bit of difference. The whole question is whether the difference is a difference in preferred fonts or a difference in script.
::::::::You are apparently a splitter. Besides the fundamental backward compatibility problems, I can't imagine trying to explain to the people at Distributed Proofreaders that coöperate uses a different ö from Köln. Splitting these would cause a world of pain to the advantage of a few librarians. In any case, the various opinions on when to split and when to unify are a much more general and interesting topic to add to the page. --User:Prosfilaes 00:31, 13 Feb 2005 (UTC)
:::::::::Well, I think I am correct in assuming that you have never worked on a Unicode font.
Before I attempted to work on a Unicode font some time ago, I thought just like you (being content with the state of the Han unification). :::::::::In the current state of the Han unification, there are many characters that are not unified. However, after adding a radical, the new characters are all unified. :::::::::If I want to make ''one'' Unicode font containing all the ideograms (not an unreasonable thing, since making such a font requires so much effort), which style should I choose? If adding the radicals would not make the new characters unified, I'd be all happy too (it would just mean that all variants are distinguished, as opposed to variants being not distinguished); as it is, ''no matter which style I choose, I end up with a font that is wrong.'' :::::::::Regarding the ellipsis itself, it is not a difference in font. Would you consider an ellipsis-like glyph that is raised above the baseline (to about x-height) suitable for typesetting English? From ''your'' viewpoint, this is exactly what unification of U+2026 and the hypothetical "ideographic three dot leader" means. :::::::::In a sense, the mis-unification of the ellipsis and the "ideographic three dot leader" can be thought of as equivalent to the problem of having full-width Cyrillic letters (in that both mistakenly equates a glyph that's only appropriate in C/J/K with an incompatible western glyph). If you find full-width Cyrillic letters unacceptable and is "incredible sloppiness", I fail to understand why an ellipsis raised to x-height for English is acceptable or is not the result of sloppiness. :::::::::I would not object to your calling us having "incredible apathy" regarding Unicode. 
We have already acquired "incredible apathy" after using the suboptimal national character sets for so long; and many of our typesetting and/or punctuation conventions have been destroyed by Western-centric computing for so long (can you imagine that just about ten years ago even westerners knew that in C/J/K, numbers should be grouped by myriads, but now many Chinese do not even know this, and instead group digits by thousands and then laboriously count the digits every time a large number is being read… and many Chinese are so used to western-style underlining that they are now desensitized to the grammatical mistakes they are making every time they underline Chinese words that are not proper names…). I definitely think that this is pathetic enough, and there is no need for Unicode to make this kind of mistake and further worsen the situation.
:::::::::I am not saying that the knowledge of proper punctuation has not deteriorated in the West; but at least the deterioration has not been codified into an international standard (unless I count this ellipsis mis-unification)… --User:Gniw 04:30, 13 Feb 2005 (UTC)
:::::::::PS: Perhaps there is; other than this ellipsis thing, there is also this hyphen-dash confusion. It seems to be just as bad…
::::::::::afaict the hyphen-dash issue comes from the fact that ASCII and other encodings of its era came from the days when characters on computers were fixed width. Given that and the limited number of code values available in ASCII it seemed totally reasonable to unify the hyphen, dashes and minus signs. There was also the unification of beta and sharp s in IBM code page 437. User:Plugwash 02:46, 1 Jun 2005 (UTC)

== Revision history year-wikilinks ==
The year wikilinks in the revisions list are a little confusing; I clicked through thinking I was going to be led to that particular revision, but found myself on a general-year page. Could you reconsider these links please? Thanks.
User:Ceyockey

== Unicode adoption in e-mail ==
: ''The adoption of Unicode in e-mail has been very slow. Most East-Asian text is still encoded in a local encoding such as ISO-2022-JP, and many commonly used e-mail programs still cannot handle Unicode data correctly. This situation is not expected to change in the foreseeable future.''
This doesn't look like an accurate picture to me. Mac OS X's default Mail.app client has transparently supported Unicode since 2001. Didn't Windows 95's ''Internet Mail and News'' or ''Outlook Express'' have Unicode support even earlier? I don't know how ''widely used'' Unicode is, but hasn't it been very ''widely supported'' for years? ''—User:Mzajac  User talk:Mzajac  2005-04-12 21:20 Z''
:Keep in mind that the fact that some programs support Unicode does not mean they can handle text encoded in Unicode ''correctly''. The situation may have changed since then, but I used to hear that you should not send mail in Unicode because many programs have problems with it. I have even heard a report that Gmail does not correctly handle the subject line of e-mails. More research would certainly help, but I don't think the above is far from reality. -- User:TakuyaMurata 02:35, Apr 13, 2005 (UTC)

== Input methods ==
:On Windows XP, any Unicode character can be input by pressing Alt, then, with Alt down (and using only the numeric keypad keys), pressing the decimal digits of the Unicode characters one after the other. For example, Alt, then, with Alt still down, 9, then 6 and then 0 yields π (Greek lowercase letter Pi). For values less than 256, precede the digits with a 0, to avoid code page translation (see Extended ASCII), e.g. Alt 0, 1, 6, 5 yields ¥.
This just doesn't work when I try it. Pressing Alt-9-6-0 gives me └, which appears to be "Box Drawings Light Up And Right", character x2514/9,492 (└). However, Alt-0-x-x-x does work for me and always has (I can get the yen symbol fine). Does this statement need correction or clarification?
—User:Simetrical (User_talk:Simetrical) 01:57, 8 May 2005 (UTC) Forgot to mention, I ''do'' use Windows XP, English-language SP 2 to be precise. —User:Simetrical (User_talk:Simetrical) 02:31, 8 May 2005 (UTC) :I use WinXP, Spanish-language SP2, and it does not work for me, either. Nor does it work for anyone I know who uses WinXP, either. By the way, the character '└' can also be obtained by pressing Alt+192 - moreover, I have found that under WinXP, Alt+number produces the same output as Alt+number modulo operation 256 (provided that any zeroes before the original number are preserved). So, Alt+289 produces '!', Alt+416 produces 'á', and Alt+0416 produces ' ', the non-breaking-space. :I think that paragraph should be removed. --User:Fibonacci 21:53, 21 May 2005 (UTC) ::it seems to depend on the edit control in use. it seems stuff that uses the standard edit (e.g. notepad) doesn't allow unicode entry with alt+numpad whereas stuff that uses the standard richedit (e.g. wordpad) does (tested on english winxp non-sp2 not sure if its original or sp1). User:Plugwash 22:37, 21 May 2005 (UTC) ::: The way I understand it, a four-digit or longer number enters the Unicode character. A three-digit number under 256 enters the character in the current code page, which I suppose would be Win CP-1252 for English and some European languages (don't know if that includes Spanish). It appears that three-digit numbers over 255 are processed with some funky math (Shouldn't numbers over 255 be Unicode? Can anyone think of a reason for using modulo-256 except programmer laziness?). 
''—User:Mzajac  User talk:Mzajac  2005-05-25 17:45 Z''
::NO NO NO
::in apps that use the Windows EDIT control (i.e. Notepad) you CANNOT enter Unicode with alt+numpad (unless the app makes special provisions, which some apps seem to do) and numbers entered with alt+numpad are treated modulo 256 regardless of length
::in apps that use the Windows RICHEDIT control numbers over 256 and all numbers 4 digits or more are Unicode (for numbers like 052 the local code page matches Unicode anyway so it's impossible to really tell)
::other apps that set up their own edit controls may behave differently again. User:Plugwash 18:40, 25 May 2005 (UTC)

== Nifty resource. ==
I found, at some point, a [http://www.fileformat.info/info/unicode/index.htm nifty resource for Unicode at fileformat.info]. It has some rather decent tools for looking up individual codepoints, like [http://www.fileformat.info/info/unicode/char/0023/index.htm U+0023] or [http://www.fileformat.info/info/unicode/char/20ac/index.htm U+20AC]. Each page includes [http://www.fileformat.info/info/unicode/char/20ac/browsertest.htm a browser test] and [http://www.fileformat.info/info/unicode/char/20ac/fontsupport.htm font support info]. Perhaps it would be useful to link U+F00F the same way we link PMID, ISBN and RFC IDs now. User:Grendelkhan|User_talk:Grendelkhan 16:50, 2005 May 25 (UTC)

== Unicode 4.1.0 ==
Can someone give me a link so that I can download Unicode 4.1.0 for free? User:JarlaxleArtemis 00:14, May 27, 2005 (UTC)
: http://www.unicode.org — User:Monedula 05:56, 27 May 2005 (UTC)
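:A note on the "Input methods" thread above: the modulo-256 behaviour reported for the plain EDIT control can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described in that thread, not of how Windows implements it. The function name <code>alt_code</code> is made up for illustration, and the choice of code page 437 as the OEM page and Windows-1252 as the ANSI page is an assumption (a Spanish-language system would more likely use code page 850, which agrees with 437 for the values tested in the thread).

```python
def alt_code(digits: str, oem: str = "cp437", ansi: str = "cp1252") -> str:
    """Model Alt+numpad entry in a plain EDIT control, as described in the thread.

    Hypothetical helper: the value is reduced modulo 256, then looked up in the
    ANSI code page if the digits start with 0, otherwise in the OEM code page.
    The default code pages are assumptions for an English-language system.
    """
    value = int(digits) % 256
    codepage = ansi if digits.startswith("0") else oem
    # errors="replace" covers the few byte values that cp1252 leaves undefined.
    return bytes([value]).decode(codepage, errors="replace")


if __name__ == "__main__":
    for keys in ("960", "289", "416", "0416", "0165"):
        print(f"Alt+{keys} -> {alt_code(keys)!r}")
```

:Under these assumptions Alt+960 reduces to 192, which is the box-drawing character └ in code page 437, matching what was reported. Values below 32 will not match Windows, since Python's cp437 codec maps them to control characters rather than the glyphs the OEM code page displays.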

Unicode



Discussion page about using Unicode in Wikipedia. ''(Note that this page and the discussion page are covering much the same topic, so you might want to read both. User:M.e 11:02, 14 Jul 2004 (UTC))''
----

== Unicode question ==
This may be the wrong place to ask this, or it may be answered elsewhere, but can anyone tell me if and when the English Wiki will be changed over to UTF-8? I ask because it's hugely inconvenient to work with text that's full of &#347;'s, but for some topics (Sanskrit and associated languages and subjects, in my case), there is no adequate alternative to using Unicode characters. This is true even if I eschew Devanagari and work in roman, because standardized roman transliteration requires characters with diacritics that aren't available in Latin-1. User:Kukkurovaca 20:51, 31 Mar 2004 (UTC)
::I'm assuming that moniker is in Tamil, because my Mozilla 1.6 is totally fazed by it. -User:Phil Boswell | User talk:Phil Boswell 14:, Apr 1, 2004 (UTC)
:::No, regular everyday Sanskrit, in Devanagari.
:As I understand it: Until recently, the general prognosis was "never", but the French Wikipedia recently converted, and I believe it was ''mostly'' successful. So if the remaining problems highlighted by that conversion get ironed out, there may be a possibility that the English 'pedia could make the switch as well if the desire is there. - User:IMSoP 22:08, 31 Mar 2004 (UTC)
::What are the pros and cons? (I am sure this conversation has been had before, so a pointer will be plenty). User:Pcb21 User_talk:Pcb21 22:30, 31 Mar 2004 (UTC)
:::The pros are that people who edit pages using special characters or non-Roman alphabets can just enter the characters as normal, and it'll just "work," instead of them having to encode the characters using a somewhat random numerical code.
For example, the characters in Kukkurovaca's name above must be encoded as &#2325;&#2369;&#2325;&#2381;&#2325;&#2369;&#2352;&#2379;&#2357;&#2366;&#2330; :::I'm not sure of all of the cons, but one is that some older browsers don't support Unicode, in input if not in output; the database back end that Wikipedia uses may not support it either, in which case there would have to be a layer of code that would convert the Unicode-encoding text into something the database can handle when it is stored, and convert that text back into Unicode when it is retrieved. Also, special characters which are already on many pages currently in Wikipedia could go glitchy due to the change. User:Garrett Albright 22:41, 31 Mar 2004 (UTC) ::::Those older browsers are not able to browser half the WWW by now. User:JorUser talk:Jor 12:21, 1 Apr 2004 (UTC) :The masses clamor for Unicode! I'm surprised something so standards-oriented as Wikipedia isn't using it already... User:Garrett Albright 22:23, 31 Mar 2004 (UTC) ::The main reason it isn't Unicode is because the original version of the software didn't support it, and conversion is difficult. It'll require some downtime. There were worries about corruption of the database in various ways, but we have a fairly good handle on that problem now thanks to the recent conversion of the French Wikipedia. I think conversion of the English Wikipedia would be a good idea, some time during the next few months. -- User:Tim Starling 00:04, Apr 1, 2004 (UTC) :::The only Mac browsers able to use Unicode are Safari, Opera etc. on MacOS X, as far as I know, while it is not possible to edit unicode pages with IE. A switch to unicode would be very problematic for many Mac users. User:Ertz 00:12, 1 Apr 2004 (UTC) :::OS 9 has Unicode support; not quite as slick as OS X, no, but it's there. Either way, the number of people still using OS 9 is dwindling rapidly, and will continue to do so. 
User:Garrett Albright 02:43, 1 Apr 2004 (UTC) :::Which masses have you polled? Unicode would be largely impossible to edit. User:RickK | User talk:RickK 02:45, 1 Apr 2004 (UTC) ::::Howso? I mean, what are the specific drawbacks, other than for the users of older macs?User:Kukkurovaca 03:10, 1 Apr 2004 (UTC) :::::If I were trying to edit a page, and came across something looking like |कुक्कुरोवाच, I would have NO idea what to do with it. User:RickK | User talk:RickK 03:35, 1 Apr 2004 (UTC) ::::::RickK: Just work around it and don't touch it. :) ::::::Judging by your <nowiki> tags, do you mean "something looking like &#2325;&#2369;&#2325;&#2381;&#2325;&#2369;&#2352;&#2379;&#2357;&#2366;&#2330;"? In which case, I'm not sure I see your point. We already use such character entities extensively in articles. The idea of UTF-8 is to allow unicode characters to be inserted without resorting to such ugly constructions. Also, switching en to UTF-8 will make it easier to implement some proposed interwiki features, such as merging the meta recent changes (which is UTF-8) with the local wiki recent changes. -- User:Tim Starling 03:49, Apr 1, 2004 (UTC) :::::::::Doesn't work in Safari, at least not whatever particular language that is. I see the same character (a box surrounding a char I don't recognize) repeated for each character in your sig. Other languages work fine: Japanese, Chinese, Greek, some Cyrillic, but there's one Cyrillic-alphabet-based language that also doesn't work (not sure which it is). That's the problem: support is spotty. If user A enters in text in Japanese natively, what happens when user B who doesn't have Unicode support saves the page? I'm pretty sure the characters would change to little boxes (or whatever the browser displays when it doesn't understand a character) in the textarea, the user would save the page and then ''everybody'' would see the "little boxes." I think it could be a problem waiting to happen. 
User:RadicalBenderUser talk:RadicalBender 05:02, 1 Apr 2004 (UTC) ::::::::::The web browser does *not* rewrite the characters to "little boxes" when editing -- they are simply shown that way by whatever display mechanism the browser uses. User:Silsor 05:29, Apr 1, 2004 (UTC) ::::::::::RB: Next time you (re)install OS X, make sure to let it install every language file it can. I'm running Safari on OS X, and I see the characters just fine. User:Garrett Albright 05:34, 1 Apr 2004 (UTC) ::::::::I have no idea how to do that. And how many other random Wikipedia editors would? User:RickK | User talk:RickK 04:14, 1 Apr 2004 (UTC) :::::::::The whole point is that if the ''software'' were switched over to UTF, you wouldn't need to interact with these strings or know anything about them at all. They would just work as regular characters. ::::::::::I'm at an utter loss. How would I possibly be able to insert a character that isn't on my keyboard? User:RickK | User talk:RickK 04:56, 1 Apr 2004 (UTC) ::::::::::::Rick, if you're using Windows, then the Character Map applet is your friend. Find the character you want and it will either tell you how to enter it from the keyboard or allow you to copy+paste it. You'll need some nice Unicode fonts, like Junicode, but newer versions of Windows come with ''Lucida Sans Unicode'' anyway. --User:Phil Boswell | User talk:Phil Boswell 14:, Apr 1, 2004 (UTC) :::::::::::In most Windows applications, Left alt + numeric keyboard types (dec) Unicode. alt+0549 is ȥ for example. User:JorUser talk:Jor 12:21, 1 Apr 2004 (UTC) ::::::::::::The prefixing 0 is important by the way: otherwise the Windows encoding is used instead, which wraps around (alt+256 = alt+0) User:JorUser talk:Jor 12:25, 1 Apr 2004 (UTC) :::::::::::::Actually with or without 0 you don't get Unicode, but the systems ANSI and OEM codepages respectively. 
You can use Wordpad (or anything other which uses a Richedit control), type the hex number for the Unicode character, then type Alt-x.User:Pjacobi 08:43, 14 Jul 2004 (UTC) :::::::::::With a compose key, maybe, or with copy-and-paste. I keep a set of characters I need which I don't have on my keyboard on :cy:Defnyddiwr:Marnanel, and c+p them when I need them in articles. User:Marnanel 05:01, Apr 1, 2004 (UTC) :::::::::::People who use the languages in question know how to type in them. Someone who studies Sanskrit needs to be aware of how to produce the relevant unicode characters. Similarly, someone who writes mathematical articles may need to learn TeX, and someone who works in science may need to produce diagrams. You contribute what you know, it's not necessary to be an encyclopedia to contribute to an encyclopedia. That said, there's a good resource at http://www.alanwood.net/unicode/ . If you go to the test pages, you'll see a list of characters which can be copied and pasted into an edit box. -- User:Tim Starling 05:10, Apr 1, 2004 (UTC) :::::::::::If you were going to work with Sanskrit (or other languages in its family) I would suggest http://www.aczone.com/itrans/online/. Other tools would apply for other languages (there's also http://www.emeld.org/tools/charwrite.cfm for IPA in Unicode, which would offer pan-linguistic functionality of a certain kind.) Of course, it's entirely possible you'll never need to deal with nonstandard characters (in which case it shouldn't make the least differnece to you which encoding the site uses, as your keyboard will suffice in either), but those who contribute to articles that necessarily involve terms from languages that aren't representable with the characters that go into English, there's a basic need, here.User:Kukkurovaca 05:42, 1 Apr 2004 (UTC) Switching the entire project over to UTF-8 or leaving things in ISO-8859-1 are not the only two choices. 
It would be straightforward to add a user option for "Edit in UTF-8". When a logged-in user with this option set requests to edit a page, the server translates HTML character references to their UTF-8 equivalents. When the user submits the edit, the server translates non-ASCII (or non-ISO-8859-1) characters back to HTML character references for storage in the database. Users who don't set this option would see no difference. See my [http://sourceforge.net/tracker/index.php?func=detail&aid=926582&group_id=34373&atid=411195 Editing in UTF-8] feature request. — User:Gdr 12:33 2004-04-01.
:For complex scripts, this is a nontrivial operation. This would require the server to change all entities over #255 in Unicode to numeric entities when converting to ISO-8859-1, and likewise to convert all entities back to direct characters when converting to UTF-8. Let alone the problem of combining diacritics and RTL/LTR! User:JorUser talk:Jor 12:41, 1 Apr 2004 (UTC)
::I don't see the difficulty. Numeric character references are trivial to translate since HTML &#x1234; turns into Unicode U+1234 and vice versa. Named character references like &ouml; and &rarr; can be looked up in a table. There's no need to do anything with diacritics and bidirectional text. Just store and transmit the text as it was written and leave it up to the browser to render it. — User:Gdr 13:52 2004-04-01 (UTC)
:::I agree with the last part. But that, if anything, is an argument for UTF-8 only rather than for a server-side ISO-8859-1/UTF-8 conversion. Just for argument's sake, browsers that can't handle Unicode won't be affected as UTF-8 is identical to ISO-8859-1 in the first 256 characters. Any chars above that probably will not display correctly for people using archaic browsers anyway. User:JorUser talk:Jor 17:43, 1 Apr 2004 (UTC)
::::I think you misunderstand. The point of having an "edit in UTF-8" option has nothing to do with display. Pages display just fine with the current system.
The point is to ''make it easy to enter international text in browsers other than Mozilla''. If the editing page is transmitted in UTF-8, I can type international characters directly into the edit box in many browsers, including Opera, Safari, and Internet Explorer. With the current system (editing page transmitted in ISO-8859-1), I have to convert international characters into the corresponding HTML character entity references. This is tedious. — User:Gdr, 11:44 2004-04-02.
:::::Hehe- even early versions of Moz are more advanced than IE, not only when it comes to UTF-8. IE4 has patchy support, NS4 as well. Nobody editing pages in languages where UTF-8 is important uses these browsers though. A check that the posted text validates as UTF-8 makes sense imo, throwing an error otherwise. Somebody just has to write it. Volunteers? -- User:Gwicke 13:24, 2 Apr 2004 (UTC)
:::::I guess using Opera made me lazy. I just type non–West European chars like Ł or 匥, and Opera does the conversion to the HTML entity for me if the page is in a non-Unicode charset :). Thanks for clarifying! User:JorUser talk:Jor 19:55, 2 Apr 2004 (UTC)
Hi! I am a user from the French Wikipedia. I know that some of you were interested in the conversion to UTF-8. As you perhaps want to test on your personal wiki before considering the switch, here is the software to convert the MySQL dump: http://mboquien.free.fr/wikiconvert/ .
It converts:
* HTML entities, for instance &szlig; => ß, excluding on purpose &gt;, &lt;, &nbsp; and &amp;
* Unicode entities (decimal or hexadecimal), for instance &#223; => ß
* all other characters valid in your encoding are converted properly
What it doesn't do:
* badly formatted entities are not converted, typically an entity that doesn't finish with ;
* windows-1252 characters are also not converted. To have them corrected before the conversion, you can ask :fr:Utilisateur:Looxix on the French wiki. He has a very good bot to perform this kind of task, if you don't already have one.
This version is a rewrite of the one we used (which was really dirty) to convert the French wiki. I rewrote it this afternoon and tested it on an old cur dump of the French wiki; everything seems to work as expected. As for the details, it depends on Qt (no trolling about the toolkit used, please) and I ran it on Mandrake 10.0. I was told that it also compiles out of the box on Slackware. If you use another distribution, you may need to tweak the Makefile to have the correct path for Qt (you should set QTDIR correctly before trying to compile). Needless to say, you need the Qt development packages installed. Using it is quite easy. The Makefile produces a wikiconvert executable. To convert you just need to write: ./wikiconvert < dump > converteddump (if you don't use iso8859-1, there is one line to change in wikiconvert.cpp, as explained in the source). On my computer (an AthlonXP 2000+ underclocked at 1.5 GHz), converting a 90 MB dump of cur takes about 100 seconds. You should ask for a non-compressed dump of cur for your test, since the compressed dumps available at http://download.wikipedia.org/ are not suitable for conversion: once converted, MySQL can't load the dump completely (apparently a problem of lines being too long, last time I tried). I'd be very happy to get some feedback, and I would gladly accept patches to make the program faster/better. :) If you have any questions, you can reach me on #fr.wikipedia on Freenode or on my :fr:Discussion_utilisateur:Med (French or English only please). User:Med 09:41, 4 Apr 2004 (UTC)
I think the ironic thing is that Wikipedia is already using Unicode. Tagging the pages as ISO-8859-1 and forcing users to use HTML entities just takes up more bandwidth and makes the editing slower. -浪人
update: By now the Spanish and the German Wikipedia have been converted successfully to UTF-8. Only Dutch, Danish, Swedish and English still use 8859-1.
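For anyone who wants to experiment without building the Qt tool, the entity conversion rules listed in the wikiconvert description above can be approximated in Python. This is a minimal sketch under stated assumptions, not the wikiconvert source: the function name <code>decode_entities</code> and the regular expression are made up for illustration. It decodes named, decimal and hexadecimal references, leaves &amp;gt;, &amp;lt;, &amp;nbsp; and &amp;amp; alone, and skips references that lack the closing semicolon. It does not attempt the windows-1252 clean-up or the dump handling.

```python
import re
from html.entities import name2codepoint

# Named entities that the tool description says are deliberately not converted.
KEEP = {"gt", "lt", "nbsp", "amp"}

# Only well-formed references (ending in ";") are matched, so "&szlig" is skipped.
ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text: str) -> str:
    """Replace HTML character references with the characters they denote."""

    def repl(match: re.Match) -> str:
        body = match.group(1)
        if body.startswith("#"):
            try:
                if body[1] in "xX":
                    return chr(int(body[2:], 16))
                return chr(int(body[1:]))
            except (ValueError, OverflowError):
                return match.group(0)  # not a valid code point: leave untouched
        if body in KEEP or body not in name2codepoint:
            return match.group(0)  # protected or unknown name: leave untouched
        return chr(name2codepoint[body])

    return ENTITY.sub(repl, text)


if __name__ == "__main__":
    print(decode_entities("Stra&szlig;e, Stra&#223;e, Stra&#xDF;e, a &lt; b &amp; c"))
```

One open question with this sketch: numeric forms of the protected characters (such as &amp;#38;) are decoded here, which may or may not be what the real tool does.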
---- While the whole Unicode debate is going on, you might find a little tool I wrote useful. Just go to my User:Aramgutang for the source and a link to a "runnable" version. All it does is convert all non-ASCII Unicode characters you type in it into the &#0000; format. I didn't know if there was something like this already out there, so I just spent 25 minutes writing my own. --User:Aramgutang 06:46, 8 Aug 2004 (UTC) Pages with special characters
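The conversion that the tool above is described as doing (non-ASCII characters into the decimal &amp;#0000; format) can also be sketched in one line of Python using the standard <code>xmlcharrefreplace</code> error handler. This is a stand-in under that description, not Aramgutang's code, which is not shown on this page; the function name <code>to_numeric_entities</code> is made up for illustration.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # The "xmlcharrefreplace" handler emits &#NNNN; for anything ASCII cannot encode.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_numeric_entities("K\u00f6ln"))
    print(to_numeric_entities("\u03b1\u03b2\u03b3"))
```

ASCII text passes through unchanged, so the result can be pasted into a page served as ISO-8859-1 without depending on how the browser handles the edit box.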

Unicode



''From the Wikipedia:Village Pump'' ''(Note that this discussion page and the article page are covering much the same topic, so you might want to read both. User:M.e 11:03, 14 Jul 2004 (UTC))''
I'm pretty sure I already know the answer, since I've already looked at the UNICODE and HTML pages, but I'm asking anyway. Is there any consistent way to indicate dot-under characters in the wiki and in HTML in general? I know, dot-unders are not part of ISO-Latin-1, but they're an important part of transliterating Persian. It's not uncommon to ignore them, and that's what I have done, but I've tried _hard_ to get the orthography correct on the various Bahá'í stuff I've done, and the lack of the dot-unders is annoying to me. -- thanks in advance (and yes, I know that sometimes the answer is "no"). User:Rboatright 05:05 Feb 24, 2003 (UTC)
:The dot-under characters are mostly in the "Latin Extended Additional" area of Unicode. Using number codes should do the trick. They fall in the hex range 1E00-1EFF. Thus Hex 1EA1 = Decimal 7841 gives "ạ", Hex 1E05 = 7685 gives "ḅ", etc. User:Eclecticology 07:49 Feb 24, 2003 (UTC)

== Greek unicodes ==
I have placed a set of Greek alphabet unicodes at the foot of my User page for anyone who works on Greek-related articles and shares my inability to memorise them. User:Adam Carr 03:12, 23 Apr 2004 (UTC)
: Wouldn't it be best to use HTML entities, for backwards compatibility? User:Dysprosia 10:28, 23 Apr 2004 (UTC)
::Plus they are a lot easier to remember. User:Theresa knott 11:01, 23 Apr 2004 (UTC)
: HTML entities are hard to edit and look ugly in the editing window, not to mention that they are SGML only, and that Unicode can just be copied and pasted in any text editor. User:JorUser talk:Jor 12:21, 23 Apr 2004 (UTC)
:What was wrong with the Unicode tables in the Greek alphabet article?
User:Gdr 11:56, 2004 Apr 23 (UTC)
:: There is nothing ''inherently'' wrong with Unicode, but most people who are on non-Unicode compliant systems can't see Unicode glyphs. User:Dysprosia 12:05, 23 Apr 2004 (UTC)
::: But people using those archaic systems won't be able to access most non-US ASCII websites anyway. Why punish everyone to cater to a very small minority which probably has no interest in reading Greek in the first place? User:JorUser talk:Jor 12:21, 23 Apr 2004 (UTC)
:::: That doesn't mean we should actively seek to prevent users on different, non-Unicode-compatible systems from reading the text. I was somewhat sure that Windows 9x versions were not natively Unicode compatible, but [http://www.microsoft.com/globaldev/handson/dev/mslu_announce.mspx] seems to suggest that this is the case.
:::: In any case, how are the HTML entities "punishment" in comparison to the Unicode glyphs? One would think that the numerical Unicode entity would be more painful to enter than the slightly more intuitive HTML text-based entity... User:Dysprosia 12:53, 23 Apr 2004 (UTC)
:::::You can't save Unicode characters into articles on en; the encoding is ISO 8859-1. If you paste in a Unicode character, or type it somehow, most browsers will automatically convert it to a numeric character entity. You can type in Unicode if you wish, but it means that numeric character entities will be saved (e.g. &#945;) rather than the more readable named character entities, e.g. &alpha;. Unicode support in browsers is irrelevant. -- User:Tim Starling 01:15, Apr 24, 2004 (UTC)
: I don't think the named entities are really necessary for typing Greek text: they exist mostly by accident, because Greek letters are used as symbols in a lot of other areas. We type Cyrillic using the numeric entities, for example, because that's the only way to do it, and it doesn't seem like doing the same for Greek is somehow worse.
Furthermore, it is not possible to write correct Greek text using only the named entities, because no entities are provided for accented characters, and nearly every Greek word has at least one accent in it (and spelling it without the accent is not correct). Writing a word using all named entities except for one numeric entity in the middle would be kind of odd. --User:Delirium 02:50, Apr 28, 2004 (UTC) == Which Unicode characters can/should we use? == I started a few weeks ago changing various Greek language entries (e.g. in the top line of Jesus, I put Greek language Ἰησοῦς Χριστός Iēsoûs Khristós) to display the proper accent marks. This displays fine in Mozilla. But when I try to display the same pages in Internet Explorer all I get is little squares not Greek letters. Is there an official Wikipedia policy on which Unicode characters we should and should not use? User:M.e 10:58, 24 Jun 2004 (UTC)/User:M.e 08:12, 9 Jul 2004 (UTC) :I can see a few question marks in between the aramaeic spelling, and I have the rather complete MS Arial Unicode font installed. The different display in Mozilla or IE might be a font selection problem, maybe you have set your Mozilla to use a different default font? I am not aware on any official policy on unicode, only that we should limit ourself to the original and the english spelling, as there is not much point in having the Cyrillic spelling of someplace in Greece. If it displays better in most cases you can try it without the accent marks, maybe put the correct version enclosed in a HTML comment behind it. User:Ahoerstemeier 11:33, 24 Jun 2004 (UTC) ::The Mozilla is on Linux and the MSIE is on Windows XP, so I'm not surprised to get different results. I know that some users will be reading Wikipedia using Mosaic on Windows 1.0 and some will have the complete Unicode everything installed. I'm not sure how to strike a compromise in between. 
User:M.e 12:02, 24 Jun 2004 (UTC)
:IE displays a subset of the Unicode characters that Mozilla displays on the same machine with the same operating system and the same font. I think this is because Mozilla has a better developed character code mapping table (it's had three years' more development). User:MrJones 14:07, 24 Jun 2004 (UTC)
:You might find [http://meta.wikipedia.org/wiki/MediaWiki_User%27s_Guide:_Creating_special_characters this page on meta] useful. User:Theresa knott 14:02, 24 Jun 2004 (UTC)   — ''thank you, Theresa, I've read it now; I have been creating the characters using &#xffff;, I was wondering which characters I should and should not use. User:M.e 10:45, 25 Jun 2004 (UTC)''
::I suggest that MS Arial Unicode is perhaps the ''worst'' font for page compatibility tests because, although it is probably the most complete Unicode font commonly available, it is limited to only those who have a Microsoft product like MS Office 2000 or later installed on their MS Windows IBM-compatible computer. Even though this probably includes more than half the computer user population of the world, it leaves out a huge minority as well. (Personally, I've never gone beyond Office 97, having no compelling reason to pay the huge expense.) Microsoft doesn't seem to offer it as a separately downloadable font, even for a price. (Just another of the thousands of little ways it encourages everyone to buy its major software products.) -- User:Jeffq 21:09, 24 Jun 2004 (UTC)
:::Are there any good alternative fonts that are more widely available? Also, is the En wiki ''ever'' going to go UTF-8 like all the others? User:Kukkurovaca|User talk:Kukkurovaca 21:13, 24 Jun 2004 (UTC)
:::: [http://www.alanwood.net/unicode/index.html#intro Alan Wood's Unicode Resources] page is an excellent resource for Unicode font issues.
His "Introduction" section includes a set of links in the line reading: "Lists of fonts for Windows, Mac OS 9, Mac OS X 10 and Unix, with the Unicode ranges they support, and where to obtain them." -- User:Jeffq 11:22, 25 Jun 2004 (UTC) :::::On the basis of this, it appears that IE is rendering Greek but not Extended Greek. According to [http://www.alanwood.net/unicode/index.html the Alanwood pages] that User:Jeffq referred to, Arial Unicode MS should render both Greek and Extended Greek correctly. Does the Wikipedia CSS force IE to another font that does not have Extended Greek? Also, I notice that Wikipedia pages have charset=iso-8859-1 in the header, but I presume this doesn't matter as I am coding my characters as &#x0000; codes rather than directly inserting the characters themselves. ::::: I suppose this means we need a rule that says only use the characters supported by Arial???? User:M.e 10:03, 27 Jun 2004 (UTC) ::::::Font rendering is an incredibly complex, multidimensional problem that is far from being adequately solved, especially for a global Web resource like Wikipedia. You can't really speak of what IE will render; you've got to specify what version it is, what platform you're running on, what fonts you have installed (by manufacturer name, not style), how your browser is configured to render certain types of fonts, what language it's set to, and so on. (I can see that you, User:M.e, know much of this already, but I state it h