UTFEightCurrentProblems

Currently known UTF-8 problems in Ubuntu

  • A lot of web pages, like http://se.php.net/manual/sv/introduction.php, use the ISO-8859-1 charset for international/special characters without specifying any charset (which can be done in the <meta http-equiv="content-type" ... > tag). Maybe ISO-8859-1 should be the default here?

    • ["UTFEightByDefault"] does not mean using it everywhere. In places where the encoding is specified unambiguously, other encodings are fine. RFC2616 defines ISO-8859-1 as the default encoding for HTML, so the current situation appears OK.
    • The HTML4 spec, however, recommends not relying on this default, so adding a meta tag would be nice.
    • If HTML pages are translated to UTF-8 a meta tag would be REQUIRED.
    • Another alternative is to use numeric entity references for all non-ASCII characters.
    • HTMLTidy is your friend! http://www.w3.org/People/Raggett/tidy/
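For the numeric-entity-reference alternative above, Python's codec error handlers can do the conversion mechanically. A small sketch (the Swedish sample string is just a hypothetical illustration):

```python
# Replace every non-ASCII character in an HTML fragment with a
# numeric character reference, so the page displays correctly no
# matter which charset the browser assumes.
text = "Introduktion på svenska"  # hypothetical sample text
encoded = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(encoded)  # Introduktion p&#229; svenska
```

The resulting markup is pure ASCII, so it renders identically under ISO-8859-1 and UTF-8.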

  • Midnight Commander doesn't work well in UTF-8 locales. The development team's roadmap considers fixing this in the upcoming version 1.7.0, which will be developed after the 1.6.1 bugfix release. Patches already exist, so a 1.7.0-pre1 could be out soon after the 1.6.1 release.
  • LaTeX can handle UTF-8 via \usepackage[utf8]{inputenc}; however, this is apparently not enough for languages like Greek and Japanese. Solutions are welcome.
  • GTK+ 1.x unicode capabilities are very limited, GTK+2 is a better choice. However several applications are still using GTK+1, so these will have to be ported or replaced by a better solution.
    • XMMS can be replaced by Beep Media Player, which is a GTK+2 port of XMMS itself.
    • GnuCash (even though in universe) does have a development branch for Gnome 2.x compatibility, but it's nowhere near ready. The project needs manpower for that!
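A minimal preamble sketch for the LaTeX point above: the usual form is the inputenc package with the utf8 option. The babel and CJK package names are mentioned only as assumptions about what the extra language support might look like, not as a confirmed recipe:

```latex
% Minimal sketch: load inputenc with the utf8 option so LaTeX
% accepts UTF-8 source files directly.
\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
Grüße!  % Latin-script text works; Greek and Japanese additionally
        % need font/language packages (e.g. babel or CJK).
\end{document}
```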

  • No available MP3 tagging utility handles Unicode in ID3v2 tags correctly (ID3v1 has no Unicode capabilities at all). Many of them accept UTF-8 text, but the encoding bit is not set. The problem seems to originate in id3lib's incomplete Unicode support. The solution is to port the tools from id3lib to mplib or taglib.
    • eyeD3 (http://eyed3.nicfit.net/) can work with Unicode in tags. It's a command-line tool written in Python.

    • Please note that to use UTF-8 in MP3 tags, you need ID3v2.4. Earlier versions, like ID3v2.3, can only use UTF-16 as a Unicode-based encoding.
    • I've just done a patch to let lame handle Unicode in tags. You can find it under "Patches" here: http://sourceforge.net/projects/lame. The patch doesn't use UTF-8 in the tags. It uses UCS-2 (not UTF-16), which is included in id3v2.2, at least.
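The "encoding bit is not set" failure described above can be reproduced in a few lines. This is a sketch of the symptom, not tied to any particular tagging library; the track title is a hypothetical example:

```python
# A broken tagger writes UTF-8 bytes but leaves the ID3v2 frame's
# encoding byte at 0x00 (ISO-8859-1), so a compliant reader decodes
# the bytes as Latin-1 and shows mojibake.
title = "naïve"                      # hypothetical track title
raw = title.encode("utf-8")          # bytes the broken tagger stores
shown = raw.decode("latin-1")        # what a reader trusting the flag sees
print(shown)  # naÃ¯ve
```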

  • What's wrong with UTF-16? What's really important is which encoding is used in MP3s "out in the wild". If they use UTF-16, then that's the encoding that should be used, not UTF-8. Does anyone know what the most common *unambiguous* encoding is? (The most common encoding is probably just the codepage of the machine they were created on, without specifying what it is...)
  • ispell is not locale-aware, so, for example, I cannot use ilithuanian from the Hoary universe (which contains a dictionary in ISO-8859-13) to check the spelling of Lithuanian texts in UTF-8.
  • aspell 0.50 does not support UTF-8 (quote: "aspell 0.50 completely fails to even check the non-UTF-8 parts"); aspell 0.60 does. The status of aspell 0.60 in Debian is described at http://bugs.debian.org/274514.

  • I don't know if it's come up at all, but... UTF-8 kind of sucks if you're using a non-Latin-1 script. It handles Unicode characters less than 255 as single characters, at the expense of all the other characters above that. So speakers of Greek, Russian, Chinese, Korean, Japanese, most other Asian languages, Arabic, Hebrew, etc. etc. all get screwX0r'd when they get forced to use UTF-8 instead of Unicode. IWBNI there was support for regular-old uncompressed unicode for those languages that don't use Latin-1 as their default charset. --EvanProdromou

    • Not quite. UTF-8 handles Unicode characters less than 128 as single bytes, and Unicode characters less than 65536 as no more than 3 bytes. So Greek, Russian, Arabic and Hebrew are slightly shorter in UTF-8 than in UCS-2 or UTF-16, and Chinese, Korean and Japanese are less than 50% longer (characters above 65535 are rare even in Chinese).
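The byte counts in the reply above are easy to verify; the sample characters below are chosen here purely as illustrations of each script:

```python
# Per-character UTF-8 byte counts: ASCII = 1, Greek/Cyrillic = 2,
# CJK (Basic Multilingual Plane) = 3.
samples = {"a": 1, "α": 2, "я": 2, "中": 3}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
# UCS-2 / UTF-16 (without BOM) uses a flat 2 bytes for all of these:
assert all(len(ch.encode("utf-16-le")) == 2 for ch in samples)
print("byte counts check out")
```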

CategoryArchive

UTFEightCurrentProblems (last edited 2008-08-06 16:36:43 by localhost)