Menneisyys
10-13-2006, 06:22 PM
Web authors, webmasters and Pocket PC Web browser users, attention: everything you need to know about internationalization and "special character" issues in current PPC Web browsers
Over at the AximSite forums, I was presented with an interesting bug in the famous Pocket PC Web browser NetFront (http://www.aximsite.com/boards/showthread.php?p=1210710), which made me experiment with the internationalization (i18n for short) issues of all Pocket PC Web browsers (and, for that matter, the three most important desktop Windows ones). I've long been planning a test like this to see how Pocket PC Web browsers compare to desktop browsers in terms of i18n issues.
Elaborating on these issues is not just an über-geek, useless waste of time but can prove very useful if you, for example, speak a non-Western language and would like to read pages written in them or post messages on Web boards in the given language using your Pocket PC.
You can run into these problems even if you don't plan to post in any non-Western language or on non-English forums / pages, as can be seen in the above-linked AximSite example. All it takes is for a poster to use Word to compose his or her posts or articles, and you end up seeing square characters (or simply nothing) instead of apostrophes and other special, but otherwise Western, characters if you use NetFront as a client. (Note that, earlier, Minimo also had very similar problems with UTF-8-encoded pages, which I've elaborated on here (http://www.pocketpcmag.com/blogs/index.php?blog=3&p=553&more=1&c=1&tb=1&pb=1). These have, in the meantime, been fixed, thanks to my bug reports. Also, very early versions of Minimo couldn't render non-Western characters on any 8-bit-encoded page, as you can also see in my well-known Web Browser Bible. Those problems have long been fixed.)
Note that, because of my inability to speak / write / read any Middle-Eastern language (Arabic, Hebrew) or to write / read Far-Eastern languages like Chinese or Korean, I could only check non-Western, but still left-to-right, languages like Russian. That is, I'm unable to elaborate here on the issues of these Web browsers outside the Western / Central / Eastern-European language groups. Sorry for this - not even I can speak more than 7-8 (European ones / Japanese) languages :)
1. URLs with accents
My first test was finding out whether you can enter URLs with accents in them into a given browser. (I recommend, for example, this article (http://filips.net/archives/2005/12/01/why-nordic-characters-appear-to-work-only-in-some-web-browsers/) on the subject for more info. I've used its test URL (http://www.filips.net/åäö.html).)
Note that it's highly unlikely you'll ever see any URLs with accents in them (that is, this problem is pretty much non-existent); still, it's nice to know which browsers are able to render these pages. Being able to use the built-in PIE as a "fallback" browser in these cases is highly advantageous.
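Under the hood, a standards-conforming browser handles such a URL by converting the accented path segment to UTF-8 bytes and percent-encoding them before sending the request. As a quick illustration (a sketch of my own, not taken from the linked article), Java's standard URLEncoder shows the bytes that actually travel on the wire:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class AccentedUrl {
    // Percent-encode a path segment the way a UTF-8-aware browser does
    // when requesting an accented URL such as http://www.filips.net/åäö.html.
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Each accented character becomes two percent-encoded UTF-8 bytes
        System.out.println(encodeSegment("åäö"));
    }
}
```

Browsers that fail this test typically send the raw 8-bit bytes of the local codepage instead, which the server cannot reliably map back to a file name.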
The results (+ means compatible, - means not compatible):
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t1.html)
As can be seen, only PIE (unlike its desktop counterpart) and Minimo support this (which I expected, given that Minimo is the closest to its desktop version of all the Pocket PC browsers available).
(Note that, on desktop Windows, IE7 RC1 and Opera 9.02 don't support this by default without explicit reconfiguration (see the above article on this); Mozilla / Firefox does.)
2. Displaying non-standard Western and non-Western characters
The second set of compliance tests is way more interesting and important than the first.
Note that this explanation will be a bit on the technical side; without some knowledge of HTTP and the HTML meta tags, you should skip it and move straight to the summary column of the final chart (and the section following the chart). That is, don't read the following section if you don't know what HTTP is! Web browser developers (particularly those from Access!), website administrators and Web authors, on the other hand, should definitely read it in order to be absolutely sure the non-standard characters (again, these can be "plain" punctuation created by Word, not just non-Western languages!) contained in their documents are correctly rendered by all the browsers.
2.1 Test method
This test shows
whether the given browser takes into account the value of
the "Content-Type" HTTP response header
the http-equiv meta tag
what the browser assumes (what it defaults to) when neither of them is defined (which is very often the case)
and, if both are present with different values, whether the meta tag overrides the HTTP response header.
For the test, I've written a custom HTTP server emulator accessed from all the tested Web browser applications. As usual, I'm making the source available (http://www.winmobiletech.com/sekalaiset/CharsetHTTPEmu.java) so that you can freely test it if you prefer.
2.1.1 How should my custom server emulator be used?
The application listens to incoming HTTP requests at port 82. It requires a custom parameter (as extra path info - that is, you don't need to use the ? but / instead) in the following form:
http://127.0.0.1:82/xyz
(change 127.0.0.1 to the Internet address of your desktop PC if you want to access it from your PDA)
where
x tells the server emulator to set the 'charset' attribute of the Content-Type HTTP response header. (Doesn't set it at all if you pass an 'N' instead.)
y tells the server emulator to set the 'charset' attribute of the "Content-Type" http-equiv meta tag. (Doesn't set it at all if you pass an 'N' instead.) This is the only way for a plain user (not having access to the Web server configuration) to set a charset for a given page.
z tells the server what character encoding to use internally. In most cases, you can safely keep it as '2' for the test.
All values are one-digit numerals; I've tested the browsers with the values '1' (Western charset) and '2' (Central-European charset). As has already been mentioned, for the first two digits, you can also supply 'N' (which stands for 'No', 'Doesn't exist' or 'Not known'); then, the given HTTP header / http-equiv tag won't be set / returned. If you set something other than '1' or '2' for the 'z' digit, it'll default back to '1'.
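The gist of the emulator can be sketched as follows (a simplified reconstruction of my own, not the actual CharsetHTTPEmu.java; class and method names are made up for illustration). The x digit decides whether a charset appears in the Content-Type header, the y digit whether one appears in the http-equiv meta tag:

```java
public class CharsetResponse {
    // '1' = Western, '2' = Central-European, anything else = not set
    static String charsetFor(char digit) {
        if (digit == '1') return "ISO-8859-1";
        if (digit == '2') return "ISO-8859-2";
        return null;
    }

    // Build the raw HTTP response for a /xyz request: x controls the header
    // charset, y controls the meta tag charset ('N' omits either of them).
    static String buildResponse(char x, char y) {
        String headerCharset = charsetFor(x);
        String metaCharset = charsetFor(y);

        StringBuilder sb = new StringBuilder();
        sb.append("HTTP/1.0 200 OK\r\n");
        sb.append("Content-Type: text/html");
        if (headerCharset != null) sb.append("; charset=").append(headerCharset);
        sb.append("\r\n\r\n");

        sb.append("<html><head>");
        if (metaCharset != null) {
            sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
              .append(metaCharset).append("\">");
        }
        sb.append("</head><body>test rows go here</body></html>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildResponse('1', 'N')); // header charset only
    }
}
```

The real emulator additionally listens on port 82 and, via the z digit, picks the character encoding Java uses to serialize the response bytes.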
For example, if you enter http://127.0.0.1:82/111 in your Internet Explorer browser running on the same machine as the server emulator, you'll see this (http://www.winmobiletech.com/102006Browseri18n/IE7RC1-11N.png).
Here, there are three rows of special interest (the fourth, date row is only included so that you can be absolutely sure the browser isn't just returning a cached document):
"8-bit 8859-1-only punctuation marks" contains strictly 8859-1 punctuation marks. You should see real punctuation marks (if the page is rendered as an 8859-1 page) after this introduction: no squares, no question marks, nothing of the sort.
"Central-European chars (will ONLY work with the third parameter being 2):" contains two Central-European characters. You'll see them rendered in three ways: as question marks (if you pass anything but '2' as the third, 'z' parameter), as http://www.winmobiletech.com/102006Browseri18n/88951-WesternEncoding.png (the closest Western rendering of these characters) and, finally, as http://www.winmobiletech.com/102006Browseri18n/8859-2-Encoding.png, which is how they should be rendered. (Or, in Thunderhawk, as a - mark (hyphen), because its custom character set doesn't contain any non-Western characters.)
Finally, the third row will always be displayed the same way because it explicitly uses HTML Unicode character entities, which aren't affected by the charset setting. That is, it'll always be rendered as http://www.winmobiletech.com/102006Browseri18n/Unicode-Encoding.png
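Numeric character references sidestep the charset question entirely, because the number in the reference is the Unicode code point itself, independent of how the page's bytes are labeled. As a quick illustration (a hypothetical helper of my own, not code from the test page), this converts any non-ASCII character into such a reference:

```java
public class Entities {
    // Replace every non-ASCII character with a decimal HTML character
    // reference (&#NNN;), which renders the same under any page charset.
    static String toNumericEntities(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) sb.append(c);
            else sb.append("&#").append((int) c).append(';');
        }
        return sb.toString();
    }
}
```

The obvious downside is page size: every non-ASCII character balloons into 6-8 bytes of markup, which is why entities are only practical for the occasional special character, not whole non-Western documents.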
2.2 The chart
In the following chart, I've tested all the available combinations (keeping 'z' as '2' in all cases - again, it has no direct effect on any HTTP header or HTML meta tag, only on the way Java encodes the returned contents). This way, I was able to see
whether the http-equiv meta tag overrides the "Content-Type" HTTP response header (nope - only in NetFront. This is a big difference between NetFront and every other Web browser and should be fixed by the NetFront people! The other browsers only take the meta tag into account if there is absolutely no charset parameter in the HTTP Content-Type response header)
if neither of the two alternatives is used, what charset the browser defaults to (fortunately, the Western charset in all cases)
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t2.html)
2.2.1 Explanation for the chart
2.2.1.1 NetFront and Content-Type (charset) overriding: different from the approach of all other browsers!
Again and again, NetFront works differently from all other browsers, header-overriding-wise. That is, something that works in a desktop browser or any other Pocket PC browser will not necessarily work in NetFront if the HTML document contains a Content-Type meta tag: NetFront, unlike any other browser, will override the encoding of the page with the value found in this meta tag. This seems to be the reason why several people have reported character encoding problems with NetFront.
It should also be pointed out that, the non-standards-compliant overriding aside (which should be fixed by the Access folks - the developers of NetFront - as soon as possible), NetFront isn't able to display any extended punctuation mark contained in the 8859-1 codepage if you don't use the default (and ugly) Courier New font but switch to a proportional one. No matter what language info you return from the Web server, these characters will just not be displayed. A quick fix for this problem (until the Access folks fix this bug) is forcing the browser to use, say, the Central-European (windows-1250 as opposed to 8859-2), Baltic or Greek encoding (but not UTF-8, which hides all these chars), as can be seen here (http://www.winmobiletech.com/102006Browseri18n/NF33-NN-ForcedCentralEuMode.bmp.png) (note that the punctuation is now displayed in the background, on the Web page).
2.2.1.1.1 How should Web administrators treat NetFront clients?
This also means that if you're a Web hoster / Web author and would like your NetFront users to be able to browse your otherwise Western (8859-1) pages, and there aren't any, say, French or Spanish names / texts on the website (with all those funny accented characters), you should consider marking these pages as, say, Central-European (windows-1250) or Baltic - if and only if the client is using NetFront (you can easily detect this by checking the User-Agent HTTP request header, as has already been explained in several of my User-Agent header-related articles; for example, this one (http://www.pocketpcmag.com/blogs/index.php?blog=3&p=796&more=1&c=1&tb=1&pb=1)).
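Server-side, such a workaround boils down to one conditional on the User-Agent header. A minimal sketch (my own illustration; the substring match and the charset choices are assumptions - adapt them to your own pages and the exact NetFront User-Agent strings you see in your logs):

```java
public class CharsetByClient {
    // Pick the charset label to return for a Western (8859-1) page,
    // working around NetFront's missing extended 8859-1 punctuation.
    static String charsetFor(String userAgent) {
        if (userAgent != null && userAgent.contains("NetFront")) {
            // Serve a windows-1250 label instead of 8859-1. This assumes
            // the page contains no accented Western characters that the
            // relabeling would remap to the wrong glyphs.
            return "windows-1250";
        }
        return "ISO-8859-1";
    }
}
```

The returned value would then be appended to the Content-Type header (or the http-equiv meta tag) when serving the page.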
2.2.1.2 How should Web administrators treat Thunderhawk clients?
Incidentally, speaking of server-side User-Agent checking: if you operate a, say, Web site offering content in a Central-European language (that is, a language sufficiently close to Western languages alphabet-wise - one where using Western characters instead of some special, local characters is acceptable; for example, using http://www.winmobiletech.com/102006Browseri18n/88951-WesternEncoding.png instead of the "official" http://www.winmobiletech.com/102006Browseri18n/8859-2-Encoding.png), then, upon sensing that the client is using Thunderhawk, you may want to force the content encoding back to 8859-1. That is, return the content as a plain 8859-1 (Western) document. This will make sure non-Western characters are converted back to their closest Western equivalents, resulting in a far better user experience.
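The "closest Western equivalent" conversion itself can be done with a simple substitution table. A sketch of my own, using Hungarian ő / ű (which don't exist in 8859-1) as example pairs - a real table would cover every non-Western character of the language in question:

```java
public class Westernize {
    // Map Central-European characters missing from 8859-1 to their
    // closest Western look-alikes (example pairs only; a production
    // table would be considerably larger).
    static String toClosestWestern(String s) {
        return s.replace('ő', 'ö').replace('ű', 'ü')
                .replace('Ő', 'Ö').replace('Ű', 'Ü');
    }
}
```

After this pass, the page body is valid 8859-1 and even a Western-only character set like Thunderhawk's can render it without hyphens or question marks.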
2.2.1.3 Generic advice for Webadmins of web hosters: NEVER set the charset attribute in the Content-Type HTTP response header!
Also, the results clearly show that, if you're providing a Web hosting service to customers, you should never set the 'charset' attribute in the Content-Type HTTP response header, because your customers won't be able to override it with their own encodings. For example, if you're a Central-European Webspace provider and return the local, non-Western charset as the default straight in the HTTP header, you'll make all your customers unable to provide content in the Western-European charset.
The situation is the same in the opposite direction: if you set the Western-European charset and you still have for example Russian customers that would like to publish their Russian pages on your server, most Web browsers won't be able to render these, not even if these folks explicitly try to override your encoding settings.
A real-world example: earlier, all this was a huge problem with my current webspace provider. Due to cost considerations (I didn't want to pay big bucks to host a webpage (http://www.winmobiletech.com/) I don't use for commercial stuff, just as a database back-end for my articles, images and other downloads), I've chosen a very cheap Central-European webspace provider (compared to, for example, Finnish Web hosting fees). Before July 2006, however, it also set the above-mentioned 'charset' attribute to the Central-European encoding, which made it impossible to put Word-generated/-exported English-language pages on my site without first changing all the non-standard (extended) punctuation marks to their non-extended (and, therefore, less spectacular) counterparts. That is, when I, for example, posted a comparison chart HTML file there (I can't include comparison charts in Pocket PC discussion forums because of the forum restrictions, or wide charts in my Smartphone and Pocket PC Magazine Blog (http://www.pocketpcmag.com/blogs/) - in these cases, I must link them from my back-end), I always had to change these characters back to their non-extended versions (with a generic search-and-replace in, say, Notepad).
Note that you can avoid all this hassle in Microsoft Word, before starting to write your article / post, by disabling the two checkboxes ("Straight quotes" with "smart quotes" and Hyphens (--) with dash (-)) in Tools / AutoCorrect Options / AutoFormat As You Type (http://www.winmobiletech.com/102006Browseri18n/WordDisableAutoformat.png). If you want to avoid problems with, for example, your NetFront or non-Western readers, make sure you do this (that is, disable the two checkboxes).
Now that this annoyance has been removed during a recent Web server update, I can post anything without mass-replacing 'special' characters (or disabling autocorrection in Word).
All in all: the inability to override the HTTP response header (in everything but NetFront) means a decent Web hoster should never return the language / character encoding parameter for his / her customers' Web pages if there is a chance users would want to publish pages in a different encoding. This is, fortunately, how the majority of current Web hosters behave.
3. Form Posting
The second main problem area is not displaying non-Western characters but posting such content to Web servers via Web forms. In these tests, I've used exactly the same input (punctuation, non-Western characters), and I've also added Western accented characters like ä and ö: I've posted these characters to the Web server and checked what they became.
I've used two sites for this purpose: my PPCMag blog (http://www.pocketpcmag.com/blogs), which is 8859-1 (Western charset) by default, and a Central-European server using the windows-1250 (Central-European) encoding on all its pages. (The page's encoding setting has a direct effect on what is uploaded back to the server from a form.)
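The reason the page encoding matters is that browsers conventionally submit form data in the same charset the page was rendered in, percent-encoding the resulting bytes. The difference is easy to demonstrate (my own illustration of the general mechanism, not a test from the charts):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncoding {
    // Encode a form field value the way a browser typically would when
    // submitting from a page served in the given charset.
    static String encodeField(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            return null; // unknown charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeField("ä", "ISO-8859-1")); // one byte
        System.out.println(encodeField("ä", "UTF-8"));      // two bytes
    }
}
```

A character that doesn't exist in the page's charset at all (say, a Central-European ő posted to an 8859-1 page) has no such byte representation - which is exactly where the Pocket PC browsers diverge: dropping it, converting it, or substituting a question mark.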
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t3.html)
As can clearly be seen, the situation isn't at all bad with the three desktop browsers (I've encountered no compatibility issues at all - all special and even Central-European characters were visible in their original - unconverted! - form after posting, even when posted to an 8859-1 server).
With Pocket PC-based browsers, on the other hand, posting special / non-Western characters turned out to be much more problematic, particularly - as opposed to the desktop case - when posting to an 8859-1 (Western) server. There, only Thunderhawk was able to upload these characters; all the other clients either didn't upload them at all or converted them.
The explicitly windows-1250 server fared a bit better as far as NetFront is concerned: there, it was able to upload non-standard Western punctuation and non-Western accents.
Unfortunately, Opera Mobile has never been able to upload any kind of extended Western punctuation. This is a real bug that should be fixed.
4. Non-8-bit file formats
In addition to 8-bit file formats (ISO-8859-1, windows-1250 etc.), there are some other, non-8-bit file formats. One of them is Unicode, of which a test page is here (http://www.winmobiletech.com/sekalaiset/i18n/Unicode.html). Another is UTF-8 (test page here (http://www.winmobiletech.com/sekalaiset/i18n/utf-8.html); OK, I know this is, technically, an 8-bit file format, using 2 or 3 bytes for extended 8859-1 or Unicode characters. I didn't want to create a different category for it.)
The former is almost never used on the Web (although it's possible some, say, Chinese or Japanese sites will use it); the latter is used pretty extensively in the non-Western language areas. Its penetration in Central Europe (excluding languages using Cyrillic characters) may be 10-20% (because the special characters of these languages are easy to map into an 8-bit character table); in China / Japan, and for other languages using alphabets containing Kanji and other special (and numerous) characters, it's around 100%.
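The variable byte width mentioned above is easy to verify: in UTF-8, plain ASCII stays 1 byte per character, 8859-style accented Latin characters take 2, and most CJK characters take 3. A quick check in Java (my own illustration):

```java
import java.io.UnsupportedEncodingException;

public class Utf8Width {
    // Number of bytes a single character occupies in UTF-8.
    static int utf8Bytes(String oneChar) {
        try {
            return oneChar.getBytes("UTF-8").length;
        } catch (UnsupportedEncodingException e) {
            return -1; // never happens: UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("a"));  // ASCII
        System.out.println(utf8Bytes("é")); // accented Latin
        System.out.println(utf8Bytes("漢")); // CJK
    }
}
```

This also explains the adoption figures: UTF-8 leaves pure-ASCII (and mostly-ASCII Central-European) text almost unchanged in size, while for CJK text any encoding is multi-byte anyway, so there is little reason not to use it.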
In this test, I've checked whether the Pocket PC Web browsers are able to read these pages (see the above two links if you want to give them a try). As can be seen, the situation is pretty good: the common UTF-8 is read by all browsers. IEM and Minimo fail to render Unicode files, though.
This is, again, not a big problem at all - I've yet to see a Web page that uses Unicode instead of UTF-8. Note that this is one of the very few differences between Minimo and its desktop big brother, Mozilla / Firefox. The latter, like desktop Opera and Internet Explorer, is able to render Unicode files too.
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t4.html)
Over at the AximSite forums, I've been presented an interesting bug in famous Pocket PC Web browser NetFront (http://www.aximsite.com/boards/showthread.php?p=1210710), which made me experiment with the internationalization (i18n for short) issues of all Pocket PC Web browsers (and, for that matter, all the three most important desktop Windows ones). I've long been planning a test like this to see how Pocket PC Web browsers compare to desktop browsers in terms of i18n issues.
Elaborating on these issues is not just an über-geek, useless waste of time but can prove very useful if you, for example, speak a non-Western language and would like to read pages written in them or post messages on Web boards in the given language using your Pocket PC.
You can run into these problems even if you don't plan to post in any non-Western language or non-English forum / pages as can be seen in the above-linked AximSite example. All it takes a poster to use Word to compose his or her posts or articles and you end up seeing square characters (or simply nothing) instead of apostrophes and other, special, but otherwise Western characters if you use NetFront as a client. (Note that earlier, Minimo also had very similar problems with UTF-8-encoded pages I've elaborated on in here (http://www.pocketpcmag.com/blogs/index.php?blog=3&p=553&more=1&c=1&tb=1&pb=1). These have been in the meantime, thanks to my bug reports, fixed. Also, very early versions of Minimo couldn't render non-Western characters on any page encoded in 8-bit as you can also see in my well-known Web Browser Bible. Those problems have long been fixed.)
Note that my inability to speak / write / read any Middle-East language (Arabic, Hebrew) and write/read Far-East languages like Chinese or Korean, I could only check non-Western, but still left-to-right languages like Russian. That is, in here, I'm unable to elaborate on the issues of these Web browsers outside the Western / Central / Eastern-European language groups. Sorry for this - not even I can speak more than 7-8 (European ones / Japanese) languages :)
1. URLs with accents
My first test was finding out whether you can enter URL's with accents in them into a given browser. (I recommend for example this article (http://filips.net/archives/2005/12/01/why-nordic-characters-appear-to-work-only-in-some-web-browsers/) on the subject for more info. It's its test URL (http://www.filips.net/åäö.html) that I've used.)
Note that it's highly unlikely you'll ever see any URL's with accents in them (that is, this problem is pretty much non-existent); still, it's nice to know which browsers are able to render these pages. Yes, being able to use the built-in PIE as a "fallback" browser in these cases is highly advantageous.
The results (+ means compatible, - means not compatible):
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t1.html)
As can be seen, only PIE supports this (as opposed to the desktop version) and Minimo (which I've expected, given that Minimo is the closest to its desktop version of all the Pocket PC browsers available).
(Note that, on he desktop Windows, Desktop IE7 RC1 and Opera 9.02 don't support this by default without explicit reconfiguration (see the above article on this); Mozilla / Firefox does.)
2. Displaying non-standard Western and non-Western characters
The second set of compliance tests is way more interesting and important than the first.
Note that this explanation will be a bit on the technical side; without some knowledge of HTTP and the HTML meta tags, you should skip the explanation and move straight to the summary column of the final chart (and the section following the chart). That is, don't read the following section if you don't know what HTTP is! Web browser developers (particularly those from Access!), website administrators and Web authors, on the other hand, should definitely read it in order to be absolutely sure the non-standard characters (again, it can be "plain" punctuation created by Word, not just non-Western languages!) contained in their documents are correctly rendered by all the browsers.
2.1 Test method
This test shows
whether the given browser takes into account the value of the
"Content-Type" HTTP response header
http-equiv meta tag
what the browser assumes (what it defaults to) when neither of them are defined (which is very often the case)
on the other hand, if there are both of them with different values, does the metatag-based override the HTTP response header.
For the test, I've written a custom HTTP server emulator accessed from all the tested Web browser applications. As usual, I'm making the source available (http://www.winmobiletech.com/sekalaiset/CharsetHTTPEmu.java) so that you can freely test it if you prefer.
2.1.1 How my custom server emulator should be used?
The application listens to incoming HTTP requests at port 82. It requires a custom parameter (as extra path info - that is, you don't need to use the ? but / instead) in the following form:
http://127.0.0.1:82/xyz
(change 127.0.0.1 to the Internet address of your desktop PC's address if you want to access it from your PDA)
where
x tells the server emulator to set the ‘charset' attribute of the HTTP response header Content-Type. (Doesn't set it at all if you pass a ‘N' instead.)
y tells the program emulator to set the ‘charset' attribute of the "Content-Type" http-equiv meta tag. (Doesn't set it at all if you pass a ‘N' instead.) This is the only way for a plain user (not having access to the Web server configuration) to set a charset for a given page.
z tells the server what character encoding to use internally. In most cases, you can safely keep it as ‘2' for the test.
All values are one-digit numerals; I've tested the browsers with value ‘1' (Western charset) and ‘2' (Central-European charset). As has already been mentioned, with the first two digits, you can also supply ‘N' (which stands for ‘No', ‘Doesn't exist' or ‘Not known'); then, the given HTTP header / HTTP-Equiv tag won't be set / returned. If you set something other than ‘1' or ‘2' for the ‘z' digit, it'll default back to ‘1'.
For example, if you enter http://127.0.0.1:82/111 in your Internet Explorer browser running on the same machine as the server emulator, you'll see this (http://www.winmobiletech.com/102006Browseri18n/IE7RC1-11N.png).
In here, there are three rows of special interest (the fourth, the date row is only included so that you can be absolutely sure the browser doesn't just returning a cached document):
"8-bit 8859-1-only punctuation marks" contains strictly 8859-1 punctuation marks. You should see real punctuation marks (if you render the page as a 8859-1 page) after this introduction: no squares, no question marks, no nothing.
"Central-European chars (will ONLY work with the third parameter being 2):" contains two Central-European characters. You'll see them rendered in three ways: as question marks (if you pass anything but ‘2' as the third, ‘z' parameter), as http://www.winmobiletech.com/102006Browseri18n/88951-WesternEncoding.png (the closest Western rendering of these characters) and, finally, as http://www.winmobiletech.com/102006Browseri18n/8859-2-Encoding.png, which is how they should be rendered. (Finally, as a - mark (hyphen) in Thunderhawk because it doesn't contain any non-Western character in its custom character set.)
Finally, the third row will be always displayed the same way because it explicitly uses HTML Unicode character entities, which aren't affected by the language setting. That is, it'll be always rendered as http://www.winmobiletech.com/102006Browseri18n/Unicode-Encoding.png
2.2 The chart
In the following chart, I've (keeping ‘z' as ‘2' in all the cases - again, it doesn't have direct affect on any HTTP header or HTML meta-tag, only the way Java encodes the returned contents) tested all the available combinations. This way, I was able to see
whether the HTTP-Equiv meta tag overrides the "Content-Type" HTTP response header (nope, only in NetFront - this is a big difference in NetFront and any other Web browser and should be modified by the NetFront people! Other browsers only take into account the meta tag if there is absolutely no charset parameter in the HTTP Content-Type response header)
if none of the two alternates are used, what charset the browser defaults to (fortunately, Western charset in all cases)
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t2.html)
2.2.1 Explanation for the chart
2.2.1.1 NetFront and Content-Type (charset) overriding: different from the approach of all other browsers!
Again and again, NetFront works differently from all other browsers, header overriding-wise. That is, something that works on a desktop browser or any other Pocket PC browser will not necessarily work on NetFront if the HTML document contains a Content-Type metatag. Again, NetFront will override the encoding of the page with the value found in this metatag, unlike any other browser. This seems to be the reason why several people have reported char encoding problems with NetFront.
It should also be pointed out that the overriding being non-standards-compliance aside (which should be fixed by the Access folks - the developers of NetFront - as soon as possible), NetFront isn't able to display any extended punctuation mark contained in the 8859-1 codepage if you don't use the default (and ugly) Courier New font but switch to a proportional font. No matter what language info you return from the Web server, these characters will just not be displayed. A quick fix for this problem (before the Access folks fix this bug) is forcing the browser to, say, use the Central European (windows-1250 as opposed to 8859-2), Baltic or Greek encoding (but not to UTF-8, which hides all these chars) encoding as can be seen in here (http://www.winmobiletech.com/102006Browseri18n/NF33-NN-ForcedCentralEuMode.bmp.png) (note that now the punctuation is displayed in the background, on the Web page).
2.2.1.1.1 How Web administrators should treat NetFront clients?
This also means if you're a Web hoster / Web author and would like to allow your NetFront users to be able to browse your otherwise Western (8859-1) pages and there aren't any, say, French and Spanish names / texts on the website (with all those funny accented characters), you should consider marking these pages as, say, Central-European (windows-1250) or Baltic if and only if your client using NetFront (you can easily see this by checking for the User-Agent HTTP request header as has already been explained in several of my User-Agent header-related articles; for example, this one (http://www.pocketpcmag.com/blogs/index.php?blog=3&p=796&more=1&c=1&tb=1&pb=1)).
2.2.1.2 How Web administrators should treat Thunderhawk clients?
Incidentally, speaking of server-side User-Agent checking, if you operate a, say, Web site offering content in a Central-European language (that is, a language sufficiently close to Western languages, alphabet-wise; that is, in where using Western characters instead of some special, local characters - for example, using http://www.winmobiletech.com/102006Browseri18n/88951-WesternEncoding.png instead of the "official" http://www.winmobiletech.com/102006Browseri18n/8859-2-Encoding.png), then, upon sensing the client's using Thunderhawk, you may want to force the content encoding back to 8859-1. That is, return the content as plain 8859-1 (Western) document. This will make sure non-Western characters will be converted back to their closest Western equivalent, resulting in a far better user experience.
2.2.1.3 Generic advice for Webadmins of web hosters: NEVER set the charset attribute in the Content-Type HTTP response header!
Also, the results clearly show if you're providing Web hosting service to customers, you should never set the character set the ‘charset' attribute in the Content-Type HTTP response because your customers won't be able to override it with their own encodings. For example, if you're a Central-European Webspace provider and use the local, non-Western language as default charset returned straight in the HTTP header, you'll make all your customers unable to provide content in the Western-European charset.
The situation is the same in the opposite direction: if you set the Western-European charset and you still have for example Russian customers that would like to publish their Russian pages on your server, most Web browsers won't be able to render these, not even if these folks explicitly try to override your encoding settings.
A real-world example: earlier, this all has been a huge problem with my current webspace provider. Due to cost considerations (I didn't want to pay big bucks to host a webpage (http://www.winmobiletech.com/) I don't use for commercial stuff, just as a database back-end for my articles, images and other downloads), I've chosen a (compared to, for example, the Finnish Web hosting fees) very cheap Central-European webspace provider. It, however, before July 2006, also set the above-mentioned ‘charset' attribute to Central-European encoding, which made it impossible to put Word-generated/-exported English language pages on my page without first changing all the non-standard (extended) punctuation marks to their non-extended (and, therefore, less spectacular) counterparts. That is, when I, for example, posted a comparison chart HTML file there (I can't include comparison charts in Pocket PC discussion forums because of the forum restrictions and wide charts in my Smartphone and Pocket PC Magazine Blog (http://www.pocketpcmag.com/blogs/) - in these cases, I must link them from my back-end), I always had to change (with a generic search-replace in, say, Notepad) these characters back to the non-extended version.
Note that you can avoid all this hassle in Microsoft Word, before starting to write your article / post, by disabling the two checkboxes ("Straight quotes" with "smart quotes", and "Hyphens (--) with dash (-)") under Tools / AutoCorrect Options / AutoFormat As You Type (http://www.winmobiletech.com/102006Browseri18n/WordDisableAutoformat.png). If you want to avoid problems with, for example, your NetFront or non-Western readers, make sure you do this (that is, disable the two checkboxes).
Now that a recent Web server update has removed this annoyance, I can post anything without mass-replacing 'special' characters (or disabling AutoCorrect in Word).
All in all: the inability to override the HTTP response header (in everything but NetFront) means a decent Web hoster should never return the character encoding parameter in the HTTP headers if there is a chance users will want to serve pages in a different encoding. Fortunately, the majority of current Web hosters behave this way.
3. Form Posting
The second main problem area is not displaying non-Western characters but posting such content to Web servers via Web forms. In these tests, I used exactly the same input as before (punctuation, non-Western characters) and also added Western accented characters like ä and ö: I posted these characters to the Web server and checked what they became.
I've used two sites for this purpose: my PPCMag blog (http://www.pocketpcmag.com/blogs), which uses 8859-1 (the Western charset) by default, and a Central-European server using the Windows-1250 (Central-European) encoding on all its pages. (The page encoding setting has a direct effect on what is uploaded back to the server from a form.)
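To illustrate why the page encoding matters here, a hedged sketch with Python's urllib (the field name is just an illustrative placeholder): the charset of the page a form sits on determines the bytes the browser percent-encodes into the POST body, and a character the page's charset cannot represent simply has no faithful way onto the wire.

```python
import urllib.parse

# The same field value produces different percent-encoded payloads
# depending on the page's charset:
field = {"comment": "ä and ö"}

# Posted from an ISO-8859-1 (Western) page - one byte per accent:
print(urllib.parse.urlencode(field, encoding="iso-8859-1"))
# comment=%E4+and+%F6

# Posted from a UTF-8 page - two bytes per accented character:
print(urllib.parse.urlencode(field, encoding="utf-8"))
# comment=%C3%A4+and+%C3%B6

# A Central-European character like the Hungarian ő has no
# ISO-8859-1 representation at all - one reason posting such
# characters to a Western server fails or gets them converted:
try:
    urllib.parse.urlencode({"comment": "ő"}, encoding="iso-8859-1")
except UnicodeEncodeError:
    print("ő cannot be carried in an ISO-8859-1 form post as-is")
```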
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t3.html)
As can clearly be seen, the situation isn't at all bad with the three desktop browsers (I've encountered no compatibility issues at all - all special and even Central-European characters were visible in their original - unconverted! - form after posting, even when posted to an 8859-1 server).
With Pocket PC-based browsers, on the other hand, posting special / non-Western characters turned out to be much more problematic, particularly - as opposed to the desktop posting case - when posting to an 8859-1 (Western) server. There, only Thunderhawk was able to upload these characters; all the other clients either didn't upload them at all or converted them.
A server explicitly set to Windows-1250 fared a bit better as far as NetFront is concerned: there, it was able to upload both non-standard Western punctuation and non-Western accents.
Unfortunately, Opera Mobile has never been able to upload any kind of extended Western punctuation. This is a genuine bug that should be fixed.
4. Non-8-bit file formats
In addition to 8-bit file formats (ISO-8859-1, Windows-1250 etc.), there are some other, non-8-bit file formats. One of them is UTF-16, usually labeled simply "Unicode" on Windows; a test page is here (http://www.winmobiletech.com/sekalaiset/i18n/Unicode.html). Another is UTF-8 (test page here (http://www.winmobiletech.com/sekalaiset/i18n/utf-8.html); OK, I know this is, technically, an 8-bit file format, using 2 or 3 bytes for extended, non-ASCII characters. I didn't want to create a separate category for it.)
The former is almost never used on the Web (albeit it's possible that some, say, Chinese or Japanese sites use it); the latter is used pretty extensively in non-Western language areas. Its penetration in Central Europe (excluding languages using Cyrillic characters) may be 10-20%, because the special characters of these languages are easy to map into an 8-bit character table; in languages like Chinese or Japanese, whose writing systems contain Kanji and numerous other special characters, it's around 100%.
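The byte-count trade-off behind those penetration figures can be sketched in a few lines of Python: plain ASCII stays one byte per character in UTF-8 but doubles in UTF-16, while a Kanji costs three UTF-8 bytes versus two in UTF-16 - which is one reason (among others, such as tool defaults) that Western and Central-European sites lean towards UTF-8.

```python
# Bytes per sample string in UTF-8 vs. UTF-16 (without a BOM):
for text in ("hello", "é", "ő", "漢"):
    print(text,
          len(text.encode("utf-8")),      # UTF-8 byte count
          len(text.encode("utf-16-le")))  # UTF-16 byte count
# "hello" (ASCII):               5 bytes in UTF-8, 10 in UTF-16
# "é" (Western accent):          2 bytes in UTF-8,  2 in UTF-16
# "ő" (Central-European accent): 2 bytes in UTF-8,  2 in UTF-16
# "漢" (Kanji):                  3 bytes in UTF-8,  2 in UTF-16
```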
In this test, I checked whether the Pocket PC Web browsers are able to read these pages (see the above two links if you want to give them a try). As can be seen, the situation is pretty good: the common UTF-8 is read by all browsers. IEM and Minimo fail to render Unicode (UTF-16) files, though.
This is, again, not a big problem at all - I've yet to see a Web page that uses UTF-16 instead of UTF-8. Note that this is one of the very slight differences between Minimo and its desktop big brother, Mozilla / Firefox: the latter, like desktop Opera and Internet Explorer, is able to render Unicode files too.
Click here for the chart (http://www.winmobiletech.com/102006Browseri18n/t4.html)