====== Character Sets / Character Encoding Issues ======
===== Introduction =====
Let's first define some terms to make it easier to understand the following sections (taken from the book [[http://www.amazon.com/exec/obidos/tg/detail/-/0672320967/qid=1113818919/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/102-0603664-5628167?v=glance&s=books&n=507846|XML Internationalization and Localization]]). See also the introductory [[php:i18n|WIKI page on i18n]].
A **character** is the smallest component of written language that has a semantic value. Examples of characters are letters, ideographs (e.g. Chinese characters), punctuation marks, digits etc.
A **character set** is a group of characters without associated numerical values. An example of a character set is the Latin alphabet or the Cyrillic alphabet.
**Coded character sets** are character sets in which each character is associated with a scalar value: a **code point**. For example, in ASCII, the uppercase letter "A" has the value 65. Examples for coded character sets are ASCII and Unicode. A coded character set is meant to be **encoded**, i.e. converted into a digital representation so that the characters can be serialized in files, databases, or strings. This is done through a **character encoding scheme** or **encoding**. The encoding method maps each character value to a given sequence of bytes.
In many cases, the encoding is just a direct projection of the scalar values, and there is no real distinction between the coded character set and its serialized representation. For example, in ISO 8859-1 (Latin 1), the character "A" (code point 65) is encoded as a byte 0x41 (i.e. 65). In other cases, the encoding method is more complex. For example, in UTF-8, an encoding of Unicode, the character "á" (225) is encoded as two bytes: 0xC3 and 0xA1.
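To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch of the UTF-8 mapping in plain PHP. The function name ''codepoint_to_utf8'' is a hypothetical helper, not part of PHP or of the original page, and it assumes a valid Unicode scalar value as input (surrogates and out-of-range values are not rejected).

```php
<?php
// Hypothetical helper (an illustration, not a library function):
// encode one Unicode code point as its UTF-8 byte sequence.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        // 1 byte: identical to the ASCII value
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(65)), "\n";  // 41   ("A", same byte as in ASCII / Latin 1)
echo bin2hex(codepoint_to_utf8(225)), "\n"; // c3a1 ("á", two bytes)
```

For "A" the encoded byte equals the code point, which is the "direct projection" case; for "á" the value 225 is spread over two bytes.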
===== Unicode and its encodings =====
For Unicode (also called Universal Character Set or UCS), a coded character set developed by the Unicode Consortium, there are several possible encodings: UTF-8, UTF-16, and UTF-32. Of these, UTF-8 is the most relevant for a web application.
==== UTF-8 ====
UTF-8 is a multibyte 8-bit encoding in which each Unicode scalar value is mapped to a sequence of one to four bytes.
One of the main advantages of UTF-8 is its compatibility with ASCII. If no extended characters are present, there is no difference between a document encoded in ASCII and one encoded in UTF-8.
One thing to take into consideration when using UTF-8 with PHP is that characters are represented with a varying number of bytes. Some PHP functions do not take this into account and will not work as expected (more on this below).
See also [[php:i18n:utf-8]]
===== PHP and character sets =====
This page is going to assume you've done a little reading and absorbed some paranoia about the issue of character sets and character encoding in web applications. If you haven't, try [[http://www.joelonsoftware.com/articles/Unicode.html|here]];
> "When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough."
"Darn near impossible" is perhaps too extreme but, certainly in PHP, if you simply "accept the defaults" you probably will end up with all kinds of strange characters and question marks the moment anyone outside the US or Western Europe submits some content to your site
This page won't rehash existing discussions. Suffice to say you should be thinking in terms of Unicode, the grand unified solution to all character issues, and in particular UTF-8, a specific encoding of Unicode and the best solution for PHP applications.
==== Everybody Gets it wrong ====
Just so you don't get the idea that only "serious programmers" can understand the problem, and as a taster for the type of problems you can have, right now (i.e. they may fix it later) on IBM's new [[http://www-128.ibm.com/developerworks/blogs/dw_blog.jspa?blog=481&ca=drs-bl|PHP Blog @ developerworks]], here's what you see if you right click > Page Info in Firefox;
{{php:i18n:ibmpageinfo.png}}
Firefox says it regards the character encoding as being [[wp>ISO-8859-1]] ((ISO-8859-1 is basically everything you need to write English and "Western European" languages and is very commonly used on the web still, despite UTF-8.)). That's actually coming from an HTTP header - if you click on the "Headers" tab you see;
Content-Type: text/html;charset=ISO-8859-1
Meanwhile, amongst the HTML meta tags (scroll down past the whitespace), you find;
Now that's not a train smash (yet) but it should raise the flag that something isn't quite right. The meta tag will be ignored by browsers so content will be regarded as being encoded as ISO-8859-1, thanks to the HTTP header.
This raises the question: how is the content on the blog //actually// encoded? If that ''Content-Type: text/html;charset=ISO-8859-1'' header is also turning up in a form that writers on the blog use to submit content, it will probably mean the content being stored will have been encoded as ISO-8859-1. If that's the case, the real problem will rear its head in the blog's [[http://www-128.ibm.com/developerworks/blogs/dw_blog_rss.jspa?ca=drs-&blog=481|RSS Feed]], which currently does not specify the charset with an HTTP header - just that it's XML;
Content-Type: text/xml
...//but// does declare UTF as the encoding in the XML content itself;
Anyone subscribed to this feed is going to see some weird characters appearing, should the blog contain anything but pure ASCII characters, because there's a very good chance the content is actually stored as ISO-8859-1, the guess here being that the "back end" content admin page (containing a form for adding content) is also telling the browser it's ISO-8859-1.
Hopefully, by the time you've read this document, you'll understand what exactly is going wrong here and why.
===== PHP's Problem with Character Encoding =====
The basic problem PHP has with character encoding is that it has a very simple idea of what a character is: one character equals one byte. To be more precise, most of PHP's [[http://www.php.net/strings|string related functionality]] (see [[php:i18n:charsets#common_problem_areas_with_utf-8]] for further details) makes this assumption, but to support a wide range of characters (or all characters, ever, as Unicode does) you need more than one byte to represent a character.
An example in code: in his [[http://intertwingly.net/stories/2004/04/14/i18n.html|i18n Survival Guide]], Sam Ruby recommends using the string Iñtërnâtiônàlizætiøn for testing. Counted by eye, it contains 20 characters;
Iñtërnâtiônàlizætiøn
12345678901234567890
But counted with PHP's [[phpfn>strlen]] function...
PHP will report ''27'' characters. That's because the string, encoded as UTF-8, contains multi-byte characters which PHP's [[phpfn>strlen]] function will count as being multiple characters.
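The difference between the two counts can be reproduced with a sketch like the following. It assumes the source file is saved as UTF-8 and that the string is well formed; ''utf8_char_count'' is a hypothetical helper written for this illustration, not a PHP built-in.

```php
<?php
// Assumes this file itself is saved as UTF-8.
$str = 'Iñtërnâtiônàlizætiøn';

// Hypothetical helper: count characters in well formed UTF-8 by
// skipping continuation bytes (those of the form 10xxxxxx).
function utf8_char_count($s)
{
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo strlen($str), "\n";          // 27 (bytes)
echo utf8_char_count($str), "\n"; // 20 (characters)
```

The seven accented characters each take two bytes, which accounts for the 7 extra "characters" [[phpfn>strlen]] reports.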
Life gets even more interesting if you run the following((Note you should be using a text editor capable of encoding PHP source files as UTF-8 - see [[php:i18n:charsets#useful_tools]]));
You should see something like;
Iñtërnâtiônà lizætiøn
123456789012345678901234567
Which gives you an idea of what PHP's string related functionality actually "sees" ((we're talking loose definitions here for humans to grasp - PHP's internal string representations are ultimately "zeros and ones")) when working with this string.
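One way to produce that byte-per-character view yourself is sketched below. This is an assumption about how to illustrate the point, not necessarily the code the page originally ran: ''bytes_as_latin1'' is a hypothetical helper that treats every byte as a separate ISO-8859-1 character and re-encodes it as UTF-8 so the result can be displayed on a UTF-8 page or terminal.

```php
<?php
// Assumes this file itself is saved as UTF-8.
$str = 'Iñtërnâtiônàlizætiøn';

// Hypothetical helper: interpret each byte as one ISO-8859-1
// character and re-encode that character as UTF-8 for display.
function bytes_as_latin1($s)
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        $out .= ($b < 0x80)
            ? chr($b)                                           // ASCII passes through
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));  // U+0080..U+00FF as two bytes
    }
    return $out;
}

// Prints the 27 "characters" PHP's byte-oriented functions work with.
echo bytes_as_latin1($str), "\n";
```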
The bottom line is all those string functions you've happily littered all over your code, plus a bunch of other stuff like your use of [[http://www.php.net/pcre|regular expressions]], are now in doubt. Is there a character set issue lurking in there, ready to spray strange characters all over your content? The good news is it's really not a big jump to being able to support any and all characters, so long as you make use of UTF-8.
One important point (and more good news) which may not be obvious is PHP doesn't attempt to convert / massage the contents of strings. Even though its string capabilities don't "understand" anything other than 1 character = 1 byte, PHP won't "mess" with the encoding, leaving it "as is" ((there are exceptions to this of course. PHP's string functions are "generally safe", depending on what you're doing. You need to be careful with [[phpfn>strtoupper]] and [[phpfn>strtolower]], for example, which are "locale aware" and could mistake UTF-8 characters for those in the current locale. Also the ''\w'' meta character in the [[http://www.php.net/pcre|PCRE]] regular expression extensions is locale dependent unless the /u modifier is used - see what references [[http://lxr.php.net/ident?i=determine_charset|determine_charset]])). That means, for example;
$some_utf8 = $_POST['comment'];
echo 'Foo '.$some_utf8.' bar'; # note this is VERY bad security - XSS!
$utf8_words = array('Iñtërnâtiônàlizætiøn', 'foo', 'Iñtërnâtiônàlizætiøn');
$utf8_words = implode(' ',$utf8_words);
$utf8_string = 'Iñtërnâtiônàlizætiøn';
print_r(explode('i',$utf8_string));
None of the above will "damage" or alter the character encoding. PHP just passes the strings through blindly.
One more interesting example;
$utf8_string = 'Iñtërnâtiônàlizætiøn';
print_r(explode('à',$utf8_string));
Although it's passing the string ''à'' as the separator to [[phpfn>explode]], because //well formed// UTF-8 has the property that every sequence is unique, there's no chance the ''à'' will be mistaken for another character, so we can safely explode the string using it.
What may also be a little confusing is PHP scripts themselves can contain more or less any sort of encoding - the PHP //parser// is generally fine with this, although you need to be careful when it comes to the [[wp>Byte_Order_Mark|byte order mark]] (BOM) - see [[http://www.latext.com/pm/comments/A1278_0_1_0_C/|Unicode, WordPress, Panther Server and BBEdit: UTF-8 with or without BOM]]
==== But what about mbstring, iconv etc.? ====
Yep, there are PHP extensions to help with character encoding issues but (if you use a shared host, you've probably already got that sinking feeling) they're not enabled by default in PHP4. Two of particular note;
* iconv: The [[http://www.php.net/iconv|iconv]] extension became a default part of PHP5 but it doesn't offer you a magic wand that will make all problems go away. It probably has most value either when migrating old content to UTF-8 or when interfacing with systems that can't deliver you US-ASCII, ISO-8859-1 or UTF-8((the modern browsers all do a good job with UTF-8 and support many other character sets as well - they can be more or less trusted to get it right)), such as an RSS feed read by your PHP script which is encoded with [[http://www.yale.edu/chinesemac/pages/charset_encoding.html#Big5|BIG5]].
* mbstring: The [[http://www.php.net/mbstring|mbstring]] extension //is// potentially a magic wand, as it provides a mechanism to override a large number of PHP's string functions. Bad news is it's not available by default in PHP. Third-hand reports say it used to be pretty unstable but in the last year or so has stabilized (more detail appreciated).
It may be you can take advantage of these extensions in your own environment but if you're writing software for other people to install for themselves, that makes them bad news.
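As a rough sketch of the migration use case, the following converts a legacy ISO-8859-1 string to UTF-8 with [[phpfn>iconv]]. The wrapper name ''to_utf8'' is hypothetical and exists only for this illustration; the guard reflects the point above that the extension may not be available everywhere.

```php
<?php
// Hypothetical wrapper around the real iconv() function. Returns
// false when the iconv extension is missing or conversion fails.
function to_utf8($text, $from)
{
    if (!function_exists('iconv')) {
        return false;
    }
    return iconv($from, 'UTF-8', $text);
}

// "café" as stored by a legacy ISO-8859-1 system: é is the single byte 0xE9.
$latin1 = "caf\xE9";
$utf8 = to_utf8($latin1, 'ISO-8859-1');

if ($utf8 !== false) {
    echo bin2hex($utf8), "\n"; // 636166c3a9 (é is now the two bytes C3 A9)
}
```

The same call with a different source charset name would apply to other legacy encodings, provided the underlying iconv implementation supports them.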
==== And PHP 6? ====
Then all our problems magically vanish ;-) Specifically PHP 6 should have native understanding of Unicode and default to UTF-8 for output as well as a bunch of other stuff, building on the [[wp>International_Components_for_Unicode]] project.
===== Strategy for Handling Character Encoding in PHP Applications =====
So what do you do when the tools you have for the job (PHP) don't provide the facilities you need? You make it someone else's problem. In fact some//thing// else's - the web browser. Firefox and IE (and no doubt Konqueror/Safari as well, but I can't speak first hand) have excellent support for many different character sets, the most important being UTF-8. All you have to do is tell them "everything is UTF-8" and your problem goes away (well almost).
==== Why UTF-8? ====
What makes UTF-8 special is, first, that it's an encoding of Unicode and, second, that it's backwards compatible with ASCII. From [[http://www.cs.tut.fi/~jkorpela/chars.html#utf|here]];
> Character codes less than 128 (effectively, the ASCII repertoire) are presented "as such", using one octet for each code (character) All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to six octets, each of which is in the range 128 - 255. This means that in a sequence of octets, octets in the range 0 - 127 ("bytes with most significant bit set to 0") directly represent ASCII characters, whereas octets in the range 128 - 255 ("bytes with most significant bit set to 1") are to be interpreted as really encoded presentations of characters.
There's some important consequences of this;
- if you have some text encoded only as ASCII, you can immediately declare it as UTF-8 without needing to convert it
- there's zero likelihood, when doing things like searching a UTF-8 string with PHP's string functions, that anything that's not an ASCII character could be mistaken for an ASCII character. So ''%%strpos($utf_string,'<');%%'' won't mistake any other characters, split into their bytes, as being a ''<'' character.
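The second consequence can be checked with a short sketch (assuming the source file is saved as UTF-8; the sample string is made up for the illustration):

```php
<?php
// Assumes this file itself is saved as UTF-8.
$utf8 = 'Iñtërnâtiônàlizætiøn <b>bold</b>';

// Searching for an ASCII character with a byte-oriented function is
// safe: every byte of a multi-byte UTF-8 character is >= 0x80, so
// none of them can match '<' (0x3C).
$pos = strpos($utf8, '<');

echo $pos, "\n";                // 28
echo substr($utf8, $pos), "\n"; // <b>bold</b>
```

Note that the position returned is a //byte// offset (27 bytes of text plus one space), not a character offset, which is fine as long as it is only passed on to other byte-oriented functions such as [[phpfn>substr]].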
If you're not sure which characters ASCII represents, try [[http://www.lookuptables.com/|here]]
A further special feature of UTF-8 is that in //well formed// UTF-8, no character can be mistaken for another. Put another way, if you have a character that takes four bytes to represent and chop off the last two bytes of that sequence, it cannot be mistaken for another character. Each sequence of bytes starts with an identifier byte using a value which only appears in identifier bytes. It's easiest to see by examining [[http://en.wikipedia.org/wiki/UTF-8#Description|this table]].
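The identifier byte idea can be sketched in a few lines. ''utf8_sequence_length'' is a hypothetical helper for this illustration; it only inspects the bit pattern of a single byte and does not reject overlong or otherwise invalid sequences.

```php
<?php
// Hypothetical helper: how many bytes does the sequence starting
// with this byte claim to have? Returns 0 for a continuation byte
// (10xxxxxx) or any byte that cannot start a sequence.
function utf8_sequence_length($byte)
{
    $b = ord($byte);
    if ($b < 0x80)            return 1; // 0xxxxxxx: ASCII
    if (($b & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($b & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($b & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;
}

// U+1D11E (musical symbol G clef) takes four bytes in UTF-8.
$clef    = "\xF0\x9D\x84\x9E";
$chopped = substr($clef, 0, 2); // chop off the last two bytes

echo utf8_sequence_length($clef[0]), "\n";    // 4: the first byte announces four bytes
echo strlen($chopped), "\n";                  // 2: fewer than announced, so incomplete
echo utf8_sequence_length($chopped[1]), "\n"; // 0: a continuation byte never starts a character
```

Because the first byte announces the full length and the remaining bytes can only be continuation bytes, the chopped sequence is detectably incomplete rather than readable as some other character.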
Note the "//well formed//" above. You might also have badly formed UTF-8 and it may be important, in some instances to check for this. [[http://hsivonen.iki.fi/php-utf8/|This PHP library]] is probably the best way to check, being strict and fast. More on validation below.
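If pulling in a library is not an option, a quick check is possible with the PCRE extension, sketched below. This is an alternative to the library linked above, not a description of how that library works: it relies on [[phpfn>preg_match]] refusing to run a pattern with the ''/u'' modifier against a subject that is not valid UTF-8, and assumes PCRE was built with UTF-8 support. ''is_valid_utf8'' is a hypothetical name.

```php
<?php
// Hypothetical helper: with the /u modifier, preg_match() returns
// false (rather than 0 or 1) when the subject is not valid UTF-8.
// The empty pattern matches any valid string, including ''.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

// Assumes this file itself is saved as UTF-8.
var_dump(is_valid_utf8('Iñtërnâtiônàlizætiøn')); // bool(true)
var_dump(is_valid_utf8("\xC3"));                 // bool(false): truncated sequence
var_dump(is_valid_utf8("caf\xE9"));              // bool(false): ISO-8859-1 byte, not UTF-8
```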
===== Practical Issues =====
==== Declaring UTF-8 ====
If you're starting development on a new application / site and currently have no content stored (that might be encoded in something other than ASCII or UTF-8), using UTF-8 is just a matter of informing browsers correctly. OK there's more to it than that, depending on what you're actually going to //do// with data you get from a browser, e.g. parsing it, but the first step is letting browsers know.
That can be done by sending the following HTTP header;
// Setting the Content-Type header with charset
header('Content-Type: text/html; charset=utf-8');
**Note:** the charset value should be case insensitive - browsers shouldn't care.
An alternative (which, for overkill, you might want to use as well) is the HTML meta tag equivalent to the Content-Type HTTP header;
**Note:** you should use this header as early as possible in the