Recent changes RSS feed
 

MySQL and UTF-8

Notes

Good support from 4.1

utf-8 is utf8 in MySQL.

A collation defines the sort order for the data, it may be case sensitive or not

To find out your current setup:

SHOW VARIABLES LIKE 'character_set_database';

SHOW VARIABLES LIKE 'character_set_client';

To see available character sets and collations on your database:

SHOW CHARACTER SET;

SHOW COLLATION LIKE 'utf8%';

Character set and collation can be set per server, database, table, connection;

Server (/etc/my.cnf):

[mysqld]
 ... 
default-character-set=utf8 
default-collation=utf8_general_ci 

Database:

(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8

Table:

(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8

Connection:

SET NAMES 'utf8';

A PHP mysql connection (not totally confirmed, but see tests below) defaults to a latin1 connection, so, your first query after connection should be:

mysql_query("SET NAMES 'utf8'");

In php versions 5.2 and later, use

mysql_set_charset('utf8',$conn); 

The CONVERT() function can convert between charsets, eg:

INSERT INTO utf8table (utf8column) 
SELECT CONVERT(latin1field USING utf8)
FROM latin1table; 

As mentioned in charsets, field widths may need to be increased to deal with multi-byte characters.

Code to generate a mass change of collations:

<?php
 
// this script will output the queries need to change all fields/tables to a different collation
// it is HIGHLY suggested you take a MySQL dump prior to running any of the generated
// this code is provided as is and without any warranty
 
die("Make a backup of your MySQL database then remove this line");
 
set_time_limit(0);
 
// collation you want to change:
$convert_from = 'latin1_swedish_ci';
 
// collation you want to change it to:
$convert_to   = 'utf8_general_ci';
 
// character set of new collation:
$character_set= 'utf8';
 
$show_alter_table = true;
$show_alter_field = true;
 
// DB login information
$username = 'user';
$password = 'pass';
$database = 'table';
$host     = 'localhost';
 
mysql_connect($host, $username, $password);
mysql_select_db($database);
 
$rs_tables = mysql_query(" SHOW TABLES ") or die(mysql_error());
 
print '<pre>';
while ($row_tables = mysql_fetch_row($rs_tables)) {
    $table = mysql_real_escape_string($row_tables[0]);
    
    // Alter table collation
    // ALTER TABLE `account` DEFAULT CHARACTER SET utf8
    if ($show_alter_table) {
        echo("ALTER TABLE `$table` DEFAULT CHARACTER SET $character_set;\r\n");
    }
 
    $rs = mysql_query(" SHOW FULL FIELDS FROM `$table` ") or die(mysql_error());
    while ($row=mysql_fetch_assoc($rs)) {
        
        if ($row['Collation']!=$convert_from)
            continue;
 
        // Is the field allowed to be null?
        if ($row['Null']=='YES') {
            $nullable = ' NULL ';
        } else {
            $nullable = ' NOT NULL';
        }
 
        // Does the field default to null, a string, or nothing?
        if ($row['Default']=='NULL') {
            $default = " DEFAULT NULL";
        } else if ($row['Default']!='') {
            $default = " DEFAULT '".mysql_real_escape_string($row['Default'])."'";
        } else {
            $default = '';
        }
 
        // Alter field collation:
        // ALTER TABLE `account` CHANGE `email` `email` VARCHAR( 50 ) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL
        if ($show_alter_field) {
            $field = mysql_real_escape_string($row['Field']);
            echo "ALTER TABLE `$table` CHANGE `$field` `$field` $row[Type] CHARACTER SET $character_set COLLATE $convert_to $nullable $default; \r\n";
        }
    }
}
 
?>

MySQL tables, columns and connections with UTF-8

What happens when you INSERT a UTF-8 string into a MySQL database using PHP‘s MySQL extension? Well that depends on what string you use. If you use something like “Iñtërnâtiônàlizætiøn” then doing just about anything in MySQL will be data-safe (but, of course, the collation will be incorrect). If you have characters that don’t encode the same in UTF-8 and latin1 (e.g. text in Chinese, Russian, ...) then the behaviour depends on both the table definition and the encoding of the connection to the database.

Table or column charset
latin1 (MySQL default) utf8
default connection mysql_pconnect() data is binary-safe but not encoded properly* data is binary-safe but not encoded properly*
using SET NAMES ‘utf8’; data destroyed; string is converted to “????” works fine!

* It seems that MySQL will look at the data as being a series of bytes all within the latin1 codepage. If you use the same code to read and write the data then it will round-trip fine (but MySQL’s collation will obviously be wrong). If you use tools with a knowledge of the data types used by MySQL, like phpMyAdmin or even mysql_dump, then the wrong encoding is obvious.

(this test was performed with PHP versions 4.4.2 and 5.1.2 and MySQL 5.0.20; PHP source for test)

Further Reading

 
php/i18n/utf-8/mysql.txt · Last modified: 2010/01/14 06:59 by jeff
 
Hosting for this site donated by Procata PHP Development