Unicode Cheat Sheet

Unicode (UTF-8) with PHP 5.3, MySQL 5.5 and HTML5 Cheat Sheet


1. Conversion

1.1. How to transform file encoding

Example with PHP files on Linux:

find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o /path/to/utf8_files/{} \;
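
The same conversion can be sketched in PHP for a single string, for example the contents of one file. This is a minimal sketch, assuming the input really is ISO-8859-1; the helper name latin1_to_utf8() is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
// Assumption: the input really is ISO-8859-1. Text that is already UTF-8
// would be double-encoded by this call.
function latin1_to_utf8($bytes)
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// "\xF1" is "ñ" in ISO-8859-1; in UTF-8 it becomes the two bytes C3 B1.
$utf8 = latin1_to_utf8("ma\xF1ana");
echo bin2hex($utf8), "\n"; // 6d61c3b1616e61
```

For a whole file, the string could come from file_get_contents() and be written back with file_put_contents().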

1.2. How to transform character encoding in MySQL databases

Procedure [1] (use the INFORMATION_SCHEMA database to build a script automatically):

  • Create a temporary, identical structure in a new database.
  • Copy all data to that structure.
  • Drop the initial structure.
  • Recreate it with the new character encoding:
    • CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
  • Check the maximum length of columns and index keys [2].
  • Copy all data from the temporary structure to the new structure, converting all text to the new encoding.
  • Finally, drop the temporary structure.

2. Configuration

2.1. HTTP and HTML

In php.ini [1]:

default_charset = UTF-8

or in httpd.conf or .htaccess [5]:
AddDefaultCharset UTF-8

or in the PHP code [5]:
header('Content-type: text/html; charset=UTF-8');

Additionally, put this in your HTML <head> block:
<meta charset="UTF-8"/>
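
As an illustration, the HTTP header and the meta element can be emitted together from PHP. This is only a sketch: the helper name utf8_page() and the page content are hypothetical.

```php
<?php
// Hypothetical helper: build a minimal HTML5 page that declares UTF-8.
function utf8_page($title, $bodyHtml)
{
    // Escape the title with an explicit charset so UTF-8 bytes pass through intact.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n"
         . "<meta charset=\"UTF-8\"/>\n"
         . "<title>" . $safeTitle . "</title>\n"
         . "</head>\n<body>\n" . $bodyHtml . "\n</body>\n</html>\n";
}

// Declare the encoding in the HTTP header as well, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo utf8_page("Ma\xC3\xB1ana", '<p>ok</p>');
```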

2.2. PHP

In php.ini [1.1]:

mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On
mbstring.http_input = auto
mbstring.http_output = UTF-8
mbstring.detect_order = auto
mbstring.substitute_character = none

or in httpd.conf or .htaccess:
php_value <php.ini directive> <value>

or in the PHP code [3] [1]:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_detect_order('auto');
mb_substitute_character('none');

2.3. Verifications [4]

Run this small PHP script:

<?php
 
if ( ! extension_loaded('mbstring'))
  die('mb functions not loaded');
 
if (1 != preg_match('/^.{1}$/u', "ñ", $UTF8_ar))
  die('PCRE is not compiled with UTF-8 support');
 
exit('ok');
 
?>

3. MySQL Code

3.1. MySQL

Right after each connection, call [5] [2]:

SET NAMES 'utf8';
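
With PDO, the connection character set can also be requested through the charset element of the MySQL DSN (honoured from PHP 5.3.6 on). The sketch below is an assumption-laden example: the helper names, host, database name and credentials are all hypothetical.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that requests a connection charset.
function mysql_utf8_dsn($host, $dbname, $charset = 'utf8')
{
    return 'mysql:host=' . $host . ';dbname=' . $dbname . ';charset=' . $charset;
}

// Hypothetical helper: open the connection and also issue SET NAMES explicitly,
// which covers PHP versions that ignore the DSN charset element.
// Only the values used in this sketch are accepted, since SET NAMES
// cannot take a bound parameter.
function open_utf8_connection($host, $dbname, $user, $password, $charset = 'utf8')
{
    if (!in_array($charset, array('utf8', 'utf8mb4'), true)) {
        throw new InvalidArgumentException('Unsupported charset: ' . $charset);
    }
    $pdo = new PDO(mysql_utf8_dsn($host, $dbname, $charset), $user, $password);
    $pdo->exec("SET NAMES '" . $charset . "'");
    return $pdo;
}

echo mysql_utf8_dsn('localhost', 'example_db'), "\n";
// mysql:host=localhost;dbname=example_db;charset=utf8
```

Pass 'utf8mb4' as $charset if the tables use that character set.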

3.2. Ordering in MySQL

Ordering in MySQL depends on the collation you choose. Detailed information about this subject may be found in the documentation on MySQL.com [2]. Look especially at the Unicode Character Sets section: http://dev.mysql.com/doc/refma[..]et-unicode-sets.html.

3.3. Stored Procedures and Functions

Old function:

CREATE FUNCTION example_function (
  parameter_name VARCHAR(255)
)
RETURNS VARCHAR(255)
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255);
  ...
  RETURN data;
END;

New function:

CREATE FUNCTION example_function (
  parameter_name VARCHAR(255) CHARACTER SET utf8
)
RETURNS VARCHAR(255) CHARACTER SET utf8
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255) CHARACTER SET utf8;
  ...
  RETURN data;
END;



4. PHP Code

4.1. Multibyte string functions [1]


Replace | With | Notes
strlen() | mb_strlen() | Number of characters
strlen() | strlen() | Number of bytes
strlen() | mb_strwidth() | Display width in a monospaced font
substr() | mb_substr()
strstr(), stristr() | mb_strstr(), mb_stristr()
strrchr() | mb_strrchr() [6]
strpos(), stripos(), strrpos(), strripos() | mb_strpos(), mb_stripos(), mb_strrpos(), mb_strripos()
strtolower(), strtoupper() | mb_strtolower(), mb_strtoupper()
substr_count() | mb_substr_count() [7]
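
The difference between the byte-oriented and the multibyte functions can be seen in a short sketch. The sample strings are written as explicit UTF-8 byte escapes so the example does not depend on the encoding of the source file.

```php
<?php
mb_internal_encoding('UTF-8');

$word  = "ma\xC3\xB1ana";             // "mañana" as UTF-8 bytes
$kanji = "\xE6\x97\xA5\xE6\x9C\xAC";  // "日本" as UTF-8 bytes

echo strlen($word), "\n";             // 7 (bytes)
echo mb_strlen($word), "\n";          // 6 (characters)
echo mb_strwidth($kanji), "\n";       // 4 (two full-width characters)
echo mb_substr($word, 2, 1), "\n";    // ñ
echo mb_strpos($word, 'a', 2), "\n";  // 3 (character offset, not byte offset)
echo mb_strtoupper($word), "\n";      // MAÑANA
```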

4.2. String access by character [1]

Search for every use of curly or square brackets to extract a single character from a string:
$string{$position} // old syntax
$string[$position] // new syntax

Regular expressions to find them:
/\$\w(\w|\d)*\{(\d+|\$\w(\w|\d)*)\}/
/\$\w(\w|\d)*\[(\d+|\$\w(\w|\d)*)\]/

Replace
$char = $string{$pos};

with
$char = mb_substr($string, $pos, 1);

Replace
$string{$pos} = $char;

with
$string = mb_substr($string, 0, $pos). $char. mb_substr($string, $pos + 1);
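
The two replacements can be wrapped in small helpers so that call sites stay readable. This is a minimal sketch; the names char_at() and set_char_at() are hypothetical, and the encoding is passed explicitly rather than taken from the internal encoding.

```php
<?php
// Hypothetical helper: read the character (not the byte) at position $pos.
function char_at($string, $pos)
{
    return mb_substr($string, $pos, 1, 'UTF-8');
}

// Hypothetical helper: return a copy of $string with the character at $pos replaced.
function set_char_at($string, $pos, $char)
{
    // Passing the full length as the third argument means "up to the end".
    $length = mb_strlen($string, 'UTF-8');
    return mb_substr($string, 0, $pos, 'UTF-8')
         . $char
         . mb_substr($string, $pos + 1, $length, 'UTF-8');
}

$word = "ma\xC3\xB1ana";                 // "mañana" as UTF-8 bytes
echo char_at($word, 2), "\n";            // ñ
echo set_char_at($word, 2, 'n'), "\n";   // manana
```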

4.3. UTF-8-safe functions [8]


addslashes()
bin2hex()
explode() [4]
implode()
nl2br()
stripslashes()
strip_tags()
str_repeat()
str_replace() [4]

4.4. Escaping functions

The functions htmlentities() [9] and htmlspecialchars() both take a third parameter that names the character set used during conversion. Unlike the multibyte functions (mb_*()), they ignore the internal encoding: the third parameter must be passed explicitly for any character set other than 'ISO-8859-1' (the default in PHP 5.3).


The functions urlencode() and rawurlencode() do not have any character encoding parameter. The safest solution is to put your UTF-8 strings in session variables instead of URL arguments.
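
A small wrapper helps to avoid forgetting the character set argument. This is a sketch only; the name html_escape() is hypothetical. The last line shows that rawurlencode() simply percent-encodes each byte of a UTF-8 string.

```php
<?php
// Hypothetical wrapper so the charset argument is never forgotten.
function html_escape($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// The markup characters are escaped; the UTF-8 bytes of "ñ" are left untouched.
echo html_escape("<b>ma\xC3\xB1ana</b>"), "\n"; // &lt;b&gt;mañana&lt;/b&gt;

// rawurlencode() works byte by byte, so each UTF-8 byte is percent-encoded.
echo rawurlencode("\xC3\xB1"), "\n"; // %C3%B1
```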

4.5. Comparing strings and sorting arrays

Use the Collator class:
http://www.php.net/manual/en/class.collator.php
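
A minimal sketch of locale-aware sorting follows. It assumes the intl extension may or may not be installed; the helper name sort_words() and the 'en_US' locale are illustrative choices, not recommendations from the sources.

```php
<?php
// Hypothetical helper: sort strings with locale-aware rules when the intl
// extension is available, falling back to a byte-wise sort otherwise.
function sort_words(array $words, $locale = 'en_US')
{
    if (class_exists('Collator')) {
        $collator = new Collator($locale);
        $collator->sort($words);
    } else {
        sort($words, SORT_STRING);
    }
    return $words;
}

$words = array("zebra", "\xC3\xA9clair", "apple"); // "éclair" as UTF-8 bytes
print_r(sort_words($words));
```

With the Collator, "éclair" is placed between "apple" and "zebra"; the byte-wise fallback puts it last, because its first byte is greater than "z".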

4.6. SimpleXML

SimpleXML uses UTF-8 internally and converts all XML content to UTF-8 [1.2], so usually nothing needs to be done.

4.7. PCRE functions [1]

Search for all PCRE function calls (preg_*) and append the u pattern modifier to each pattern [10].
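
The effect of the modifier can be checked with a two-byte character. In this sketch, the sample strings are written as UTF-8 byte escapes.

```php
<?php
$enye = "\xC3\xB1"; // "ñ" as UTF-8 bytes

// Without the u modifier the dot matches a single byte, so a two-byte
// character is not seen as one character.
var_dump(preg_match('/^.$/', $enye));  // int(0)
var_dump(preg_match('/^.$/u', $enye)); // int(1)

// With the u modifier, an empty pattern splits the string into characters.
$chars = preg_split('//u', "ma\xC3\xB1ana", -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n"; // 6
```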

4.8. Storable representation of variables

The serialize() and unserialize() functions can be used transparently. However, be careful when reading or writing serialized UTF-8 strings from languages other than PHP [4].
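
One reason for caution is visible in the serialized form itself: the length prefix counts bytes, not characters. A short sketch:

```php
<?php
$enye = "\xC3\xB1"; // "ñ" as UTF-8 bytes

// The length prefix written by serialize() counts bytes, not characters.
echo serialize($enye), "\n"; // s:2:"ñ";

// The round trip within PHP is lossless.
var_dump(unserialize(serialize($enye)) === $enye); // bool(true)
```

A reader written in another language that interprets the prefix as a character count would read past the end of the string.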



4.9. String functions that are problematic and for which there is no built-in replacement function


Replace | With a function from the PHP UTF8 Library [11] [3] | Comment
ord($chr) [12] | utf8_ord($chr)
sprintf() | (none) | The x and X type specifiers could be an issue, according to [4].
str_ireplace($search, $replace, $subject [, &$count]) | utf8_ireplace($search, $replace, $subject [, &$count]) | Alternatively, write your own implementation using preg_replace().
str_pad($input, $length, $padStr, $type) | utf8_str_pad($input, $length, $padStr, $type)
str_split($str, $split_len) | utf8_str_split($str, $split_len) | Alternatively, use this function: http://www.php.net/manual/ref.mbstring.php#95192
strcasecmp($str1, $str2) | (none) | Write your own implementation using collator_compare() and mb_strtolower().
strncmp($str1, $str2, $len) | (none) | Cut the two strings at the specified length, and use collator_compare().
strncasecmp($str1, $str2, $len) | (none) | Write your own implementation using your replacement of strncmp() and mb_strtolower().
strspn($str1, $str2[, $start[, $len]]), strcspn($str1, $str2[, $start[, $len]]) | utf8_strspn($str1, $str2[, $start[, $len]]), utf8_strcspn($str1, $str2[, $start[, $len]])
strrev($string) | utf8_strrev($string)
strtr() | (none) | This function doesn't work if any parameter is UTF-8. Write your own implementation.
substr_replace() | utf8_substr_replace()
trim($str, $charlist), ltrim($str, $charlist), rtrim($str, $charlist) | utf8_trim($str, $charlist), utf8_ltrim($str, $charlist), utf8_rtrim($str, $charlist) | The original functions trim(), ltrim() and rtrim() are UTF-8-safe as long as the 2nd parameter is not used [4].
ucfirst($str), ucwords($str) | utf8_ucfirst($str), utf8_ucwords($str)
wordwrap() | (none) | Write your own implementation.
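
Where writing your own implementation is suggested, mbstring and PCRE are usually enough. The functions below are hypothetical stand-ins (my_strrev(), my_ucfirst(), my_str_split()) that sketch the idea; they are not the PHP UTF-8 library functions and work on code points, so combining character sequences are not kept together.

```php
<?php
// Hypothetical stand-in for strrev(): reverse by character, not by byte.
function my_strrev($string)
{
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($chars));
}

// Hypothetical stand-in for ucfirst(): upper-case the first character only.
function my_ucfirst($string)
{
    $length = mb_strlen($string, 'UTF-8');
    return mb_strtoupper(mb_substr($string, 0, 1, 'UTF-8'), 'UTF-8')
         . mb_substr($string, 1, $length, 'UTF-8');
}

// Hypothetical stand-in for str_split(): chunks of $splitLength characters.
function my_str_split($string, $splitLength = 1)
{
    if ($splitLength < 1) {
        return false; // same convention as a failed call; avoids an endless loop
    }
    $parts = array();
    $length = mb_strlen($string, 'UTF-8');
    for ($i = 0; $i < $length; $i += $splitLength) {
        $parts[] = mb_substr($string, $i, $splitLength, 'UTF-8');
    }
    return $parts;
}

echo my_strrev("ma\xC3\xB1ana"), "\n";          // anañam
echo my_ucfirst("\xC3\xB1and\xC3\xBA"), "\n";   // Ñandú
print_r(my_str_split("ma\xC3\xB1ana", 2));      // ma, ña, na
```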

5. Credits

5.1. Sources

[1] PHP.net documentation
[1.1] PHP.net, Multibyte String Runtime Configuration, http://www.php.net/manual/en/mbstring.configuration.php
[1.2] A comment about SimpleXML on PHP.net: http://www.php.net/manual/en/ref.simplexml.php#79258
[2] MySQL.com documentation
[3] Harry Fuecks, PHP UTF-8 library, http://sourceforge.net/projects/phputf8
[4] Web Application Component Toolkit, Handling UTF-8 with PHP, http://www.phpwact.org/php/i18n/utf-8
[5] W3C, Setting the HTTP charset parameter, http://www.w3.org/International/O-HTTP-charset.php

5.2. Author and Copyright

Copyright © François Cardinaux 2011


5.3. License

Creative Commons Attribution-Non-Commercial-Share Alike 3.0


Unicode (UTF-8) with PHP 5.3, MySQL 5.5 and HTML5 Cheat Sheet, version 1.0 Date: 2011-05-12


Footnotes:

[1]
Trying to change the character encoding of TEXT, CHAR and VARCHAR fields directly with an ALTER TABLE will corrupt existing data.
[2]
https://dev.mysql.com/doc/refm[..]code-conversion.html
[3]
Some php.ini directives cannot be modified in the PHP code.
[4]
Source: utf8.php in PHP UTF-8 library [3]
[5]
Once this is done, PHP sees MySQL databases as if each TEXT, CHAR or VARCHAR field were encoded in UTF-8, no matter what the actual encoding is. Thus, there is no need to prepare the encoding of query parameters or to convert the results in PHP.
[6]
Note that the mb_strrchr() function has one additional argument, which may be ignored since we just want to adapt existing function calls. Note also that there is a mb_strrichr() function, which has no equivalent among the standard PHP functions.
[7]
Be careful because the 3rd and 4th arguments of substr_count() no longer exist with mb_substr_count(). You can use mb_substr() to circumvent this limitation.
[8]
The strcmp() function is UTF-8 safe as well. However, to perform a locale-aware comparison, use Collator::compare instead: http://www.php.net/manual/en/collator.compare.php
[9]
The function htmlentities() converts Latin characters only [todo source]. Moreover, according to Handling UTF-8 with PHP [4]: “Using [htmlentities] on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output”, and “when using UTF-8, you don’t need entities”.
[10]
However, there may still be problems, as explained in Handling UTF-8 with PHP [4].
[11]
Version 0.5
[12]
Underlined parameters fail if they are UTF-8-encoded
