Unicode Cheat Sheet
Unicode (UTF-8) with PHP 8.0, MariaDB 10.4 / MySQL 8.0 and HTML5 Cheat Sheet
1. Conversion
1.1. How to transform file encoding
Example with PHP files on Linux:
find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o /path/to/utf8_files/{} \;
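Where the iconv command-line tool is not available, the same conversion can be sketched in PHP with mb_convert_encoding(). This is only a minimal sketch: the helper name convertLatin1ToUtf8() and the file paths are hypothetical, and it assumes that every input really is ISO-8859-1.

```php
<?php
// Hypothetical helper: convert an ISO-8859-1 byte string to UTF-8.
function convertLatin1ToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical usage on one file (both paths are placeholders):
// file_put_contents('/path/to/utf8_files/a.php',
//     convertLatin1ToUtf8(file_get_contents('a.php')));

// "ñ" is the single byte F1 in ISO-8859-1 and the two bytes C3 B1 in UTF-8.
echo bin2hex(convertLatin1ToUtf8("\xF1")), "\n"; // c3b1
```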
1.2. How to transform character encoding in MySQL databases
Procedure [1] (use the INFORMATION_SCHEMA database to build a script automatically):
- create a temporary, identical structure in a new database,
- copy all data to that structure,
- drop the initial structure and
- recreate it with the new character encoding:
  CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
- check the maximum length of columns and index keys [2],
- copy all data from the temporary structure to the new structure, converting all texts to the new encoding, and finally
- drop the temporary structure.
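The procedure above can be sketched as a small PHP generator that emits the SQL statements for one table. This is an illustration under assumptions, not a tested migration tool: the function name buildConversionSql() and the database names are hypothetical, identifiers are assumed to contain no backticks, and foreign keys, views and triggers are not handled (CREATE TABLE ... LIKE does not copy foreign keys). The table names would typically come from a query such as SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = ? AND TABLE_TYPE = 'BASE TABLE'.

```php
<?php
/**
 * Hypothetical helper: build the SQL statements that convert one table.
 * $db is the database to convert, $tmpDb the temporary database.
 */
function buildConversionSql(string $db, string $tmpDb, string $table): array
{
    $src = "`$db`.`$table`";
    $tmp = "`$tmpDb`.`$table`";

    return [
        // 1. Temporary, identical structure in the other database.
        "CREATE TABLE $tmp LIKE $src",
        // 2. Copy all data to that structure.
        "INSERT INTO $tmp SELECT * FROM $src",
        // 3. Drop the initial structure.
        "DROP TABLE $src",
        // 4. Recreate it (still empty) and switch it to the new encoding.
        //    Check column lengths and index key sizes at this point.
        "CREATE TABLE $src LIKE $tmp",
        "ALTER TABLE $src CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci",
        // 5. Copy the data back; the server converts the text on insert.
        "INSERT INTO $src SELECT * FROM $tmp",
        // 6. Drop the temporary structure.
        "DROP TABLE $tmp",
    ];
}

// Print the script for a hypothetical table.
echo implode(";\n", buildConversionSql('app', 'app_tmp', 'users')), ";\n";
```

The ALTER TABLE here runs on an empty table, so it only changes the column definitions; the existing rows are converted by the final INSERT ... SELECT.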
2. Configuration
2.1. HTTP and HTML
In php.ini [1]:
default_charset = UTF-8
or in httpd.conf or .htaccess [5]:
AddDefaultCharset UTF-8
or in the PHP code [5]:
header('Content-type: text/html; charset=UTF-8');
Additionally, put this in your HTML <head> block:
<meta charset="utf-8">
2.2. PHP
In php.ini [1.1]:
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On
mbstring.http_input = auto
mbstring.http_output = UTF-8
mbstring.detect_order = auto
mbstring.substitute_character = none
or in httpd.conf or .htaccess:
php_value <php.ini directive> <value>
or in the PHP code [3] [1]:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_detect_order('auto');
mb_substitute_character('none');
2.3. Verifications [4]
Run this small PHP script:
<?php
if (!extension_loaded('mbstring')) {
    die('mb functions not loaded');
}
// With the /u modifier, "." must match the two-byte character "ñ" as one character.
if (1 != preg_match('/^.{1}$/u', "ñ", $UTF8_ar)) {
    die('PCRE is not compiled with UTF-8 support');
}
exit('ok');
3. MySQL Code
3.1. MySQL
Right after each connection, call [5] [2]:
SET NAMES 'utf8mb4';
SET collation_connection = 'utf8mb4_unicode_520_ci';
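From PHP, these statements can be sent right after the connection is opened. The following is a minimal sketch assuming PDO with the pdo_mysql driver; the helper name utf8InitStatements() is hypothetical, and the host, database name and credentials in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: the statements to send right after connecting.
function utf8InitStatements(
    string $charset = 'utf8mb4',
    string $collation = 'utf8mb4_unicode_520_ci'
): array {
    // Only accept plain identifiers, since the values are interpolated into SQL.
    foreach ([$charset, $collation] as $name) {
        if (!preg_match('/^[A-Za-z0-9_]+$/', $name)) {
            throw new InvalidArgumentException("Invalid name: $name");
        }
    }

    return [
        "SET NAMES '$charset'",
        "SET collation_connection = '$collation'",
    ];
}

// Usage with PDO (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', $user, $password);
// foreach (utf8InitStatements() as $sql) {
//     $pdo->exec($sql);
// }
```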
3.2. Ordering in MySQL
Ordering in MySQL or MariaDB depends on the collation you choose. Detailed information about this subject may be found in the documentation on MySQL.com [2]. Look especially at the Unicode Character Sets section.
Examples:
- utf8mb4_unicode_520_ci is based on the Unicode Collation Algorithm (UCA) 5.2.0 weight keys,
- utf8mb4_0900_ai_ci is based on the UCA 9.0.0 weight keys (MySQL 8.0 only; not available in MariaDB 10.4).
3.3. Stored Procedures and Functions
Old function:
CREATE FUNCTION example_function (parameter_name VARCHAR(255))
RETURNS VARCHAR(255)
READS SQL DATA
BEGIN
    DECLARE data VARCHAR(255);
    ...
    RETURN data;
END;
New function:
CREATE FUNCTION example_function (parameter_name VARCHAR(255) CHARACTER SET utf8mb4)
RETURNS VARCHAR(255) CHARACTER SET utf8mb4
READS SQL DATA
BEGIN
    DECLARE data VARCHAR(255) CHARACTER SET utf8mb4;
    ...
    RETURN data;
END;
4. PHP Code
4.1. Multibyte string functions [1]
Replace | With | Notes |
---|---|---|
ord() [6] | mb_ord() | |
str_pad() | mb_str_pad() | PHP 8.3 |
str_split() | mb_str_split() | |
strlen() | mb_strlen(), strlen() or mb_strwidth() | mb_strlen(): number of characters; strlen(): number of bytes; mb_strwidth(): display width in a monospaced font |
substr() | mb_substr() | |
strstr() stristr() | mb_strstr() mb_stristr() | |
strrchr() | mb_strrchr() [7] | |
strpos() stripos() strrpos() strripos() | mb_strpos() mb_stripos() mb_strrpos() mb_strripos() | |
strtolower() strtoupper() | mb_strtolower() mb_strtoupper() | |
substr_count() | mb_substr_count() [8] | |
trim() ltrim() rtrim() | mb_trim() mb_ltrim() mb_rtrim() | PHP 8.4 |
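The difference between the byte-oriented functions and their mb_* counterparts is easiest to see on a short string. The sketch below assumes that the source file is saved as UTF-8 and that the mbstring extension is loaded; the sample words are arbitrary.

```php
<?php
mb_internal_encoding('UTF-8');

$word = 'ñandú';

echo strlen($word), "\n";          // 7: bytes ("ñ" and "ú" take 2 bytes each)
echo mb_strlen($word), "\n";       // 5: characters
echo mb_strwidth('日本語'), "\n";   // 6: each full-width character counts as 2

echo mb_strtoupper($word), "\n";   // ÑANDÚ
echo mb_substr($word, 0, 1), "\n"; // ñ
echo strpos($word, 'ú'), "\n";     // 5: byte offset
echo mb_strpos($word, 'ú'), "\n";  // 4: character offset
```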
4.2. String access by character [1]
Search for every use of curly or square brackets to access a single character of a string (the curly-brace syntax was removed in PHP 8.0):
$string{$position} // old syntax
$string[$position] // new syntax
Regular expressions to find them:
/\$\w(\w|\d)*\{(\d+|\$\w(\w|\d)*)\}/
/\$\w(\w|\d)*\[(\d+|\$\w(\w|\d)*)\]/
Replace
$char = $string[$pos];
with
$char = mb_substr($string, $pos, 1);
Replace
$string[$pos] = $char;
with
$string = mb_substr($string, 0, $pos) . $char . mb_substr($string, $pos + 1);
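The two replacements above can be wrapped in small helpers. This is a minimal sketch; the names mbCharAt() and mbSetCharAt() are hypothetical, and the file is assumed to be saved as UTF-8.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical helper: read the character at a character position.
function mbCharAt(string $string, int $pos): string
{
    return mb_substr($string, $pos, 1);
}

// Hypothetical helper: return a copy with the character at $pos replaced.
function mbSetCharAt(string $string, int $pos, string $char): string
{
    return mb_substr($string, 0, $pos) . $char . mb_substr($string, $pos + 1);
}

$s = 'ñu';
echo strlen($s[0]), "\n";           // 1: $s[0] is only the first byte of "ñ"
echo mbCharAt($s, 0), "\n";         // ñ
echo mbSetCharAt($s, 1, 'o'), "\n"; // ño
```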
4.3. UTF-8-safe functions [9]
- addslashes()
- bin2hex()
- explode() [4]
- implode()
- nl2br()
- stripslashes()
- strip_tags()
- str_repeat()
- str_replace() [4]
4.4. Escaping functions
The functions htmlentities() [10] and htmlspecialchars() both have a third parameter, which sets the character set used during conversion. Unlike with the multibyte functions (mb_*()), this third parameter must be passed explicitly if the encoding is not 'UTF-8', no matter what the internal encoding is!
The functions urlencode() and rawurlencode() do not have any character encoding parameter. The safest solution is to put your UTF-8 strings in session variables instead of URL arguments.
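As a short illustration of both points, the sketch below passes the encoding explicitly to htmlspecialchars() and shows that rawurlencode() simply works on bytes. The sample string is arbitrary and the file is assumed to be saved as UTF-8.

```php
<?php
$text = '<a href="x">ñ</a>';

// The encoding is passed explicitly, so the result does not depend on php.ini.
$escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
echo $escaped, "\n"; // &lt;a href=&quot;x&quot;&gt;ñ&lt;/a&gt;

// rawurlencode() has no encoding parameter: every byte of a multibyte
// character is percent-encoded separately.
$encoded = rawurlencode('ñ');
echo $encoded, "\n"; // %C3%B1
```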
4.5. Comparing strings and sorting arrays
Use the Collator class:
https://www.php.net/manual/en/class.collator.php
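A minimal sketch of locale-aware sorting with Collator, compared with the byte-wise sort(). It assumes that the intl extension is installed; the helper name sortForLocale(), the locale and the word list are illustrative choices only.

```php
<?php
// Hypothetical helper; requires the intl extension.
function sortForLocale(array $words, string $locale): array
{
    $collator = new Collator($locale);
    $collator->sort($words); // sorts the array in place
    return $words;
}

$words = ['zebra', 'Éclair', 'apple'];

// Byte-wise order: "É" starts with byte C3, which sorts after "z".
$bytewise = $words;
sort($bytewise);
print_r($bytewise); // apple, zebra, Éclair

if (extension_loaded('intl')) {
    print_r(sortForLocale($words, 'en_US')); // apple, Éclair, zebra
}
```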
4.6. SimpleXML
SimpleXML uses UTF-8 internally and converts all XML content to UTF-8 [1.2], so usually nothing needs to be done.
4.7. PCRE functions [1]
Search for all PCRE function calls (preg_*) and append the /u pattern modifier [11].
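The effect of the /u modifier can be checked on a short string. This sketch assumes that the file is saved as UTF-8; the sample word is arbitrary.

```php
<?php
$word = 'ñandú';

// Without /u, "." matches single bytes; with /u, it matches whole characters.
$byteMatches = preg_match_all('/./', $word);
$charMatches = preg_match_all('/./u', $word);
echo $byteMatches, "\n"; // 7
echo $charMatches, "\n"; // 5

// Splitting a string into characters relies on /u as well.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
print_r($chars); // ñ, a, n, d, ú
```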
4.8. Storable representation of variables
The serialize() and unserialize() functions can be used transparently. However, be careful when reading or writing serialized UTF-8 strings in languages other than PHP [4].
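The reason for that caution shows in the serialized form itself: the length prefix counts bytes, not characters. A minimal sketch, assuming that the file is saved as UTF-8:

```php
<?php
$serialized = serialize('ñ');
echo $serialized, "\n"; // s:2:"ñ";  (2 is the byte length, not the character count)

// Within PHP the round trip is transparent.
$roundTrip = unserialize(serialize('ñandú'));
var_dump($roundTrip === 'ñandú'); // bool(true)
```

A reader written in another language has to interpret the prefix as a byte count, otherwise it will cut multibyte strings in the wrong place.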
4.9. String functions that are problematic and for which there is no built-in replacement function
Replace | With a function from the PHP UTF8 Library [3] | Comment |
---|---|---|
count_chars($string, $mode) | This function doesn't work if the first parameter is UTF-8. Write your own implementation. | |
sprintf() | The x and X type specifiers could be an issue, according to [4]. | |
str_ireplace($search, $replace, $subject [, &$count]) | utf8_ireplace($search, $replace, $subject [, &$count]) | Alternatively, write your own implementation using preg_replace(). |
str_pad($input, $length, $padStr, $type) | utf8_str_pad($input, $length, $padStr, $type) | mb_str_pad() is available since PHP 8.3. |
strcasecmp($str1, $str2) | Write your own implementation using collator_compare() and mb_strtolower() | |
strncmp($str1, $str2, $len) | Cut the two strings at the specified length, and use collator_compare() | |
strncasecmp($str1, $str2, $len) | Write your own implementation using your replacement of strncmp() and mb_strtolower() | |
strspn($str1, $str2[, $start[, $len]]) strcspn($str1, $str2[, $start[, $len]]) | utf8_strspn($str1, $str2[, $start[, $len]]) utf8_strcspn($str1, $str2[, $start[, $len]]) | |
strrev($string) | utf8_strrev($string) | |
strtr() | This function doesn't work if any parameter is UTF-8. Write your own implementation. | |
substr_replace() | utf8_substr_replace() | |
trim($str, $charlist) ltrim($str, $charlist) rtrim($str, $charlist) | utf8_trim($str, $charlist) utf8_ltrim($str, $charlist) utf8_rtrim($str, $charlist) | The original functions trim(), ltrim() and rtrim() are UTF-8-safe as long as the 2nd parameter is not used [4]. mb_trim(), mb_ltrim() and mb_rtrim() are available since PHP 8.4. |
lcfirst($str) ucfirst($str) ucwords($str) | utf8_ucfirst($str) utf8_ucwords($str) | mb_lcfirst() and mb_ucfirst() are available since PHP 8.4. |
wordwrap() | Write your own implementation | |
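Where the table says "write your own implementation", a few of the replacements can be sketched on top of mb_str_split() (PHP 7.4 and later). These helpers are hypothetical and are not part of the PHP UTF-8 library; they work on code points, so combining characters are treated as separate characters, and mbStrcasecmp() uses a plain strcmp() on lowercased strings instead of collator_compare().

```php
<?php
// Hypothetical replacement for strrev(): reverse by character, not by byte.
function mbStrrev(string $string): string
{
    return implode('', array_reverse(mb_str_split($string, 1, 'UTF-8')));
}

// Hypothetical replacement for count_chars(): occurrences of each character.
function mbCountChars(string $string): array
{
    return array_count_values(mb_str_split($string, 1, 'UTF-8'));
}

// Hypothetical replacement for strcasecmp(): byte-wise comparison after
// lowercasing; it is not locale-aware.
function mbStrcasecmp(string $a, string $b): int
{
    return strcmp(mb_strtolower($a, 'UTF-8'), mb_strtolower($b, 'UTF-8'));
}

echo mbStrrev('ñandú'), "\n";              // údnañ
print_r(mbCountChars('añaña'));            // a => 3, ñ => 2
var_dump(mbStrcasecmp('ÑANDÚ', 'ñandú'));  // int(0)
```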
4.10. Inline UTF-8 literal syntax
Unicode codepoint escape syntax: \u{[0-9A-Fa-f]+}, for example "\u{00F1}" for ñ (available since PHP 7.0).
This takes a Unicode codepoint in hexadecimal form and outputs that codepoint in UTF-8 inside a double-quoted string.
https://wiki.php.net/rfc/unicode_escape
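A short sketch of the escape syntax in use. The code points are arbitrary examples, and the comparison with the literal 'ñ' assumes that the file is saved as UTF-8.

```php
<?php
$n = "\u{00F1}";                  // ñ (U+00F1)
echo bin2hex($n), "\n";           // c3b1: the UTF-8 bytes of U+00F1
var_dump($n === 'ñ');             // bool(true)

// Code points above U+FFFF are written the same way and take 4 bytes in UTF-8.
$emoji = "\u{1F600}";
echo strlen($emoji), "\n";        // 4

// The escape is not interpreted in single-quoted strings.
$literal = '\u{00F1}';
echo strlen($literal), "\n";      // 8
```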
5. Sources
[1] PHP.net documentation
[1.1] PHP.net, Multibyte String Runtime Configuration, https://www.php.net/manual/en/[...]ng.configuration.php
[1.2] A comment about SimpleXML on PHP.net: https://www.php.net/manual/en/ref.simplexml.php#79258
[2] MySQL.com documentation
[3] Harry Fuecks, PHP UTF-8 library, https://sourceforge.net/projects/phputf8
[4] Web Application Component Toolkit, Handling UTF-8 with PHP, http://www.phpwact.org/php/i18n/utf-8
[5] W3C, Setting the HTTP charset parameter, https://www.w3.org/International/O-HTTP-charset.php
Footnotes:
- [1] Trying to change the character encoding of TEXT, CHAR and VARCHAR fields directly with an ALTER TABLE will corrupt existing data.
- [2] https://dev.mysql.com/doc/refm[...]code-conversion.html
- [3] Some php.ini directives cannot be modified in the PHP code.
- [4] Source: utf8.php in PHP UTF-8 library [3]
- [5] Once this is done, PHP sees MySQL databases as if each TEXT, CHAR or VARCHAR field were encoded in UTF-8, no matter what the actual encoding is. Thus, there is no need to prepare the encoding of query parameters or to convert the results in PHP.
- [6] Underlined parameters fail if they are UTF-8-encoded.
- [7] Note that the mb_strrchr() function has additional optional arguments, which may be ignored since we just want to adapt existing function calls. Note also that there is an mb_strrichr() function, which has no equivalent in the standard PHP functions.
- [8] Be careful because the 3rd and 4th arguments of substr_count() no longer exist with mb_substr_count(). You can use mb_substr() to circumvent this limitation.
- [9] The strcmp() function is UTF-8 safe as well. However, to perform a locale-aware comparison, use Collator::compare instead: https://www.php.net/manual/en/collator.compare.php
- [10] The function htmlentities() converts Latin characters only [todo source]. Moreover, according to Handling UTF-8 with PHP [4]: “Using [htmlentities] on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output”, and “when using UTF-8, you don’t need entities”.
- [11] However, there may still be problems, as explained in Handling UTF-8 with PHP [4].