Unicode Cheat Sheet

Unicode (UTF-8) with PHP 8, MariaDB / MySQL and HTML5 Cheat Sheet

1. Conversion

1.1. How to transform file encoding

Example with PHP files on Linux:

find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o /path/to/utf8_files/{} \;
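When a single string rather than a whole file tree has to be converted, the same transformation can be sketched in PHP with mb_convert_encoding(). The helper name latin1ToUtf8() is hypothetical and only serves the illustration; it assumes the input really is ISO-8859-1.

```php
<?php
// Hypothetical helper: re-encode one ISO-8859-1 string as UTF-8.
function latin1ToUtf8(string $latin1): string
{
    // Argument order: string, target encoding, source encoding.
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

// "\xE9" is "é" in ISO-8859-1; in UTF-8 it becomes the two bytes C3 A9.
echo bin2hex(latin1ToUtf8("caf\xE9")), PHP_EOL; // 636166c3a9
```

Pure ASCII input comes back unchanged, because ASCII is encoded identically in both character sets.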

1.2. How to transform character encoding in MySQL databases

Procedure ^[1] (use the INFORMATION_SCHEMA database to build a script automatically):

create a temporary, identical structure in a new database,
copy all data to that structure,
drop the initial structure and recreate it with the new character encoding:

CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci

check the maximum length of columns and index keys ^[2],
copy all data from the temporary structure to the new structure, converting all text to the new encoding, and finally
drop the temporary structure.
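The procedure above can be sketched as a small PHP generator that emits the statements for one table. This is only an illustration under stated assumptions: the helper name buildConversionStatements() and the database names are hypothetical, the table list would come from INFORMATION_SCHEMA.TABLES, the temporary database must already exist, and the "recreate" step is approximated by CREATE TABLE ... LIKE followed by ALTER TABLE ... CONVERT TO on the still-empty table. Foreign keys, triggers and the manual check of column and index lengths are not handled.

```php
<?php
// Hypothetical helper: builds the conversion statements for ONE table.
// $db is the original database, $tmpDb the temporary one (assumed to exist).
function buildConversionStatements(string $db, string $tmpDb, string $table): array
{
    $src = "`$db`.`$table`";
    $tmp = "`$tmpDb`.`$table`";

    return [
        // 1. Temporary, identical structure in the other database.
        "CREATE TABLE $tmp LIKE $src",
        // 2. Copy all data into it.
        "INSERT INTO $tmp SELECT * FROM $src",
        // 3. Drop the initial table and recreate it, still empty,
        //    with the new character set and collation.
        "DROP TABLE $src",
        "CREATE TABLE $src LIKE $tmp",
        "ALTER TABLE $src CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci",
        // 4. Copy the data back; the server converts the text columns on insert.
        "INSERT INTO $src SELECT * FROM $tmp",
        // 5. Drop the temporary table.
        "DROP TABLE $tmp",
    ];
}

foreach (buildConversionStatements('app', 'app_tmp', 'users') as $sql) {
    echo $sql, ';', PHP_EOL;
}
```

Printing the statements instead of executing them leaves room to review the script, and to adjust column or index lengths, before anything is dropped.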

2. Configuration

2.1. HTTP and HTML

In php.ini [1]:

default_charset = UTF-8

or in httpd.conf or .htaccess [5]:

AddDefaultCharset UTF-8

or in the PHP code [5]:

header('Content-type: text/html; charset=UTF-8');

Additionally, put this in your HTML <head> block:

<meta charset="utf-8">

2.2. PHP

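In php.ini [1.1] (note that mbstring.internal_encoding, mbstring.http_input and mbstring.http_output are deprecated since PHP 5.6; when they are left empty, default_charset from section 2.1 is used instead):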

mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On
mbstring.http_input = auto
mbstring.http_output = UTF-8
mbstring.detect_order = auto
mbstring.substitute_character = none

or in httpd.conf or .htaccess:

php_value <php.ini directive> <value>

or in the PHP code ^[3] [1]:

mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_detect_order('auto');
mb_substitute_character('none');

2.3. Verifications ^[4]

Run this small PHP script:

<?php

if ( ! extension_loaded('mbstring'))
{
    die('mb functions not loaded');
}

if (1 != preg_match('/^.{1}$/u', "ñ", $UTF8_ar))
{
    die('PCRE is not compiled with UTF-8 support');
}

exit('ok');

3. MySQL Code

3.1. MySQL

Right after each connection, call ^[5] [2]:

SET NAMES 'utf8mb4';

SET collation_connection = 'utf8mb4_unicode_520_ci';
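A minimal sketch of how this could look from PHP with PDO, assuming the pdo_mysql driver. The helper names utf8mb4Dsn() and connectionInitStatements() are hypothetical; the charset option of the DSN is part of the PDO MySQL driver and sets the connection character set at connect time.

```php
<?php
// Hypothetical helper: DSN asking the PDO MySQL driver for a utf8mb4 connection.
function utf8mb4Dsn(string $host, string $dbName): string
{
    return "mysql:host=$host;dbname=$dbName;charset=utf8mb4";
}

// Hypothetical helper: the two statements from this section,
// to be executed right after connecting.
function connectionInitStatements(): array
{
    return [
        "SET NAMES 'utf8mb4'",
        "SET collation_connection = 'utf8mb4_unicode_520_ci'",
    ];
}

echo utf8mb4Dsn('localhost', 'app'), PHP_EOL;

// Usage sketch (needs a reachable server, so it is left commented out):
// $pdo = new PDO(utf8mb4Dsn('localhost', 'app'), $user, $password);
// foreach (connectionInitStatements() as $sql) {
//     $pdo->exec($sql);
// }
```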

3.2. Ordering in MySQL

Ordering in MySQL or MariaDB depends on the collation you choose. Detailed information about this subject may be found in the documentation on MySQL.com [2]. Look especially at the Unicode Character Sets^[link1] section.

Examples:

utf8mb4_unicode_520_ci is based on UCA 5.2.0 weight keys (UCA 5.2.0^[link2]),
utf8mb4_0900_ai_ci is based on UCA 9.0.0 weight keys (UCA 9.0.0^[link3]).

3.3. Stored Procedures and Functions

Old function:

CREATE FUNCTION example_function (
  parameter_name VARCHAR(255)
)
RETURNS VARCHAR(255)
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255);
  ...
  RETURN data;
END;

New function:

CREATE FUNCTION example_function (
  parameter_name VARCHAR(255)
  CHARACTER SET utf8mb4
)
RETURNS VARCHAR(255)
  CHARACTER SET utf8mb4
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255)
  CHARACTER SET utf8mb4;
  ...
  RETURN data;
END;

4. PHP Code

4.1. Multibyte string functions [1]

Replace	With	Notes
`lcfirst()` `ucfirst()`	`mb_lcfirst()` `mb_ucfirst()`	PHP 8.4
`ord()`	`mb_ord()`	`ord()` returns only the first byte, so it fails on UTF-8-encoded parameters.
`str_pad()`	`mb_str_pad()`	PHP 8.3
`str_split()`	`mb_str_split()`
`strlen()`	`mb_strlen()` `strlen()` `mb_strwidth()`	`mb_strlen()`: number of characters. `strlen()`: number of bytes. `mb_strwidth()`: display width in a monospaced font.
`substr()`	`mb_substr()`
`strstr()` `stristr()`	`mb_strstr()` `mb_stristr()`
`strrchr()`	`mb_strrchr()`	`mb_strrchr()` accepts additional optional arguments, which can be ignored when adapting existing function calls. There is also a `mb_strrichr()` function, which has no equivalent among the standard PHP functions.
`strpos()` `stripos()` `strrpos()` `strripos()`	`mb_strpos()` `mb_stripos()` `mb_strrpos()` `mb_strripos()`
`strtolower()` `strtoupper()`	`mb_strtolower()` `mb_strtoupper()`
`substr_count()`	`mb_substr_count()`	The 3rd and 4th arguments of `substr_count()` (offset and length) do not exist in `mb_substr_count()`. Apply `mb_substr()` to the haystack first to work around this limitation.
`trim()` `ltrim()` `rtrim()`	`mb_trim()` `mb_ltrim()` `mb_rtrim()`	PHP 8.4
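The difference between the byte-oriented functions and their mb_ counterparts can be illustrated with a short sketch. It assumes the file is saved as UTF-8 and the mbstring extension is loaded; the sample words are arbitrary.

```php
<?php
// Assumes this file is saved as UTF-8 and mbstring is loaded.
mb_internal_encoding('UTF-8');

$word = 'ñandú'; // 5 characters, 7 bytes ("ñ" and "ú" take 2 bytes each)

echo strlen($word), PHP_EOL;          // 7 (bytes)
echo mb_strlen($word), PHP_EOL;       // 5 (characters)
echo mb_strwidth('日本語'), PHP_EOL;  // 6 (each wide character counts as 2)
echo mb_substr($word, 0, 2), PHP_EOL; // ña
echo mb_strtoupper($word), PHP_EOL;   // ÑANDÚ
echo strpos($word, 'd'), PHP_EOL;     // 4 (byte offset)
echo mb_strpos($word, 'd'), PHP_EOL;  // 3 (character offset)
```

Offsets returned by the byte-oriented functions should not be mixed with the mb_ functions: as the last two lines show, the same position has two different values.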

4.2. String access by character [1]

Search for all uses of curly or square brackets that extract single characters from strings:

$string{$position} // old syntax, removed in PHP 8.0
$string[$position] // new syntax

Regular expressions to find them:

/\$\w(\w|\d)*\{(\d+|\$\w(\w|\d)*)\}/
/\$\w(\w|\d)*\[(\d+|\$\w(\w|\d)*)\]/

Replace

$char = $string[$pos];

with

$char = mb_substr($string, $pos, 1);

Replace

$string[$pos] = $char;

with

$string = mb_substr($string, 0, $pos) . $char . mb_substr($string, $pos + 1);
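A short sketch of why the replacement is needed: bracket access returns a single byte, not a character. The helper name mbSetChar() is hypothetical and simply wraps the replacement pattern above; the file is assumed to be saved as UTF-8.

```php
<?php
// Assumes this file is saved as UTF-8 and mbstring is loaded.
mb_internal_encoding('UTF-8');

$string = 'ñu';

// Bracket access works on bytes: this is only the first half of "ñ".
echo bin2hex($string[0]), PHP_EOL;       // c3
// mb_substr() works on characters.
echo mb_substr($string, 0, 1), PHP_EOL;  // ñ

// Hypothetical helper: replace the character at position $pos.
function mbSetChar(string $string, int $pos, string $char): string
{
    return mb_substr($string, 0, $pos) . $char . mb_substr($string, $pos + 1);
}

echo mbSetChar('niño', 2, 'n'), PHP_EOL; // nino
```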

4.3. UTF-8-safe functions ^[6]

addslashes()

bin2hex()

explode() [4]

implode()

nl2br()

stripslashes()

strip_tags()

str_repeat()

str_replace() [4]
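These functions are safe because, in well-formed UTF-8, the byte sequence of one character can never start in the middle of another character, so a byte-wise search for a valid UTF-8 needle only matches at character boundaries. A minimal sketch, assuming the file is saved as UTF-8 and all strings are valid UTF-8:

```php
<?php
// Assumes this file is saved as UTF-8.
$parts = explode(',', 'é,ñ,日本');

echo count($parts), PHP_EOL;                // 3
echo implode(' | ', $parts), PHP_EOL;       // é | ñ | 日本
echo str_replace('ñ', 'n', 'año'), PHP_EOL; // ano
echo str_repeat('é', 3), PHP_EOL;           // ééé
```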

4.4. Escaping functions

The functions htmlentities() ^[7] and htmlspecialchars() both have a third parameter, which specifies the character set used during conversion. Unlike the multibyte functions (mb_*()), they do not use the mbstring internal encoding, so this third parameter must be passed explicitly whenever the strings are not UTF-8.
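A short sketch of the third parameter in action, assuming the file is saved as UTF-8. The last two calls show what can go wrong when the declared character set does not match the data: "\xF1" is "ñ" in ISO-8859-1 but an invalid sequence in UTF-8.

```php
<?php
// Assumes this file is saved as UTF-8.
$text = '<b>año & "niño"</b>';

echo htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), PHP_EOL;
// &lt;b&gt;año &amp; &quot;niño&quot;&lt;/b&gt;

echo htmlentities('ñ', ENT_QUOTES, 'UTF-8'), PHP_EOL; // &ntilde;

// ISO-8859-1 input declared correctly: returned unchanged.
$latin1 = "a\xF1o";
var_dump(htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1') === $latin1); // bool(true)

// The same bytes declared as UTF-8: invalid input gives an empty string
// (unless ENT_SUBSTITUTE or ENT_IGNORE is set).
var_dump(htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8')); // string(0) ""
```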

The functions urlencode() and rawurlencode() do not have any character encoding parameter. The safest solution is to put your UTF-8 strings in session variables instead of URL arguments.
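For illustration, both functions simply percent-encode the bytes they are given, so a UTF-8 string produces percent-encoded UTF-8 bytes and the receiving side has to assume UTF-8 when decoding. A minimal sketch, assuming the file is saved as UTF-8:

```php
<?php
// Assumes this file is saved as UTF-8: "ñ" is the two bytes C3 B1.
echo rawurlencode('año nuevo'), PHP_EOL;        // a%C3%B1o%20nuevo
echo urlencode('año nuevo'), PHP_EOL;           // a%C3%B1o+nuevo
echo rawurldecode('a%C3%B1o%20nuevo'), PHP_EOL; // año nuevo
```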

4.5. Comparing strings and sorting arrays

Use the Collator class:
https://www.php.net/manual/en/class.collator.php
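A minimal sketch comparing byte-order sorting with Collator sorting. It assumes the file is saved as UTF-8; the Collator part only runs if the intl extension is available, and the locale 'fr_FR' and the sample words are arbitrary choices for the illustration.

```php
<?php
// Assumes this file is saved as UTF-8.
$words = ['zebra', 'élan', 'apple'];

// Byte-order sort: "é" starts with byte 0xC3, which sorts after "z".
$byBytes = $words;
sort($byBytes, SORT_STRING);
echo implode(', ', $byBytes), PHP_EOL; // apple, zebra, élan

// Locale-aware sort with the intl extension, if it is loaded.
if (class_exists('Collator')) {
    $collator = new Collator('fr_FR');
    $byLocale = $words;
    $collator->sort($byLocale);
    echo implode(', ', $byLocale), PHP_EOL; // apple, élan, zebra

    // compare() returns a negative, zero or positive integer.
    var_dump($collator->compare('élan', 'zebra') < 0); // bool(true)
}
```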

4.6. SimpleXML

SimpleXML uses UTF-8 internally and converts all XML content to UTF-8 [1.2], so usually nothing needs to be done.

4.7. PCRE functions [1]

Search for all PCRE function calls (preg_*) and append the /u pattern modifier ^[8].
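The effect of the /u modifier can be seen in a short sketch, assuming the file is saved as UTF-8. Without /u the dot matches a single byte; with /u it matches a whole character, and invalid UTF-8 subjects make the call fail.

```php
<?php
// Assumes this file is saved as UTF-8: "ñ" is two bytes.
var_dump(preg_match('/^.$/', 'ñ'));  // int(0): "." matched one byte only
var_dump(preg_match('/^.$/u', 'ñ')); // int(1): "." matched the whole character

// Splitting a string into characters with an empty /u pattern.
$chars = preg_split('//u', 'ñu', -1, PREG_SPLIT_NO_EMPTY);
echo implode(' ', $chars), PHP_EOL;  // ñ u

// With /u, a subject that is not valid UTF-8 makes the function return false.
var_dump(preg_match('/./u', "\xF1")); // bool(false)
```

Checking the return value against false, and not only against 0, is therefore worthwhile once /u is in use.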

4.8. Storable representation of variables

The serialize() and unserialize() functions can be used transparently. However, be careful when reading or writing serialized UTF-8 strings with languages other than PHP [4].
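The reason for this caution can be seen in a short sketch: the length stored in the serialized form is a byte count, not a character count, so a reader in another language that counts characters will misparse it. The file is assumed to be saved as UTF-8.

```php
<?php
// Assumes this file is saved as UTF-8: "ñ" is two bytes.
echo serialize('ñ'), PHP_EOL;     // s:2:"ñ";
echo serialize('ñandú'), PHP_EOL; // s:7:"ñandú";

// Within PHP the round trip is transparent.
var_dump(unserialize(serialize('ñandú')) === 'ñandú'); // bool(true)
```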

4.9. String functions that are problematic and for which there is no built-in replacement function

Replace	With a function from the PHP UTF8 Library [3]	Comment
`count_chars($string, $mode)`		This function doesn't work if the first parameter is UTF-8. Write your own implementation.
`sprintf()`		The x and X type specifiers could be an issue, according to [4].
`str_ireplace($search, $replace, $subject [, &$count])`	`utf8_ireplace($search, $replace, $subject [, &$count])`	Alternatively, write your own implementation using `preg_replace()`.
`str_pad($input, $length, $padStr, $type)`	`utf8_str_pad($input, $length, $padStr, $type)`	`mb_str_pad()` available with PHP 8.3
`strcasecmp($str1, $str2)`		Write your own implementation using `collator_compare()` and `mb_strtolower()`
`strncmp($str1, $str2, $len)`		Cut the two strings at the specified length, and use `collator_compare()`
`strncasecmp($str1, $str2, $len)`		Write your own implementation using your replacement of `strncmp()` and `mb_strtolower()`
`strspn($str1, $str2[, $start[, $len]])` `strcspn($str1, $str2[, $start[, $len]])`	`utf8_strspn($str1, $str2[, $start[, $len]])` `utf8_strcspn($str1, $str2[, $start[, $len]])`
`strrev($string)`	`utf8_strrev($string)`
`strtr()`		This function doesn't work if any parameter is UTF-8. Write your own implementation.
`substr_replace()`	`utf8_substr_replace()`
`trim($str, $charlist)` `ltrim($str, $charlist)` `rtrim($str, $charlist)`	`utf8_trim($str, $charlist)` `utf8_ltrim($str, $charlist)` `utf8_rtrim($str, $charlist)`	The original functions `trim()`, `ltrim()` and `rtrim()` are UTF-8-safe as long as the 2nd parameter is not used [4]. `mb_trim()`, `mb_ltrim()` and `mb_rtrim()` are available with PHP 8.4.
`lcfirst($str)` `ucfirst($str)` `ucwords($str)`	`utf8_ucfirst($str)` `utf8_ucwords($str)`	`mb_lcfirst()` `mb_ucfirst()` available with PHP 8.4
`wordwrap()`		Write your own implementation
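Where the table says "write your own implementation", mb_str_split() is often enough as a building block. The following is only a sketch: the helper names mbStrrev() and mbCountChars() are hypothetical and not part of the PHP UTF-8 library, and both work on code points, so strings containing combining characters would need extra care.

```php
<?php
// Assumes this file is saved as UTF-8 and mbstring is loaded.

// Hypothetical replacement for strrev(): reverse by character, not by byte.
function mbStrrev(string $string): string
{
    return implode('', array_reverse(mb_str_split($string, 1, 'UTF-8')));
}

// Hypothetical replacement for count_chars(): occurrences of each character.
function mbCountChars(string $string): array
{
    return array_count_values(mb_str_split($string, 1, 'UTF-8'));
}

echo mbStrrev('añu'), PHP_EOL; // uña
print_r(mbCountChars('aña'));  // a => 2, ñ => 1
```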

4.10. Inline UTF-8 literal syntax

Unicode codepoint escape syntax: \u{[0-9A-Fa-f]+}, e.g. \u{F1}.

This takes a Unicode codepoint in hexadecimal form and outputs that codepoint, encoded in UTF-8, inside a double-quoted string or heredoc.

https://wiki.php.net/rfc/unicode_escape
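A short sketch of the escape syntax; the code points are arbitrary examples. Because the characters are written as escapes, the result does not depend on the encoding the source file is saved in.

```php
<?php
echo bin2hex("\u{F1}"), PHP_EOL;    // c3b1 (U+00F1, "ñ", as two UTF-8 bytes)
echo strlen("\u{1F600}"), PHP_EOL;  // 4 (code points above U+FFFF take four bytes)
echo mb_strlen("\u{1F600}", 'UTF-8'), PHP_EOL; // 1

// Single-quoted strings do not interpret the escape.
echo '\u{F1}', PHP_EOL;             // \u{F1}
```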

5. Sources

[1] PHP.net documentation
[1.1] PHP.net, Multibyte String Runtime Configuration, https://www.php.net/manual/en/[...]ng.configuration.php^[link4]
[1.2] A comment about SimpleXML on PHP.net: https://www.php.net/manual/en/ref.simplexml.php#79258
[2] MySQL.com documentation
[3] Harry Fuecks, PHP UTF-8 library, https://sourceforge.net/projects/phputf8
[4] Web Application Component Toolkit, Handling UTF-8 with PHP, http://www.phpwact.org/php/i18n/utf-8^[link5]
[5] W3C, Setting the HTTP charset parameter, https://www.w3.org/International/O-HTTP-charset.php

Footnotes:

[1]: Trying to change the character encoding of TEXT, CHAR and VARCHAR fields directly with an ALTER TABLE will corrupt existing data.
[2]: https://dev.mysql.com/doc/refm[...]code-conversion.html^[link6]
[3]: Some php.ini directives cannot be modified in the PHP code.
[4]: Source: utf8.php in PHP UTF-8 library [3]
[5]: Once this is done, PHP sees MySQL databases as if each TEXT, CHAR or VARCHAR field were encoded in UTF-8, no matter what the actual encoding is. Thus, there is no need to prepare the encoding of query parameters or to convert the results in PHP.
[6]: The strcmp() function is UTF-8 safe as well. However, to perform a locale-aware comparison, use Collator::compare instead: https://www.php.net/manual/en/collator.compare.php
[7]: The function htmlentities() converts Latin characters only [todo source]. Moreover, according to Handling UTF-8 with PHP [4]: “Using [htmlentities] on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output”, and “when using UTF-8, you don’t need entities”.
[8]: However, there may still be problems, as explained in Handling UTF-8 with PHP [4].

Links

[link1] https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
[link2] http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt
[link3] http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt
[link4] https://www.php.net/manual/en/mbstring.configuration.php
[link5] https://web.archive.org/web/20070319193643/http://www.phpwact.org/php/i18n/utf-8
[link6] https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-conversion.html