Unicode Cheat Sheet

Unicode (UTF-8) with PHP 8.0, MariaDB 10.4 / MySQL 8.0 and HTML5 Cheat Sheet



1. Conversion

1.1. How to transform file encoding

Example with PHP files on Linux:

find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o /path/to/utf8_files/{} \;

1.2. How to transform character encoding in MySQL databases

Procedure [1] (use the INFORMATION_SCHEMA database to build a script automatically):

  • create a temporary, identical structure in a new database,
  • copy all data to that structure,
  • drop the initial structure and
  • recreate it with the new character encoding:
    • CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
  • Check the maximum length of columns and index keys [2]
  • Copy all data from the temporary structure to the new structure, converting all texts to the new encoding, and finally
  • Drop the temporary structure.
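
The procedure above can be sketched as a small PHP generator that emits the SQL for one table. This is only an illustration under stated assumptions: the function name buildConversionScript() and the database and table names are hypothetical, and instead of dropping and recreating the structure it empties the table and converts it in place with ALTER TABLE ... CONVERT TO while it holds no data. Foreign keys and over-long index keys are not handled.

```php
<?php
declare(strict_types=1);

/**
 * Build the SQL statements that convert one table to utf8mb4 by way of
 * a temporary copy. All names are placeholders supplied by the caller.
 * Assumption: the original table is emptied and converted in place,
 * rather than dropped and recreated from a full CREATE TABLE statement.
 */
function buildConversionScript(string $db, string $tmpDb, string $table): array
{
    // Quote identifiers with backticks, doubling any embedded backtick.
    $q = static fn(string $name): string => '`' . str_replace('`', '``', $name) . '`';
    $src = $q($db) . '.' . $q($table);
    $tmp = $q($tmpDb) . '.' . $q($table);

    return [
        "CREATE TABLE $tmp LIKE $src",           // temporary, identical structure
        "INSERT INTO $tmp SELECT * FROM $src",   // copy all data to it
        "TRUNCATE TABLE $src",                   // empty the initial structure
        "ALTER TABLE $src CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci",
        "INSERT INTO $src SELECT * FROM $tmp",   // copy back, converting the texts
        "DROP TABLE $tmp",                       // drop the temporary structure
    ];
}

foreach (buildConversionScript('app', 'app_tmp', 'users') as $sql) {
    echo $sql, ";\n";
}
```

The list of table names could come from a query on INFORMATION_SCHEMA.TABLES, with one call per table.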

2. Configuration

2.1. HTTP and HTML

In php.ini [1]:

default_charset = UTF-8	

or in httpd.conf or .htaccess [5]:

AddDefaultCharset UTF-8	

or in the PHP code [5]:

header('Content-type: text/html; charset=UTF-8');	

Additionally, put this in your HTML <head> block:

<meta charset="utf-8">	
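
As a rough sketch of how the header and the meta tag fit together in one script, the example below sends the HTTP header and then prints a minimal page. The helper name renderUtf8Page() and the page content are assumptions made for the illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: wrap a body fragment in a minimal HTML5 page that declares UTF-8.
function renderUtf8Page(string $title, string $body): string
{
    $safeTitle = htmlspecialchars($title, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n"
         . "<title>$safeTitle</title>\n</head>\n<body>\n$body\n</body>\n</html>\n";
}

// Declare the encoding in the HTTP header; this must happen before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderUtf8Page("Caf\u{E9}", "<p>\u{2713} UTF-8</p>");
```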

2.2. PHP

In php.ini [1.1]:

mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On
mbstring.http_input = auto
mbstring.http_output = UTF-8
mbstring.detect_order = auto
mbstring.substitute_character = none	

or in httpd.conf or .htaccess:

php_value <php.ini directive> <value>	

or in the PHP code [3] [1]:

mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_detect_order('auto');
mb_substitute_character('none');	

2.3. Verifications [4]

Run this small PHP script:

<?php
 
if ( ! extension_loaded('mbstring'))
{
    die('mb functions not loaded');
}
 
if (1 != preg_match('/^.{1}$/u', "ñ", $UTF8_ar))
{
    die('PCRE is not compiled with UTF-8 support');
}
 
exit('ok');

3. MySQL Code

3.1. MySQL

Right after each connection, call [5] [2]:

SET NAMES 'utf8mb4';

SET collation_connection = 'utf8mb4_unicode_520_ci';
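
From PHP, these statements might be issued right after connecting, for instance through PDO. The sketch below is one possible arrangement, not a prescribed one: the function names, the DSN and the credentials are placeholders.

```php
<?php
declare(strict_types=1);

// The statements from this section, in the order they must run.
function utf8ConnectionStatements(): array
{
    return [
        "SET NAMES 'utf8mb4'",
        "SET collation_connection = 'utf8mb4_unicode_520_ci'",
    ];
}

// Hypothetical helper: open a PDO connection and apply the statements at once.
function openUtf8Connection(string $dsn, string $user, string $password): PDO
{
    $pdo = new PDO($dsn, $user, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
    foreach (utf8ConnectionStatements() as $sql) {
        $pdo->exec($sql);
    }

    return $pdo;
}

// Usage (placeholder DSN and credentials):
// $pdo = openUtf8Connection('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'secret');
```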

3.2. Ordering in MySQL

Ordering in MySQL or MariaDB depends on the collation you choose. Detailed information about this subject may be found in the documentation on MySQL.com [2]. Look especially at the Unicode Character Sets section.


Examples:

  • utf8mb4_unicode_520_ci is based on the UCA 5.2.0 weight keys,
  • utf8mb4_0900_ai_ci is based on the UCA 9.0.0 weight keys.

3.3. Stored Procedures and Functions

Old function:

CREATE FUNCTION example_function (
  parameter_name VARCHAR(255)
)
RETURNS VARCHAR(255)
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255);
  ...
  RETURN data;
END;

New function:

CREATE FUNCTION example_function (
  parameter_name VARCHAR(255)
  CHARACTER SET utf8mb4
)
RETURNS VARCHAR(255)
  CHARACTER SET utf8mb4
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255)
  CHARACTER SET utf8mb4;
  ...
  RETURN data;
END;

4. PHP Code

4.1. Multibyte string functions [1]

Replace        | With                  | Notes
ord() [6]      | mb_ord()              |
str_pad()      | mb_str_pad()          | PHP 8.3
str_split()    | mb_str_split()        |
strlen()       | mb_strlen()           | // How many characters
               | strlen()              | // How many bytes
               | mb_strwidth()         | // Monotype characters
substr()       | mb_substr()           |
strstr()       | mb_strstr()           |
stristr()      | mb_stristr()          |
strrchr()      | mb_strrchr() [7]      |
strpos()       | mb_strpos()           |
stripos()      | mb_stripos()          |
strrpos()      | mb_strrpos()          |
strripos()     | mb_strripos()         |
strtolower()   | mb_strtolower()       |
strtoupper()   | mb_strtoupper()       |
substr_count() | mb_substr_count() [8] |
trim()         | mb_trim()             | PHP 8.4
ltrim()        | mb_ltrim()            | PHP 8.4
rtrim()        | mb_rtrim()            | PHP 8.4
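
A short illustration of why the replacements matter: the byte-oriented and multibyte functions give different answers on the same UTF-8 string. The sample strings are arbitrary, and the encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php
declare(strict_types=1);

$s = "\u{65E5}\u{672C}\u{8A9E}"; // three CJK characters, 3 bytes each in UTF-8

echo strlen($s), "\n";               // 9: bytes
echo mb_strlen($s, 'UTF-8'), "\n";   // 3: characters
echo mb_strwidth($s, 'UTF-8'), "\n"; // 6: each full-width character counts as 2 columns

echo mb_substr($s, 1, 1, 'UTF-8'), "\n";             // the second character
echo mb_strtoupper("\u{E9}t\u{E9}", 'UTF-8'), "\n";  // upper-cases the accented letters too
echo mb_strpos("na\u{EF}ve", 'v', 0, 'UTF-8'), "\n"; // 3 (strpos() would return 4)
```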

4.2. String access by character [1]

Search for every use of curly or square brackets to extract a single character from a string (the curly-brace form is no longer valid as of PHP 8.0):

$string{$position} // old syntax
$string[$position] // new syntax	

Regular expressions to find them:

/\$\w(\w|\d)*\{(\d+|\$\w(\w|\d)*)\}/
/\$\w(\w|\d)*\[(\d+|\$\w(\w|\d)*)\]/	

Replace

$char = $string[$pos];	

with

$char = mb_substr($string, $pos, 1);	

Replace

$string[$pos] = $char;	

with

$string = mb_substr($string, 0, $pos) . $char . mb_substr($string, $pos + 1);	
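
The two replacements above could be wrapped in small helpers, sketched here. The names mbCharAt() and mbSetCharAt() are hypothetical and not part of PHP or of any library cited in this sheet.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: read the character at a character (not byte) position.
function mbCharAt(string $string, int $pos): string
{
    return mb_substr($string, $pos, 1, 'UTF-8');
}

// Hypothetical helper: return a copy of $string with the character at $pos replaced.
function mbSetCharAt(string $string, int $pos, string $char): string
{
    return mb_substr($string, 0, $pos, 'UTF-8')
         . $char
         . mb_substr($string, $pos + 1, null, 'UTF-8');
}

$word = "pi\u{F1}ata";

var_dump($word[2] === "\u{F1}");           // bool(false): $word[2] is only the first byte
var_dump(mbCharAt($word, 2) === "\u{F1}"); // bool(true)
echo mbSetCharAt($word, 2, 'n'), "\n";     // pinata
```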

4.3. UTF-8-safe functions [9]


addslashes()
bin2hex()
explode() [4]
implode()
nl2br()
stripslashes()
strip_tags()
str_repeat()
str_replace() [4]

4.4. Escaping functions

The functions htmlentities() [10] and htmlspecialchars() both take a third parameter that names the character set used during conversion. Unlike the multibyte functions (mb_*()), they do not follow the mbstring internal encoding: pass this third parameter explicitly whenever the string is not encoded in 'UTF-8'.
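
A minimal sketch of an escaping wrapper that always names the character set. The helper name escapeHtml() and the choice of flags are assumptions, not a recommendation taken from the sources.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: escape a UTF-8 string for HTML output, naming the charset explicitly.
function escapeHtml(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');
}

echo escapeHtml("<b>Caf\u{E9}</b> & \"co\""), "\n";
// &lt;b&gt;Café&lt;/b&gt; &amp; &quot;co&quot;
```

Only the HTML special characters are replaced; the accented letter passes through unchanged.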


The functions urlencode() and rawurlencode() do not have any character encoding parameter. The safest solution is to put your UTF-8 strings in session variables instead of URL arguments.

4.5. Comparing strings and sorting arrays

Use the Collator class:
https://www.php.net/manual/en/class.collator.php
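
The difference between byte-wise and locale-aware sorting can be seen in a short sketch. It assumes the intl extension is loaded; the locale 'en_US' and the sample data are arbitrary choices.

```php
<?php
declare(strict_types=1);

$names = ["zebra", "\u{C9}mile", "apple"];

// Byte-wise sort: the accented capital starts with byte 0xC3,
// so it lands after every ASCII letter.
$bytewise = $names;
sort($bytewise, SORT_STRING);
print_r($bytewise); // apple, zebra, then the accented name

// Locale-aware sort; requires the intl extension.
$sorted = $names;
if (class_exists('Collator')) {
    $collator = new Collator('en_US');
    $collator->sort($sorted);
    print_r($sorted); // apple, then the accented name, then zebra
}
```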

4.6. SimpleXML

SimpleXML uses UTF-8 internally and converts all XML content to UTF-8 [1.2], so usually nothing needs to be done.
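
As an illustration of that conversion, the sketch below parses a small ISO-8859-1 document and checks that the value read back is UTF-8. The document content is made up for the example.

```php
<?php
declare(strict_types=1);

// An ISO-8859-1 document: "\xF1" is the single Latin-1 byte for n-tilde.
$latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<name>Espa\xF1a</name>";

$xml = simplexml_load_string($latin1Xml);
if ($xml === false) {
    exit("could not parse the document\n");
}

$name = (string) $xml;

var_dump($name === "Espa\u{F1}a"); // bool(true): the value comes back as UTF-8
echo strlen($name), "\n";          // 7 bytes: the accented letter now takes two
```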

4.7. PCRE functions [1]

Search for all PCRE function calls ( preg_* ) and append the /u pattern modifier [11].
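
A brief sketch of what the /u modifier changes, using a two-byte character as the subject:

```php
<?php
declare(strict_types=1);

$s = "\u{F1}"; // one character, two bytes in UTF-8

var_dump(preg_match('/^.$/', $s));  // int(0): without /u, "." matches a single byte
var_dump(preg_match('/^.$/u', $s)); // int(1): with /u, "." matches a whole character

// Splitting a string into characters also needs /u.
$chars = preg_split('//u', "a\u{F1}o", -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n"; // 3
```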

4.8. Storable representation of variables

The serialize() and unserialize() functions can be used transparently. However, be careful when reading or writing serialized UTF-8 strings with languages other than PHP [4].
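
One reason for that caution is that the length stored in the serialized form counts bytes, not characters, as this short sketch shows:

```php
<?php
declare(strict_types=1);

$original = "a\u{F1}o"; // 3 characters, 4 bytes in UTF-8

$serialized = serialize($original);
echo $serialized, "\n"; // the prefix is s:4: because the length is counted in bytes

var_dump(unserialize($serialized) === $original); // bool(true)
```

A reader written in another language has to interpret that number as a byte length to parse the string correctly.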



4.9. String functions that are problematic and for which there is no built-in replacement function

Replace | With a function from the PHP UTF8 Library [3] | Comment
count_chars($string, $mode) | - | This function doesn't work if the string parameter is UTF-8. Write your own implementation.
sprintf() | - | The x and X type specifiers could be an issue, according to [4].
str_ireplace($search, $replace, $subject [, &$count]) | utf8_ireplace($search, $replace, $subject [, &$count]) | Alternatively, write your own implementation using preg_replace().
str_pad($input, $length, $padStr, $type) | utf8_str_pad($input, $length, $padStr, $type) | mb_str_pad() is available as of PHP 8.3.
strcasecmp($str1, $str2) | - | Write your own implementation using collator_compare() and mb_strtolower().
strncmp($str1, $str2, $len) | - | Cut the two strings at the specified length, and use collator_compare().
strncasecmp($str1, $str2, $len) | - | Write your own implementation using your replacement of strncmp() and mb_strtolower().
strspn($str1, $str2[, $start[, $len]]) | utf8_strspn($str1, $str2[, $start[, $len]]) |
strcspn($str1, $str2[, $start[, $len]]) | utf8_strcspn($str1, $str2[, $start[, $len]]) |
strrev($string) | utf8_strrev($string) |
strtr() | - | This function doesn't work if any parameter is UTF-8. Write your own implementation.
substr_replace() | utf8_substr_replace() |
trim($str, $charlist) | utf8_trim($str, $charlist) | The original functions trim(), ltrim() and rtrim() are UTF-8-safe as long as the 2nd parameter is not used [4]. mb_trim(), mb_ltrim() and mb_rtrim() are available as of PHP 8.4.
ltrim($str, $charlist) | utf8_ltrim($str, $charlist) | See trim().
rtrim($str, $charlist) | utf8_rtrim($str, $charlist) | See trim().
lcfirst($str) | - | mb_lcfirst() is available as of PHP 8.4.
ucfirst($str) | utf8_ucfirst($str) | mb_ucfirst() is available as of PHP 8.4.
ucwords($str) | utf8_ucwords($str) |
wordwrap() | - | Write your own implementation.
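
For the entries that say "write your own implementation", the following is a minimal sketch of what such replacements might look like. The function names are hypothetical, the default locale 'en_US' is an assumption, and the code works per code point, so combining marks and case mappings that change a string's length are not handled.

```php
<?php
declare(strict_types=1);

// Hypothetical replacement for strrev(): reverse by code point, not by byte.
function utf8Reverse(string $s): string
{
    return implode('', array_reverse(mb_str_split($s, 1, 'UTF-8')));
}

// Hypothetical replacement for strncmp(): cut both strings, then compare.
// Falls back to a byte-wise strcmp() when the intl extension is not loaded.
function utf8CompareN(string $a, string $b, int $len, string $locale = 'en_US'): int
{
    $a = mb_substr($a, 0, $len, 'UTF-8');
    $b = mb_substr($b, 0, $len, 'UTF-8');
    if (class_exists('Collator')) {
        return (int) (new Collator($locale))->compare($a, $b);
    }

    return strcmp($a, $b) <=> 0;
}

// Hypothetical replacement for strncasecmp(), built on the function above.
function utf8CompareNoCaseN(string $a, string $b, int $len, string $locale = 'en_US'): int
{
    return utf8CompareN(mb_strtolower($a, 'UTF-8'), mb_strtolower($b, 'UTF-8'), $len, $locale);
}

echo utf8Reverse("a\u{F1}o"), "\n";                              // the accented letter stays intact
echo utf8CompareNoCaseN("\u{C9}TE", "\u{E9}t\u{E9}", 2), "\n";   // 0: the first two characters match
```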

4.10. Inline UTF-8 literal syntax

Unicode codepoint escape syntax \u, e.g. \u{[0-9A-Fa-f]+}.


This takes a Unicode codepoint in hexadecimal form, and outputs that codepoint in UTF-8 to a double-quoted string.
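
A short sketch of the escape syntax in use; bin2hex() shows the UTF-8 bytes that end up in the string:

```php
<?php
declare(strict_types=1);

$enye = "\u{F1}";    // U+00F1
$grin = "\u{1F600}"; // U+1F600, outside the Basic Multilingual Plane

echo bin2hex($enye), "\n"; // c3b1
echo bin2hex($grin), "\n"; // f09f9880
echo '\u{F1}', "\n";       // single quotes: the escape is not interpreted
```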


https://wiki.php.net/rfc/unicode_escape

5. Sources

[1] PHP.net documentation
[1.1] PHP.net, Multibyte String Runtime Configuration, https://www.php.net/manual/en/[...]ng.configuration.php
[1.2] A comment about SimpleXML on PHP.net: https://www.php.net/manual/en/ref.simplexml.php#79258
[2] MySQL.com documentation
[3] Harry Fuecks, PHP UTF-8 library, https://sourceforge.net/projects/phputf8
[4] Web Application Component Toolkit, Handling UTF-8 with PHP, http://www.phpwact.org/php/i18n/utf-8
[5] W3C, Setting the HTTP charset parameter, https://www.w3.org/International/O-HTTP-charset.php


Footnotes:

[1]
Trying to change the character encoding of TEXT, CHAR and VARCHAR fields directly with an ALTER TABLE will corrupt existing data.
[2]
https://dev.mysql.com/doc/refm[...]code-conversion.html
[3]
Some php.ini directives cannot be modified in the PHP code.
[4]
Source: utf8.php in PHP UTF-8 library [3]
[5]
Once this is done, PHP sees MySQL databases as if each TEXT, CHAR or VARCHAR field were encoded in UTF-8, no matter what the actual encoding is. Thus, there is no need to prepare the encoding of query parameters or to convert the results in PHP.
[6]
Underlined parameters fail if they are UTF-8-encoded
[7]
Note that the mb_strrchr() function has one additional argument, which may be ignored since we just want to adapt existing function calls. Note also that there is a mb_strrichr() function, which has no equivalent in standard PHP functions.
[8]
Be careful because the 3rd and 4th arguments of substr_count() no longer exist with mb_substr_count(). You can use mb_substr() to circumvent this limitation.
[9]
The strcmp() function is UTF-8 safe as well. However, to perform a locale-aware comparison, use Collator::compare instead: https://www.php.net/manual/en/collator.compare.php
[10]
The function htmlentities() converts Latin characters only [todo source]. Moreover, according to Handling UTF-8 with PHP [4]: “Using [htmlentities] on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output”, and “when using UTF-8, you don’t need entities”.
[11]
However, there may still be problems, as explained in Handling UTF-8 with PHP [4].