Unicode (UTF-8) with PHP 8, MariaDB / MySQL and HTML5 Cheat Sheet

{{toc numerate=1}}

===Conversion===

====How to transform file encoding====
Example with PHP files on Linux:
%%
find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o /path/to/utf8_files/{} \;
%%
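
The same conversion can be sketched in PHP with ##mb_convert_encoding()##. This is only a minimal sketch, assuming the input really is ISO-8859-1; the helper name ##latin1ToUtf8()## is hypothetical and not part of PHP.

%%(hl php)
<?php

// Hypothetical helper: converts an ISO-8859-1 byte string to UTF-8.
function latin1ToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// "\xF1" is "ñ" in ISO-8859-1; in UTF-8 it becomes the two bytes C3 B1.
$converted = latin1ToUtf8("ma\xF1ana");

echo bin2hex($converted), "\n"; // 6d61c3b1616e61
%%

To convert a whole file, the helper could be applied to the result of ##file_get_contents()## and the output written with ##file_put_contents()##.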

====How to transform character encoding in MySQL databases====
Procedure [[^ Trying to change the character encoding of TEXT, CHAR and VARCHAR fields directly with an ALTER TABLE will corrupt existing data.]] (use the INFORMATION_SCHEMA database to build a script automatically):
  * Create a temporary, identical structure in a new database.
  * Copy all data to that structure.
  * Drop the initial structure.
  * Recreate it with the new character encoding:
    * %%(hl sql)CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci%%
  * Check the maximum length of columns and index keys [[^ https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-conversion.html]].
  * Copy all data from the temporary structure to the new structure, converting all text to the new encoding.
  * Drop the temporary structure.
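
The length check matters because utf8mb4 reserves up to 4 bytes per character, whereas the older utf8 (utf8mb3) reserves 3. The arithmetic can be sketched as follows, assuming InnoDB index key prefix limits of 767 bytes (COMPACT or REDUNDANT row format) and 3072 bytes (DYNAMIC or COMPRESSED row format); the function name is hypothetical.

%%(hl php)
<?php

// Hypothetical helper: how many characters fit into an index key
// when every character may need up to $bytesPerChar bytes.
function maxIndexableChars(int $keyLimitBytes, int $bytesPerChar = 4): int
{
    return intdiv($keyLimitBytes, $bytesPerChar);
}

echo maxIndexableChars(767, 3), "\n"; // 255 (utf8 / utf8mb3)
echo maxIndexableChars(767), "\n";    // 191 (utf8mb4)
echo maxIndexableChars(3072), "\n";   // 768 (utf8mb4, larger limit)
%%

Under these assumptions, an indexed VARCHAR(255) column that fits within 767 bytes in utf8 would have to be shortened to VARCHAR(191), or indexed on a prefix, once it is converted to utf8mb4.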

===Configuration===

====HTTP and HTML====
In php.ini [1]:
%%default_charset = UTF-8%%
or in httpd.conf or .htaccess [5]:
%%AddDefaultCharset UTF-8%%
or in the PHP code [5]:
%%header('Content-type: text/html; charset=UTF-8');%%
Additionally, put this in your HTML <head> block:
%%<meta charset="utf-8">%%

====PHP====
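In php.ini [1.1] (##mbstring.internal_encoding##, ##mbstring.http_input## and ##mbstring.http_output## have been deprecated since PHP 5.6; when they are left unset, ##default_charset## is used instead):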
%%
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On
mbstring.http_input = auto
mbstring.http_output = UTF-8
mbstring.detect_order = auto
mbstring.substitute_character = none
%%
or in httpd.conf or .htaccess:
%%php_value <php.ini directive> <value>%%
or in the PHP code [[^ Some php.ini directives cannot be modified in the PHP code.]] [1]:
%%
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_detect_order('auto');
mb_substitute_character('none');
%%
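
The effect of these calls can be checked with a short script. This is a minimal sketch: it uses the common idiom of converting from UTF-8 to UTF-8 to re-validate a string, and the sample string is made up for illustration.

%%(hl php)
<?php

mb_internal_encoding('UTF-8');
mb_substitute_character('none');

// "\xFF" can never appear in valid UTF-8. Converting from UTF-8 to UTF-8
// re-validates the string; with the substitute character set to 'none',
// the invalid byte is dropped instead of being replaced by "?".
$clean = mb_convert_encoding("abc\xFFdef", 'UTF-8', 'UTF-8');

var_dump(mb_internal_encoding()); // string(5) "UTF-8"
var_dump($clean);                 // string(6) "abcdef"
%%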

====Verifications [[^ Source: utf8.php in PHP UTF-8 library [3] ]]====
Run this small PHP script:
%%(hl php)
<?php

if ( ! extension_loaded('mbstring'))
{
	die('mb functions not loaded');
}

if (1 != preg_match('/^.{1}$/u', "ñ", $UTF8_ar))
{
	die('PCRE is not compiled with UTF-8 support');
}

exit('ok');
%%

===MySQL Code===

====MySQL====
Right after each connection, call [[^ Once this is done, PHP sees MySQL databases as if each TEXT, CHAR or VARCHAR field were encoded in UTF-8, **no matter what the actual encoding is**. Thus, there is no need to prepare the encoding of query parameters or to convert the results in PHP.]] [2]:
%%(hl sql)
SET NAMES 'utf8mb4';
%%
%%(hl sql)
SET collation_connection = 'utf8mb4_unicode_520_ci';
%%
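
With PDO, the connection character set can also be requested through the ##charset## parameter of the DSN instead of sending ##SET NAMES## by hand. A minimal sketch, assuming the pdo_mysql driver; the helper name, host, database name and credentials are hypothetical.

%%(hl php)
<?php

// Hypothetical helper: builds a PDO DSN that asks for a utf8mb4 connection.
function buildUtf8Dsn(string $host, string $database): string
{
    return 'mysql:host=' . $host . ';dbname=' . $database . ';charset=utf8mb4';
}

$dsn = buildUtf8Dsn('localhost', 'example_db');
echo $dsn, "\n"; // mysql:host=localhost;dbname=example_db;charset=utf8mb4

// Usage (needs pdo_mysql and a reachable server, so it is left commented out):
// $pdo = new PDO($dsn, 'example_user', 'example_password');
// $pdo->exec("SET collation_connection = 'utf8mb4_unicode_520_ci'");
%%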

====Ordering in MySQL====
Ordering in MySQL or MariaDB depends on the collation you choose. Detailed information about this subject may be found in the documentation on MySQL.com [2]. Look especially at the ((https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html Unicode Character Sets)) section.

Examples:
  * ##utf8mb4_unicode_520_ci## is based on UCA 5.2.0 weight keys (((http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt UCA 5.2.0))),
  * ##utf8mb4_0900_ai_ci## is based on UCA 9.0.0 weight keys (((http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt UCA 9.0.0))). 

====Stored Procedures and Functions====
Old function:
%%(hl sql)
CREATE FUNCTION example_function (
  parameter_name VARCHAR(255)
)
RETURNS VARCHAR(255)
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255);
  ...
  RETURN data;
END;
%%
New function:
%%(hl sql)
CREATE FUNCTION example_function (
  parameter_name VARCHAR(255)
  CHARACTER SET utf8mb4
)
RETURNS VARCHAR(255)
  CHARACTER SET utf8mb4
  READS SQL DATA
BEGIN
  DECLARE data VARCHAR(255)
  CHARACTER SET utf8mb4;
  ...
  RETURN data; 
END;
%%

===PHP Code===
====Multibyte string functions [1]====
#|
*| Replace | With | Notes |*
|| ##lcfirst()##
##ucfirst()## | ##mb_lcfirst()##
##mb_ucfirst()## | PHP 8.4 ||
|| ##ord()## [[^ Underlined parameters fail if they are UTF-8-encoded]] | ##mb_ord()## |  ||
|| ##str_pad()## | ##mb_str_pad()## | PHP 8.3 ||
|| ##str_split()## | ##mb_str_split()## |  ||
|| ##strlen()## | ##mb_strlen()##
##strlen()##
##mb_strwidth()## | Number of characters
Number of bytes
Display width (for monospaced fonts) ||
|| ##substr()## | ##mb_substr()## | ||
|| ##strstr()##
##stristr()## | ##mb_strstr()##
##mb_stristr()## | ||
|| ##strrchr()## | ##mb_strrchr()## [[^ Note that the ##mb_strrchr()## function has additional optional arguments, which may be ignored since we just want to adapt existing function calls. Note also that there is a ##mb_strrichr()## function, which has no equivalent in standard PHP functions.]] | ||
|| ##strpos()##
##stripos()##
##strrpos()##
##strripos()## | ##mb_strpos()##
##mb_stripos()##
##mb_strrpos()##
##mb_strripos()## | ||
|| ##strtolower()##
##strtoupper()## | ##mb_strtolower()##
##mb_strtoupper()## | ||
|| ##substr_count()## | ##mb_substr_count()## [[^ Be careful because the 3rd and 4th arguments of ##substr_count()## no longer exist with ##mb_substr_count()##. You can use ##mb_substr()## to circumvent this limitation.]] | ||
|| ##trim()##
##ltrim()##
##rtrim()## | ##mb_trim()##
##mb_ltrim()##
##mb_rtrim()## | PHP 8.4 ||
|#
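
The difference between the byte-oriented and the multibyte functions can be seen with a short script. This is a minimal sketch: the sample strings are arbitrary, and ##mbSubstrCountFrom()## is a hypothetical helper showing the ##mb_substr()## workaround for the missing offset and length arguments of ##mb_substr_count()##.

%%(hl php)
<?php

mb_internal_encoding('UTF-8');

$word = 'ñandú'; // 5 characters, 7 bytes in UTF-8

echo strlen($word), "\n";            // 7 (bytes)
echo mb_strlen($word), "\n";         // 5 (characters)
echo mb_strwidth('日本語'), "\n";     // 6 (each full-width character counts as 2)

echo mb_substr($word, 0, 2), "\n";   // ña
echo mb_strtoupper($word), "\n";     // ÑANDÚ
echo strpos($word, 'd'), "\n";       // 4 (byte offset)
echo mb_strpos($word, 'd'), "\n";    // 3 (character offset)

// Hypothetical helper: emulates the offset and length arguments
// of substr_count() by cutting the haystack with mb_substr() first.
function mbSubstrCountFrom(string $haystack, string $needle, int $offset, ?int $length = null): int
{
    return mb_substr_count(mb_substr($haystack, $offset, $length), $needle);
}

echo mbSubstrCountFrom('ñañaña', 'ña', 2), "\n"; // 2
%%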

====String access by character [1]====
Search for all uses of curly or square brackets that extract single characters from strings:
%%
$string{$position} // old syntax, removed in PHP 8.0
$string[$position] // new syntax
%%
Regular expressions to find them:
%%
/\$\w(\w|\d)*\{(\d+|\$\w(\w|\d)*)\}/
/\$\w(\w|\d)*\[(\d+|\$\w(\w|\d)*)\]/
%%
Replace
%%
$char = $string[$pos];
%%
with
%%
$char = mb_substr($string, $pos, 1);
%%
Replace
%%
$string[$pos] = $char;
%%
with
%%
$string = mb_substr($string, 0, $pos) . $char . mb_substr($string, $pos + 1);
%%
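
Both replacements can be wrapped in small helpers. This is a minimal sketch; the names ##mbCharAt()## and ##mbSetCharAt()## are hypothetical.

%%(hl php)
<?php

// Hypothetical helper: character-based replacement for $string[$pos].
function mbCharAt(string $string, int $pos): string
{
    return mb_substr($string, $pos, 1, 'UTF-8');
}

// Hypothetical helper: character-based replacement for $string[$pos] = $char.
function mbSetCharAt(string $string, int $pos, string $char): string
{
    return mb_substr($string, 0, $pos, 'UTF-8')
        . $char
        . mb_substr($string, $pos + 1, null, 'UTF-8');
}

$word = 'ñandú';

echo bin2hex($word[0]), "\n";          // c3 (only the first byte of "ñ")
echo mbCharAt($word, 0), "\n";         // ñ
echo mbSetCharAt($word, 4, 'u'), "\n"; // ñandu
%%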

====UTF-8-safe functions [[^ The ##strcmp()## function is UTF-8 safe as well. However, to perform a locale-aware comparison, use Collator::compare instead: https://www.php.net/manual/en/collator.compare.php]]====

#|
|| ##addslashes()## ||
|| ##bin2hex()## ||
|| ##explode()## [4] ||
|| ##implode()## ||
|| ##nl2br()## ||
|| ##stripslashes()## ||
|| ##strip_tags()## ||
|| ##str_repeat()## ||
|| ##str_replace()## [4] ||
|#

====Escaping functions====
The functions ##htmlentities()## [[^ The function ##htmlentities()## converts Latin characters only [todo source]. Moreover, according to Handling UTF-8 with PHP [4]: “Using [htmlentities] on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output”, and “when using UTF-8, you don’t need entities”.]] and ##htmlspecialchars()## both have a third parameter, which specifies the character set used during conversion. Unlike with the multibyte functions (##mb_*()##), this **3rd parameter is mandatory** if the character set is not ##'UTF-8'##, no matter what the internal encoding is!

The functions ##urlencode()## and ##rawurlencode()## do not have any character encoding parameter. The safest solution is to put your UTF-8 strings in session variables instead of URL arguments.
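
The behaviour of these functions on UTF-8 input can be sketched as follows. The sample strings are arbitrary, and passing the character set explicitly is a stylistic choice here, not a requirement stated by the sources.

%%(hl php)
<?php

$text = '<b>ñ & "ú"</b>';

// htmlspecialchars() only escapes the HTML special characters
// and leaves the UTF-8 letters untouched.
$escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
echo $escaped, "\n"; // &lt;b&gt;ñ &amp; &quot;ú&quot;&lt;/b&gt;

// htmlentities() additionally turns "ñ" into a named entity.
echo htmlentities('ñ', ENT_QUOTES, 'UTF-8'), "\n"; // &ntilde;

// rawurlencode() works byte by byte: each byte of a multibyte
// character becomes one percent-encoded triplet.
echo rawurlencode('ñandú'), "\n"; // %C3%B1and%C3%BA
%%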

====Comparing strings and sorting arrays====
Use the Collator class:
https://www.php.net/manual/en/class.collator.php
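
A minimal sketch of the difference between a byte-wise sort and a Collator sort, assuming the intl extension is installed (the script skips the Collator part otherwise); the locale ##en_US## and the sample words are arbitrary choices.

%%(hl php)
<?php

$words = ['b', 'á', 'a'];

// Byte-wise sort: "á" (bytes C3 A1) comes after every ASCII letter.
$byteSorted = $words;
sort($byteSorted); // ['a', 'b', 'á']

// Locale-aware sort: "á" is placed right after "a".
$localeSorted = null;
if (class_exists('Collator')) {
    $collator = new Collator('en_US');
    $localeSorted = $words;
    $collator->sort($localeSorted);             // ['a', 'á', 'b']
    $comparison = $collator->compare('a', 'á'); // -1
}

echo implode(' ', $byteSorted), "\n";
if ($localeSorted !== null) {
    echo implode(' ', $localeSorted), "\n";
}
%%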

====SimpleXML====
SimpleXML uses UTF-8 internally and converts all XML content to UTF-8 [1.2], so usually nothing needs to be done.

====PCRE functions [1]====
Search for all PCRE function calls (##preg_*()##) and append the ##/u## pattern modifier [[^ However, there may still be problems, as explained in Handling UTF-8 with PHP [4].]].
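
A minimal sketch of what the ##/u## modifier changes, using arbitrary sample strings. It also shows one remaining pitfall: with ##/u##, the subject must be valid UTF-8, otherwise the call fails.

%%(hl php)
<?php

// Without /u, "." matches a single byte, so the two-byte "ñ" does not match.
$withoutU = preg_match('/^.$/', 'ñ');  // 0
// With /u, "." matches a whole UTF-8 character.
$withU = preg_match('/^.$/u', 'ñ');    // 1

// Splitting a string into characters.
$chars = preg_split('//u', 'ñandú', -1, PREG_SPLIT_NO_EMPTY);

// With /u, an invalid UTF-8 subject makes the call fail (false is returned).
$invalid = @preg_match('/./u', "\xFF");

var_dump($withoutU, $withU, count($chars), $invalid);
%%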

====Storable representation of variables====
The ##serialize()## and ##unserialize()## functions can be used transparently. However, be careful when reading or writing serialized UTF-8 strings from languages other than PHP [4].
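
One reason for caution is visible in the format itself: the length prefix of a serialized string counts bytes, not characters. A minimal sketch with an arbitrary sample string:

%%(hl php)
<?php

$serialized = serialize('ñ');

// "ñ" is one character but two bytes, hence the length prefix 2.
echo $serialized, "\n"; // s:2:"ñ";

// The round trip inside PHP is transparent.
var_dump(unserialize($serialized) === 'ñ'); // bool(true)
%%

A reader written in a language whose strings are counted in characters would therefore need to interpret the prefix as a byte length.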

----

====String functions that are problematic and for which there is no built-in replacement function====
#|
*| Replace | With a function from the PHP UTF8 Library [3] | Comment |*
|| ##count_chars($string, $mode)## | |This function doesn't work if the first parameter is UTF-8. Write your own implementation. ||
|| ##sprintf()## | |The x and X type specifiers could be an issue, according to [4]. ||
|| ##str_ireplace($search, $replace, $subject [, &$count])## | ##utf8_ireplace($search, $replace, $subject [, &$count])## |Alternatively, write your own implementation using ##preg_replace()##. ||
|| ##str_pad($input, $length, $padStr, $type)## | ##utf8_str_pad($input, $length, $padStr, $type)## | ##mb_str_pad()## is available as of PHP 8.3 ||
|| ##strcasecmp($str1, $str2)## | | Write your own implementation using ##collator_compare()## and ##mb_strtolower()## ||
|| ##strncmp($str1, $str2, $len)## | | Cut the two strings at the specified length, and use ##collator_compare()## ||
|| ##strncasecmp($str1, $str2, $len)## | | Write your own implementation using your replacement of ##strncmp()## and ##mb_strtolower()## ||
|| ##strspn($str1, $str2[, $start[, $len]])##
##strcspn($str1, $str2[, $start[, $len]])## | ##utf8_strspn($str1, $str2[, $start[, $len]])##
##utf8_strcspn($str1, $str2[, $start[, $len]])## | ||
|| ##strrev($string)## | ##utf8_strrev($string)## | ||
|| ##strtr()## | |This function doesn't work if any parameter is UTF-8. Write your own implementation. ||
|| ##substr_replace()## | ##utf8_substr_replace()## | ||
|| ##trim($str, $charlist)##
##ltrim($str, $charlist)##
##rtrim($str, $charlist)## | ##utf8_trim($str, $charlist)##
##utf8_ltrim($str, $charlist)##
##utf8_rtrim($str, $charlist)## | The original functions ##trim()##, ##ltrim()## and ##rtrim()## are UTF-8-safe as long as the 2nd parameter is not used [4].
##mb_trim()##, ##mb_ltrim()## and ##mb_rtrim()## are available as of PHP 8.4 ||
|| ##lcfirst($str)##
##ucfirst($str)##
##ucwords($str)## | 
##utf8_ucfirst($str)##
##utf8_ucwords($str)## | ##mb_lcfirst()##
##mb_ucfirst()##
available with PHP 8.4 ||
|| ##wordwrap()## | |Write your own implementation ||
|#
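
Where the table says "write your own implementation", the multibyte and PCRE functions are usually enough. The following is only a sketch under stated assumptions: all three helper names are hypothetical (they are neither part of PHP nor of the PHP UTF-8 library [3]), ##mb_str_split()## requires PHP 7.4 or later, and the helpers work on code points, so combining character sequences are not kept together.

%%(hl php)
<?php

// Hypothetical strrev() replacement: reverses by character, not by byte.
function mbStrrev(string $string): string
{
    return implode('', array_reverse(mb_str_split($string, 1, 'UTF-8')));
}

// Hypothetical count_chars() replacement (similar to mode 1):
// number of occurrences of each character present in the string.
function mbCountChars(string $string): array
{
    return array_count_values(mb_str_split($string, 1, 'UTF-8'));
}

// Hypothetical str_ireplace() replacement built on the /iu modifiers.
// A callback is used so that "$1" or "\" in $replace is taken literally.
function mbStrIreplace(string $search, string $replace, string $subject): string
{
    $pattern = '/' . preg_quote($search, '/') . '/iu';
    $result = preg_replace_callback($pattern, fn () => $replace, $subject);

    // preg_replace_callback() returns null on failure (e.g. invalid UTF-8).
    return $result ?? $subject;
}

echo mbStrrev('ñandú'), "\n";                     // údnañ
echo mbStrIreplace('Ñ', 'n', 'ñandú Ñu'), "\n";   // nandú nu
print_r(mbCountChars('ñañ'));                     // ñ => 2, a => 1
%%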

====Inline UTF-8 literal syntax====
The Unicode codepoint escape syntax ##\u{...}## (pattern: ##\u{[0-9A-Fa-f]+}##) is available since PHP 7.0.

It takes a Unicode codepoint in hexadecimal form and outputs that codepoint in UTF-8 in a double-quoted string or a heredoc.

https://wiki.php.net/rfc/unicode_escape
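
A minimal sketch of the escape syntax, with arbitrarily chosen codepoints:

%%(hl php)
<?php

$enye = "\u{F1}";     // U+00F1, "ñ"
$emoji = "\u{1F600}"; // U+1F600, outside the Basic Multilingual Plane

echo bin2hex($enye), "\n";    // c3b1 (two bytes in UTF-8)
echo strlen($emoji), "\n";    // 4 (bytes)
echo mb_strlen($emoji, 'UTF-8'), "\n"; // 1 (character)

// In single-quoted strings the escape is not interpreted.
echo strlen('\u{F1}'), "\n";  // 6
%%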

===Sources===
[1] PHP.net documentation
[1.1] PHP.net, Multibyte String Runtime Configuration, https://www.php.net/manual/en/mbstring.configuration.php 
[1.2] A comment about SimpleXML on PHP.net: https://www.php.net/manual/en/ref.simplexml.php#79258 
[2] MySQL.com documentation
[3] Harry Fuecks, PHP UTF-8 library, https://sourceforge.net/projects/phputf8 
[4] Web Application Component Toolkit, Handling UTF-8 with PHP, ((https://web.archive.org/web/20070319193643/http://www.phpwact.org/php/i18n/utf-8 http://www.phpwact.org/php/i18n/utf-8))
[5] W3C, Setting the HTTP charset parameter, https://www.w3.org/International/O-HTTP-charset.php