Migrating to Unicode

̲ᴛ̲ʜ̲ᴇ̲ʀ̲ᴇ̲ ̲ɪ̲s̲ ̲ɴ̲ᴏ̲ ̲U̲ɴ̲ɪ̲ᴄ̲ᴏ̲ᴅ̲ᴇ̲ ̲ᴍ̲ᴀ̲ɢ̲ɪ̲ᴄ̲ ̲ʙ̲ᴜ̲ʟ̲ʟ̲ᴇ̲ᴛ̲ ̲

💩 𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤 💩

😈 ¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁ 😈

1. Resources

  1. The original paper from Bell Labs on UTF-8
  2. A tutorial on character code issues – the MUST READ
  3. The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
  4. Character Sets / Character Encoding Issues
  5. http://www.phpwact.org/php/i18n/utf-8
  6. Migrating to Unicode
    1. http://www.inter-locale.com/wh[...]n/learn-to-test.html
  7. http://en.wikipedia.org/wiki/UTF-8
  8. Unicode Cheat Sheet
  9. http://blog.loftdigital.com/blog/php-utf-8-cheatsheet
  10. http://htmlpurifier.org/docs/enduser-utf8.html
  11. Handling Unicode Front to Back in a Web App
  12. What every programmer absolutely, positively needs to know about encodings and character sets to work with text
  13. http://de.php.net/mbstring
  14. http://www.php.net/manual/en/ref.iconv.php
  15. https://www.sitepoint.com/brin[...]-with-portable-utf8/
  16. https://wiki.php.net/rfc/unicode_escape
  17. https://www.utf8-chartable.de/
  18. Unicode Regular Expressions
  19. http://www.utf8everywhere.org/
  21. Unicode Security Considerations
  22. Why does modern Perl avoid UTF-8 by default?

Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to “upgrade” to your spiffy new Brave New World modernity.
It is way way way more complicated than people pretend. I’ve thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you’ll break either your own code or somebody else’s. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.

Assume Brokenness

And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their code will be broken.



  1. PCRE UTF-8 support: WackoWiki will not run if your PHP installation is not compiled with UTF-8 support in the PCRE extension.
  2. Third-party libraries used by WackoWiki also require UTF-8 support.
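
The PCRE requirement above can be checked at runtime. The following is a minimal sketch, not WackoWiki's actual installer check; `pcre_supports_utf8()` is a hypothetical helper name. It assumes that compiling a pattern with the `/u` modifier and a Unicode property escape fails when PCRE was built without UTF-8 support.

```php
<?php
// Hypothetical helper: true when PCRE can compile and run a pattern that uses
// the /u (UTF-8) modifier and a Unicode property escape such as \p{L}.
function pcre_supports_utf8(): bool
{
    // "\xC3\xB1" is the UTF-8 encoding of U+00F1 (n with tilde).
    // preg_match() returns false (with a warning, silenced here) when the
    // pattern cannot be compiled, and 1 when the subject matches.
    return @preg_match('/^\p{L}$/u', "\xC3\xB1") === 1;
}

if (!pcre_supports_utf8()) {
    echo "The PCRE extension appears to lack UTF-8 support.\n";
}
```

Using escaped bytes instead of a literal character keeps the check independent of the encoding the script file itself is saved in.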

2. Unicode normalization

  1. Unicode Normalization Forms
  2. http://www.w3.org/TR/charmod-norm/
  3. Unicode Test Installation
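
To illustrate why normalization matters: the same visible character can be stored as different code point sequences, which then compare as unequal. The sketch below assumes the optional intl extension (the `Normalizer` class) is available and falls back to returning the input unchanged when it is not; `to_nfc()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: normalize to NFC when the intl extension is loaded,
// otherwise return the input unchanged.
function to_nfc(string $text): string
{
    if (!class_exists('Normalizer')) {
        return $text; // intl is not installed
    }
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
    // normalize() returns false on invalid input; keep the original then.
    return $normalized === false ? $text : $normalized;
}

$decomposed  = "e\u{0301}"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "\u{00E9}";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

// Both render as the same character but are different byte sequences.
var_dump($decomposed === $precomposed);

// After normalization to NFC they compare equal (when intl is available).
var_dump(to_nfc($decomposed) === $precomposed);
```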

3. Steps

  1. Convert to UTF-8 without BOM.
    • Make sure that every included/required file is either ASCII or UTF-8 without BOM, as PHP does not handle non-ASCII files well: a BOM in front of the opening PHP tag is sent to the browser as output.
    • Tools: Notepad++
    • Byte Order Mark: byte sequence EF BB BF -> ISO-8859-1: ï»¿
      • Byte-Order Mark found in UTF-8 File.
        • The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
    • iconv
      •  find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o {}.new \;
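
One caveat with the conversion step above: running an ISO-8859-1 to UTF-8 conversion twice over the same data double-encodes it. A minimal sketch of a guard, assuming the mbstring extension is loaded; `latin1_to_utf8_once()` is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8, but leave the input
// alone when it already is valid UTF-8, so a second run does not double-encode.
function latin1_to_utf8_once(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // pure ASCII or already UTF-8
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";                    // "café" in ISO-8859-1 (4 bytes)
$utf8   = latin1_to_utf8_once($latin1); // "café" in UTF-8 (5 bytes)
$again  = latin1_to_utf8_once($utf8);   // unchanged on the second pass
```

The check is a heuristic: a few ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so converted files should still be reviewed.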

If you read in text files to insert into the middle of another page, it is strongly advised (but not strictly necessary) that you strip the UTF-8 byte sequence for the BOM, "\xEF\xBB\xBF", before inserting them, via:

$text = str_replace("\xEF\xBB\xBF", '', $text);
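
The `str_replace()` call removes the sequence everywhere in the text. If only a leading BOM should go, a variant along these lines may be preferable; this is a sketch, and `strip_leading_bom()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: remove a UTF-8 BOM only when it is the very first
// thing in the string; occurrences further inside the text are kept.
function strip_leading_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

$with_bom    = "\xEF\xBB\xBF<p>Hello</p>";
$without_bom = strip_leading_bom($with_bom);
```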

  1. add to 2nd line in index.php


// Multibyte-safe strtr(): maps each character of $from to the character at the
// same position in $to.
function mb_strtr($str, $from, $to, $chars = 'undefined')
{
    if ($chars == 'undefined') {
        $chars = mb_internal_encoding();
    }
    $_str = '';
    $len = mb_strlen($str, $chars);
    for ($i = 0; $i < $len; $i++) {
        $flag = false;
        for ($q = 0, $sf = mb_strlen($from, $chars), $st = mb_strlen($to, $chars); $q < $sf && $q < $st; $q++) {
            if (mb_substr($str, $i, 1, $chars) == mb_substr($from, $q, 1, $chars)) {
                $_str = $_str . mb_substr($to, $q, 1, $chars);
                $flag = true;
                break;
            }
        }
        // no mapping found: copy the character unchanged
        if (!$flag) {
            $_str = $_str . mb_substr($str, $i, 1, $chars);
        }
    }
    return $_str;
}

// Multibyte-aware str_replace(); needs mbstring with regex support (mb_split).
function mb_replace($search, $replace, $subject, &$count = 0)
{
    $count = 0;
    if (!is_array($search) && is_array($replace)) {
        return false;
    }
    if (is_array($subject)) {
        // call mb_replace for each single string in $subject
        foreach ($subject as &$string) {
            $string = mb_replace($search, $replace, $string, $c);
            $count += $c;
        }
        unset($string);
    } else if (is_array($search)) {
        if (!is_array($replace)) {
            foreach ($search as $string) {
                $subject = mb_replace($string, $replace, $subject, $c);
                $count += $c;
            }
        } else {
            // pair each search term with its replacement; a missing replacement becomes ''
            $replace = array_values($replace);
            foreach (array_values($search) as $n => $string) {
                $subject = mb_replace($string, $replace[$n] ?? '', $subject, $c);
                $count += $c;
            }
        }
    } else {
        $parts = mb_split(preg_quote($search), $subject);
        $count = count($parts) - 1;
        $subject = implode($replace, $parts);
    }
    return $subject;
}
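
As a side note to the helpers above: for UTF-8 specifically, the array form of the built-in `strtr()` replaces whole byte sequences and therefore keeps valid UTF-8 valid, while the three-argument form works byte by byte. The sketch below only demonstrates that difference; the sample string is made up for illustration.

```php
<?php
$name = "M\u{00FC}ller"; // "Müller" in UTF-8; the u-umlaut takes two bytes

// Array form: replaces complete byte sequences, safe for UTF-8 input.
$ok = strtr($name, ["\u{00FC}" => 'ue']);

// Three-argument form: translates single bytes. The two-byte $from is cut
// to the length of $to, so only the first byte of the umlaut is replaced
// and the result is no longer valid UTF-8.
$broken = strtr($name, "\u{00FC}", 'u');

var_dump($ok);
var_dump(mb_check_encoding($broken, 'UTF-8'));
```

A character-by-character helper such as `mb_strtr()` is still needed when the mapping is given as two parallel strings rather than as an array of pairs.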

  1. search code base for non-UTF8 compatible functions and replace them
  2. replace SafeHtml with http://htmlpurifier.org/
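    • require_once '/path/to/library/HTMLPurifier.auto.php';
      // or, when using the standalone distribution instead:
      // require_once '/path/to/HTMLPurifier.standalone.php';
      $config = HTMLPurifier_Config::createDefault();
      $purifier = new HTMLPurifier($config);
      $clean_html = $purifier->purify($dirty_html);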
  3. write a conversion script that runs through the database and re-encodes everything as UTF-8
  4. check our cloned branch (more soon):
  5. String access by character
  6. remove unneeded functions notes
    1. htmlentities
    2. html_entity_decode
  7. <form accept-charset="utf-8">
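
To make the "non-UTF8 compatible functions" and "string access by character" steps more concrete, here is a small sketch of how byte-oriented functions differ from their multibyte counterparts. It assumes the mbstring extension; `utf8_chars()` is a hypothetical helper name and the sample word is made up.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars(string $text): array
{
    $parts = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts; // false means invalid UTF-8
}

$word = "na\u{00EF}ve"; // "naive" with i-diaeresis: 5 characters, 6 bytes

$bytes      = strlen($word);                   // counts bytes
$chars      = mb_strlen($word, 'UTF-8');       // counts characters
$third_byte = $word[2];                        // first byte of the 2-byte character only
$third_char = mb_substr($word, 2, 1, 'UTF-8'); // the complete character
$list       = utf8_chars($word);               // one array element per character

// With UTF-8 output there is no need to turn non-ASCII characters into
// entities; htmlspecialchars() escapes only the HTML-significant characters.
$escaped = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');
```

Array-style access such as `$word[2]` is the pattern to search for in the code base, since it silently returns a single byte.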

3.1. MySQL – Migrating database data that is already encoded in latin1 to UTF-8

If you have an existing MySQL database that is already encoded in latin1, here’s how to convert it from latin1 to UTF-8:

-> /Dev/Release/R7/utf8mb4
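
As a rough illustration only (the linked page holds the actual procedure), the sketch below builds the MySQL statements commonly used for this kind of conversion. It assumes the columns are declared as latin1 and really contain latin1 data; `build_utf8mb4_statements()` and the database and table names are hypothetical, and the collation argument is not validated.

```php
<?php
// Hypothetical helper: build ALTER statements that switch a database default
// and convert the given tables to utf8mb4. Nothing is executed here.
function build_utf8mb4_statements(string $database, array $tables, string $collation = 'utf8mb4_unicode_ci'): array
{
    // Quote an identifier with backticks, doubling any embedded backtick.
    $quote = function (string $identifier): string {
        return '`' . str_replace('`', '``', $identifier) . '`';
    };

    $statements = [
        sprintf('ALTER DATABASE %s CHARACTER SET utf8mb4 COLLATE %s', $quote($database), $collation),
    ];
    foreach ($tables as $table) {
        $statements[] = sprintf(
            'ALTER TABLE %s CONVERT TO CHARACTER SET utf8mb4 COLLATE %s',
            $quote($table),
            $collation
        );
    }
    return $statements;
}

// Example with made-up names; review the output and back up before running it.
foreach (build_utf8mb4_statements('wacko', ['wacko_page', 'wacko_user']) as $sql) {
    echo $sql, ";\n";
}
```

If the tables hold UTF-8 bytes inside columns declared as latin1, `CONVERT TO` would double-encode the data, so that case needs a different approach.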