Migrating to Unicode

Resources

  1. The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
  2. Character Sets / Character Encoding Issues
  3. http://www.phpwact.org/php/i18n/utf-8
  4. Migrating to Unicode
    1. http://www.inter-locale.com/wh[..]n/learn-to-test.html
  5. http://en.wikipedia.org/wiki/UTF-8
  6. Unicode Cheat Sheet
  7. http://blog.loftdigital.com/blog/php-utf-8-cheatsheet
  8. http://htmlpurifier.org/docs/enduser-utf8.html
  9. Handling Unicode Front to Back in a Web App
  10. What every programmer absolutely, positively needs to know about encodings and character sets to work with text
  11. http://de.php.net/mbstring
  12. http://www.php.net/manual/en/ref.iconv.php
  13. https://www.sitepoint.com/brin[..]-with-portable-utf8/
  14. https://wiki.php.net/rfc/unicode_escape
  15. https://wiki.php.net/ideas/php6
  16. http://www.utf8everywhere.org/
  17. https://stackoverflow.com/ques[..]ault/6163129#6163129

Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to “upgrade” to your spiffy new Brave New World modernity.

It is way way way more complicated than people pretend. I’ve thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you’ll break either your own code or somebody else’s. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.

[..]
Assume Brokenness


And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their code will be broken.

Disadvantages


Requirements

  1. PCRE UTF-8 support: the code will not run unless the PCRE extension of your PHP installation was compiled with UTF-8 support.
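
One way to check this requirement at run time is to try a pattern with the u modifier. This is only a minimal sketch; the function name pcre_has_utf8_support is hypothetical and not part of the code base:

```php
<?php
// Hypothetical helper: reports whether PCRE was built with UTF-8 support.
function pcre_has_utf8_support(): bool
{
    // "\xC3\xB1" is "ñ" encoded as UTF-8 (two bytes). With the u modifier the
    // dot matches it as a single character; without UTF-8 support preg_match()
    // cannot compile the pattern and returns false (the warning is silenced).
    return @preg_match('/^.$/u', "\xC3\xB1") === 1;
}

if (!pcre_has_utf8_support())
{
    exit('PCRE was compiled without UTF-8 support.');
}
```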

Unicode normalization

  1. Unicode Normalization Forms
  2. http://www.w3.org/TR/charmod-norm/
  3. Unicode Test Installation

Steps

  1. Convert to UTF-8 without BOM.
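    • Make sure that every included/required file is either plain ASCII or UTF-8 without BOM, as PHP does not handle non-ASCII files very well (a BOM in front of <?php, for example, is sent to the browser as output).
    • Tools: Notepad++
    • Byte order mark: byte sequence EF BB BF, which shows up as ï»¿ when read as ISO-8859-1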
      • Byte-Order Mark found in UTF-8 File.
        • The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.

If you read in text files to insert them into the middle of another page, it is strongly advised (but not strictly necessary) to strip the UTF-8 byte sequence for the BOM, "\xEF\xBB\xBF", from the text before inserting it:

<?php
$text = str_replace("\xEF\xBB\xBF", '', $text);
?>
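
Since the byte sequence only acts as a byte order mark at the very start of a text, a slightly stricter variant removes it there and nowhere else. A minimal sketch; the function name strip_utf8_bom is hypothetical:

```php
<?php
// Hypothetical helper: removes a UTF-8 BOM only if $text starts with one.
function strip_utf8_bom(string $text): string
{
    // compare the first three bytes with EF BB BF
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0)
    {
        return substr($text, 3);
    }

    return $text;
}
```

It could be applied right after reading a file, for example $text = strip_utf8_bom(file_get_contents($file)); where $file stands for whatever path is being included.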


  1. Add to the second line of index.php:

require('lib/mb_extends/mb_extends.php');


lib/mb_extends/mb_extends.php

<?php

// Multibyte-aware counterpart of strtr($str, $from, $to): every character of
// $str that occurs in $from is replaced by the character at the same position in $to.
function mb_strtr($str, $from, $to, $encoding = null)
{
    // fall back to the internal encoding when none is given
    if ($encoding === null)
    {
        $encoding = mb_internal_encoding();
    }

    $_str = '';
    $len  = mb_strlen($str, $encoding);
    // surplus characters of the longer list are ignored
    $n    = min(mb_strlen($from, $encoding), mb_strlen($to, $encoding));

    for ($i = 0; $i < $len; $i++)
    {
        $char = mb_substr($str, $i, 1, $encoding);

        for ($q = 0; $q < $n; $q++)
        {
            if ($char === mb_substr($from, $q, 1, $encoding))
            {
                $char = mb_substr($to, $q, 1, $encoding);
                break;
            }
        }

        $_str .= $char;
    }

    return $_str;
}

// Multibyte-aware counterpart of str_replace(); $count receives the number of replacements.
function mb_replace($search, $replace, $subject, &$count = 0)
{
    $count = 0;

    if (!is_array($search) && is_array($replace))
    {
        return false;
    }

    if (is_array($subject))
    {
        // call mb_replace for each single string in $subject
        foreach ($subject as &$string)
        {
            $string = mb_replace($search, $replace, $string, $c);
            $count += $c;
        }

        unset($string); // break the reference to the last element
    }
    else if (is_array($search))
    {
        // pair needles and replacements by position, as str_replace() does
        $replacements = is_array($replace) ? array_values($replace) : null;

        foreach (array_values($search) as $i => $needle)
        {
            // a string $replace applies to every needle; a missing array entry becomes ''
            $with    = ($replacements === null) ? $replace : ($replacements[$i] ?? '');
            $subject = mb_replace($needle, $with, $subject, $c);
            $count  += $c;
        }
    }
    else if ((string) $search !== '')
    {
        // mb_split() expects a regular expression, so quote the needle first
        $parts   = mb_split(preg_quote($search), $subject);
        $count   = count($parts) - 1;
        $subject = implode($replace, $parts);
    }

    return $subject;
}

?>
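
For context on why a multibyte-aware strtr is needed at all: the three-argument form of PHP's strtr() maps single bytes, so it can split a UTF-8 character. The sketch below illustrates this, and also shows that the array form of strtr(), which replaces whole substrings, leaves UTF-8 intact and may be sufficient in simple cases. The helper is_valid_utf8 and the sample word are assumptions made for illustration:

```php
<?php
// Hypothetical helper: a string is valid UTF-8 if an empty pattern with the
// u modifier can be matched against it.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$word = "Gr\xC3\xB6\xC3\x9Fe"; // "Größe" written as UTF-8 bytes

// Byte-wise form: "ö" is two bytes but "o" is one, so only the first byte
// of "ö" is mapped and the result is no longer valid UTF-8.
$broken = strtr($word, "\xC3\xB6", 'o');

// Array form: whole substrings are replaced, so multibyte characters survive.
$ok = strtr($word, ["\xC3\xB6" => 'oe', "\xC3\x9F" => 'ss']);

var_dump(is_valid_utf8($broken)); // bool(false)
var_dump($ok);                    // string(7) "Groesse"
```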


  1. search the code base for functions that are not UTF-8 compatible and replace them
  2. replace SafeHtml with http://htmlpurifier.org/
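    • require_once '/path/to/library/HTMLPurifier.auto.php';
      // or, for the standalone distribution (use only one of the two):
      // require '/path/to/HTMLPurifier.standalone.php';
      
      $config = HTMLPurifier_Config::createDefault();
      $purifier = new HTMLPurifier($config);
      $clean_html = $purifier->purify($dirty_html);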
  3. write a conversion script that runs through the database and re-encodes everything as UTF-8
  4. check our cloned branch (more soon): http://wackowiki.hg.sourceforg[..]/wackowiki/mbstring/
  5. String access by character
  6. remove unneeded functions notes
    1. htmlentities
    2. html_entity_decode
  7. <form accept-charset="utf-8">
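
To illustrate the steps about incompatible functions and string access by character: the classic string functions and the [] offset syntax operate on bytes, while the mbstring functions operate on characters. A minimal sketch, assuming the mbstring extension is loaded and UTF-8 is the intended encoding; the sample string is made up for illustration:

```php
<?php
mb_internal_encoding('UTF-8');

$s = "\xC3\x84pfel"; // "Äpfel" written as UTF-8 bytes: 5 characters, 6 bytes

$bytes = strlen($s);    // 6, counts bytes
$chars = mb_strlen($s); // 5, counts characters

$first_byte = $s[0];               // "\xC3", only the first half of "Ä"
$first_char = mb_substr($s, 0, 1); // "Ä", the complete character

// mb_strtolower() lowercases non-ASCII letters such as "Ä" as well
$lower = mb_strtolower($s);        // "äpfel"

// splitting into characters with PCRE only, using the u modifier
$list = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
```

When searching the code base, calls such as strlen(), substr(), strtolower() and direct $string[$i] access are the typical candidates to review.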

MySQL – Migrating database data that is already encoded in latin1 to UTF-8

If you have an existing MySQL database whose data is encoded in latin1, here’s how to convert it to UTF-8: