Migration to Unicode (UTF-8)


Disadvantages


Requirements

  1. PCRE UTF-8 support: WackoWiki will not run if the PCRE extension of your PHP installation was compiled without UTF-8 support.
  2. Third-party libraries used by WackoWiki must also support UTF-8.
  3. utf8mb4: MySQL versions prior to 5.7.7 and MariaDB versions prior to 10.2.2 do not enable the innodb_large_prefix option by default. Without it, an index key prefix is limited to 767 bytes, which is only 191 characters at 4 bytes per character.
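
The utf8mb4 requirement comes down to index-size arithmetic, which can be sketched as follows. The sketch assumes the InnoDB limits of 767 bytes per index key prefix without innodb_large_prefix and 3072 bytes with it (DYNAMIC or COMPRESSED row format). The function name is made up for illustration.

```python
# Rough arithmetic: how many characters fit in an InnoDB index key prefix.
# Assumed limits: 767 bytes without innodb_large_prefix, 3072 bytes with it.
BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}  # worst-case bytes per character


def max_indexed_chars(charset: str, large_prefix: bool) -> int:
    """Return the longest fully indexable column length, in characters."""
    limit_bytes = 3072 if large_prefix else 767
    return limit_bytes // BYTES_PER_CHAR[charset]


for charset in ("utf8", "utf8mb4"):
    for large_prefix in (False, True):
        print(charset, large_prefix, max_indexed_chars(charset, large_prefix))
```

Under these assumptions a fully indexed VARCHAR(255) column fits with 3-byte utf8 but not with utf8mb4, unless the large prefix is available or the indexed length is reduced to 191 characters.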

1. Unicode normalization

  1. http://www.w3.org/TR/charmod-norm/

In essence, normalize all input from sources where you cannot be sure that it is already in a normalization form. In most cases NFC is the right choice, because most data is already in NFC.
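
The effect of normalizing untrusted input can be illustrated with a minimal sketch. It uses Python's standard unicodedata module purely to show the idea; in PHP the intl extension's Normalizer class would be the likely counterpart, and the helper name to_nfc is an assumption, not part of WackoWiki.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC, skipping the work when it is already normalized."""
    # unicodedata.is_normalized requires Python 3.8 or newer.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


decomposed = "Cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "Caf\u00e9"     # precomposed U+00E9

print(decomposed == composed)                    # False: different code points
print(to_nfc(decomposed) == composed)            # True: equal after NFC
print(len(decomposed), len(to_nfc(decomposed)))  # 5 4
```

Without this step, two visually identical page names or search terms can fail to compare equal.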

2. Steps

  1. Convert to UTF-8 without BOM.
    • Make sure that every included/required file is either ASCII or UTF-8 without BOM, as PHP does not handle non-ASCII files well.
    • Tools: iconv
      • find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o {}.new \;
  2. Search the code base for functions that are not UTF-8 compatible and replace them.
    • Add missing functions that are not covered by the PHP Multibyte String extension to lib/mb_extends/mb_extends.php.
  3. Add optional support for replacing SafeHtml with HTMLPurifier.
  4. Write a database conversion script that runs through the database and re-encodes everything as UTF-8.
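  5. Review string access by character, since indexing a PHP string by offset returns bytes rather than characters.
  6. Remove calls to functions that are no longer needed: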
    1. htmlentities
    2. html_entity_decode
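
The search for functions that are not UTF-8 compatible could be automated along the following lines. This is a rough sketch, not an existing WackoWiki tool: the mapping is illustrative rather than exhaustive, and the helper name find_byte_calls is hypothetical. It is a plain text match, so hits inside comments or strings are reported too.

```python
import re

# Byte-oriented PHP functions and their mbstring counterparts.
# The list is illustrative, not exhaustive.
MB_EQUIVALENTS = {
    "strlen": "mb_strlen",
    "substr": "mb_substr",
    "strpos": "mb_strpos",
    "strrpos": "mb_strrpos",
    "strtolower": "mb_strtolower",
    "strtoupper": "mb_strtoupper",
    "substr_count": "mb_substr_count",
}

# Match a bare call such as strlen( but not mb_strlen(, ->strlen( or $strlen(.
CALL_RE = re.compile(
    r"(?<![\w>$:])(" + "|".join(map(re.escape, MB_EQUIVALENTS)) + r")\s*\("
)


def find_byte_calls(php_source: str):
    """Yield (line_number, function, suggested_replacement) for each hit."""
    for lineno, line in enumerate(php_source.splitlines(), start=1):
        for match in CALL_RE.finditer(line):
            name = match.group(1)
            yield lineno, name, MB_EQUIVALENTS[name]


sample = """<?php
$n = strlen($title);
$short = mb_substr($title, 0, 10);
echo strtoupper(substr($tag, 0, 1));
"""
for hit in find_byte_calls(sample):
    print(hit)
```

Functions that turn up in such a scan but have no mbstring counterpart in the PHP version in use are the candidates for lib/mb_extends/mb_extends.php.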

2.1. MySQL – Migrating database data that is already encoded in latin1 to UTF-8

If you have an existing MySQL database that is already encoded in latin1, here’s how to convert it to UTF-8:


  • migrate all code and data in a single shot
  • downtime needed
  • trial runs needed
  • watch for cp1252 vs. ISO-8859-1: MySQL latin1 is based on cp1252, and the two encodings differ in the byte range 0x80-0x9F
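
Why the cp1252 vs. ISO-8859-1 distinction matters can be shown with a few bytes. The sample data below is made up for illustration; it decodes the same bytes under both encodings to show that they lead to different UTF-8 results.

```python
# Bytes 0x80-0x9F are printable characters in cp1252 but C1 control codes
# in ISO-8859-1, so the same stored bytes decode to different text.
raw = b"\x93quoted\x94 \x96 5\x80"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(as_cp1252)        # curly quotes, an en dash and a euro sign
print(repr(as_latin1))  # control characters instead of punctuation
print(as_cp1252.encode("utf-8") == as_latin1.encode("utf-8"))  # False
```

Text pasted from word processors typically contains such bytes, so a trial run should check a few pages with curly quotes or euro signs after conversion.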

  1. utf8mb4
  2. Converting your MySQL database to UTF8
  3. Database Conversion Script
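
As a starting point for a database conversion script, the statements to run could be generated as in the sketch below. The database and table names are hypothetical, utf8mb4_unicode_ci is an assumed collation choice, and the helper name conversion_statements is invented for this example.

```python
def conversion_statements(database, tables,
                          charset="utf8mb4", collation="utf8mb4_unicode_ci"):
    """Build the SQL needed to convert a database and its tables."""
    def quote(identifier: str) -> str:
        # Backtick-quote identifiers, doubling embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    yield (f"ALTER DATABASE {quote(database)} "
           f"CHARACTER SET {charset} COLLATE {collation};")
    for table in tables:
        yield (f"ALTER TABLE {quote(table)} "
               f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};")


# Hypothetical database and table names, for illustration only.
for statement in conversion_statements("wiki_db", ["wacko_page", "wacko_user"]):
    print(statement)
```

CONVERT TO re-encodes column contents and is only correct when the stored bytes really match the declared latin1 character set. If UTF-8 bytes were previously written into latin1 columns, this approach would double-encode them, which is one reason trial runs on a copy of the data are needed.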

3. Resources

!/Resources