Migration to Unicode (UTF-8)
Disadvantages
- mishmash of external libraries
- development culture means that people work on what they're interested in
Requirements
- PCRE UTF-8 support: WackoWiki will not run unless the PCRE extension of your PHP installation is compiled with UTF-8 support.
- Third-party libraries used by WackoWiki must also support UTF-8.
- utf8mb4: MySQL versions prior to 5.7.7 and MariaDB versions prior to 10.2.2 do not enable the innodb_large_prefix option by default. Without it, InnoDB index keys are limited to 767 bytes, which is 191 characters at 4 bytes per utf8mb4 character.
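The PCRE requirement above can be checked before starting the migration. The following is only a sketch: the function name check_pcre_utf8 is made up for this example, and it assumes that the php command-line binary is on the PATH and uses the same PCRE build as the web server.

```shell
#!/bin/sh
# Sketch: exit status 0 if PHP's PCRE accepts the /u (UTF-8) modifier.
# check_pcre_utf8 is a hypothetical helper name, not part of WackoWiki.
check_pcre_utf8() {
    # preg_match() returns 1 on a match and false if /u is not supported.
    php -r 'exit(@preg_match("/./u", "a") === 1 ? 0 : 1);'
}
```

It could be used as: check_pcre_utf8 && echo "PCRE UTF-8 support found"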
1. Unicode normalization
In essence, normalize all input from sources where you can't be sure that it's in normal form. In most cases, you should use NFC because most data will be in NFC already.
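As a rough illustration of normalizing to NFC, the sketch below pipes text through uconv. It assumes that the uconv tool from ICU is installed; the function name to_nfc is invented for this example. Inside the PHP code itself another mechanism would be needed.

```shell
#!/bin/sh
# Sketch: normalize UTF-8 text on stdin to NFC and write it to stdout.
# Assumption: uconv (part of ICU) is available; to_nfc is a hypothetical name.
to_nfc() {
    uconv -f UTF-8 -t UTF-8 -x Any-NFC
}
```

For example, printf 'e\314\201' | to_nfc turns the sequence "e" followed by a combining acute accent (U+0301) into the single precomposed character U+00E9.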
2. Steps
- Convert all files to UTF-8 without BOM.
- Make sure that every included or required file is either ASCII or UTF-8 without BOM, because PHP does not handle non-ASCII files well (a BOM, for example, is sent to the browser as output).
- Tools: iconv
  find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o {}.new \;
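The one-line find command converts every file, including files that are already UTF-8. A slightly more careful variant might look like the sketch below. It assumes that every file that is not valid UTF-8 is ISO-8859-1; the function name to_utf8_no_bom is made up for this example.

```shell
#!/bin/sh
# Sketch: convert one file to UTF-8 without BOM, writing the result to "$1.new".
# Assumption: anything that is not already valid UTF-8 is ISO-8859-1.
to_utf8_no_bom() {
    src=$1
    bom=$(printf '\357\273\277')    # the three BOM bytes EF BB BF
    if iconv -f UTF-8 -t UTF-8 "$src" >/dev/null 2>&1; then
        # Already valid UTF-8 (or plain ASCII): only strip a leading BOM.
        sed "1s/^$bom//" "$src" > "$src.new"
    else
        # Not valid UTF-8: treat the bytes as ISO-8859-1 and convert them.
        iconv -f ISO-8859-1 -t UTF-8 "$src" > "$src.new"
    fi
}
```

To run it over a tree: find . -name "*.php" | while IFS= read -r f; do to_utf8_no_bom "$f"; done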
- Search the code base for functions that are not UTF-8 compatible and replace them.
- Add missing functions that are not covered by the PHP Multibyte String extension to lib/mb_extends/mb_extends.php.
- Add optional support to replace SafeHtml with HTMLPurifier.
- Write a database conversion script that runs through the database and re-encodes everything as UTF-8.
- Check string access by character: in PHP, $str[$i] returns a byte, not a character.
- Remove unneeded functions. Notes:
  - htmlentities
  - html_entity_decode
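The search for functions that are not UTF-8 compatible can be started with a plain grep. The sketch below is one possible approach: the function name find_byte_functions is made up, and the list of PHP functions is illustrative rather than complete. It reports calls to byte-oriented string functions that have an mb_* counterpart.

```shell
#!/bin/sh
# Sketch: list calls to byte-oriented PHP string functions below a directory.
# The function list is an illustrative subset, not a complete inventory.
find_byte_functions() {
    dir=$1
    # The leading character class skips mb_strlen() and method calls such as
    # $obj->substr(); /dev/null forces grep to always print the file name.
    find "$dir" -name '*.php' -exec grep -nE \
        '(^|[^A-Za-z0-9_>])(strlen|strpos|strrpos|substr|strtolower|strtoupper|str_split)[[:space:]]*\(' \
        /dev/null {} +
}
```

Each output line has the form file:line:code, so the result can be worked through as a checklist. Matches inside comments and strings are reported as well and need manual review.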
2.1. MySQL - Migrating database data that is already encoded in latin1 to UTF-8
If you have an existing MySQL database that is already encoded in latin1, here's how to convert it to UTF-8:
- migrate all code and data in a single shot
- downtime needed
- trial runs needed
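- watch for cp1252 vs. ISO-8859-1: MySQL's latin1 is actually cp1252, which assigns printable characters such as curly quotes and the euro sign to bytes in the range 0x80-0x9F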
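One common way to do the conversion in a single shot is dump, re-encode, and re-import. The sketch below covers only the re-encoding step. It assumes that the dump was created with mysqldump --default-character-set=latin1 --skip-set-charset, that the target character set is utf8mb4, and that the stored bytes really are cp1252. The function name convert_dump and the file names are made up.

```shell
#!/bin/sh
# Sketch: re-encode a latin1 dump as UTF-8 and retarget its table definitions.
# Assumed dump command (not run here):
#   mysqldump --default-character-set=latin1 --skip-set-charset DBNAME > dump.latin1.sql
convert_dump() {
    in=$1
    out=$2
    # Decode as cp1252, not ISO-8859-1, so that bytes 0x80-0x9F become curly
    # quotes, the euro sign, etc. iconv may stop with an error on the five
    # bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    iconv -f CP1252 -t UTF-8 "$in" > "$out.tmp" || return 1
    # Table defaults only; column-level "CHARACTER SET latin1" is not handled.
    sed 's/CHARSET=latin1/CHARSET=utf8mb4/g' "$out.tmp" > "$out" && rm -f "$out.tmp"
}

# Assumed import command (not run here):
#   mysql --default-character-set=utf8mb4 DBNAME < dump.utf8.sql
```

Running convert_dump dump.latin1.sql dump.utf8.sql during a trial run also shows early whether the data contains bytes that are not valid cp1252.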