𝔘𝔫𝔦𝔠𝔬𝔡𝔢 Resources

  1. The original paper from Bell Labs on UTF-8
  2. A tutorial on character code issues - the MUST READ
  3. Mojibake
  4. Character Sets / Character Encoding Issues
  5. Handling UTF-8 with PHP
  6. Migrating to Unicode
  7. http://en.wikipedia.org/wiki/UTF-8
  8. Unicode block
  9. Unicode Cheat Sheet
  10. http://htmlpurifier.org/docs/enduser-utf8.html
  11. What every programmer absolutely, positively needs to know about encodings and character sets to work with text
  12. https://www.sitepoint.com/brin[...]-with-portable-utf8/
  13. W3C: Character Model for the World Wide Web: String Matching
  14. https://unicode.org
    1. Unicode Regular Expressions
    2. Unicode Bidirectional Algorithm
    3. Unicode Security Considerations
    4. Unicode Normalization Forms
  15. http://www.utf8everywhere.org/
  16. Quotes

̲ᴛ̲ʜ̲ᴇ̲ʀ̲ᴇ̲ ̲ɪ̲s̲ ̲ɴ̲ᴏ̲ ̲U̲ɴ̲ɪ̲ᴄ̲ᴏ̲ᴅ̲ᴇ̲ ̲ᴍ̲ᴀ̲ɢ̲ɪ̲ᴄ̲ ̲ʙ̲ᴜ̲ʟ̲ʟ̲ᴇ̲ᴛ̲ ̲


💩 𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤 💩


😈 ¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁ 😈

PHP 7.x/8.x

  • String literals in PHP are still fundamentally sequences of bytes.
  • It is up to developers to deal with character encoding issues using mbstring, iconv, UConverter, etc.
  • The intl extension exposes much of the functionality that was originally planned for PHP 6, so it can be used in PHP 7/8.
  • PHP 7 helps by adding the Unicode code point escape syntax \u{[0-9A-Fa-f]+}, which emits UTF-8 in double-quoted strings and heredocs, and the IntlChar class.
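
The points above can be illustrated with a minimal sketch. It assumes nothing beyond core PHP 7+ for the byte-level checks; the mbstring and intl calls are guarded because those extensions may not be installed, and the variable names are purely illustrative.

```php
<?php
declare(strict_types=1);

// The \u{...} escape (PHP 7.0+) is resolved into UTF-8 bytes inside
// double-quoted strings.
$pile = "\u{1F4A9}"; // U+1F4A9 PILE OF POO

// Core string functions see bytes, not characters.
$byteLength = strlen($pile);   // 4
$hexBytes   = bin2hex($pile);  // "f09f92a9"

// Counting code points needs an encoding-aware function (mbstring here).
// Assumption: fall back to null when the extension is missing.
$charLength = function_exists('mb_strlen') ? mb_strlen($pile, 'UTF-8') : null;

// IntlChar (intl extension) exposes Unicode character properties.
$charName = class_exists('IntlChar') ? IntlChar::charName($pile) : null;

printf("bytes: %d, hex: %s\n", $byteLength, $hexBytes);
var_dump($charLength, $charName);
```

The same four bytes can be written by hand as "\xF0\x9F\x92\xA9", which is a quick way to confirm that the escape produces UTF-8 rather than some other encoding.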

Unicode & IDNA

RFC 3986 supports neither IDNA nor non-ASCII characters. WHATWG URL supports both IDNA and Unicode characters, and it explicitly suggests that browsers render the host component using Unicode characters.
The recommendation is not just about user-friendliness: it is necessary for security reasons, because it reduces the human risk factor in exploits. For example, “xn--google.com” could deceive an uninitiated reader into thinking it is a Google domain, when in fact the IDNA domain decodes to “䕮䕵䕶䕱.com”.
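
To see why the label decodes that way, here is a rough sketch of the Punycode decoding procedure from RFC 3492 in plain PHP. The function name punycode_decode_sketch is hypothetical, and the sketch deliberately skips overflow checks and all IDNA/UTS #46 validation, so it should not be treated as a production decoder.

```php
<?php
declare(strict_types=1);

/**
 * Decode a Punycode label body (the part after "xn--") into Unicode code
 * points, following RFC 3492. Hypothetical helper, illustration only.
 *
 * @return int[]
 */
function punycode_decode_sketch(string $input): array
{
    $base = 36; $tMin = 1; $tMax = 26; $skew = 38; $damp = 700;
    $n = 128;    // first non-ASCII code point
    $bias = 72;  // initial bias
    $i = 0;
    $output = [];

    // Everything before the last "-" is copied as literal ASCII.
    $b = strrpos($input, '-');
    if ($b === false) {
        $b = 0;
    }
    for ($j = 0; $j < $b; $j++) {
        $output[] = ord($input[$j]);
    }

    $len = strlen($input);
    for ($in = $b > 0 ? $b + 1 : 0; $in < $len; ) {
        $oldI = $i;
        $w = 1;
        // Read one variable-length integer (a "delta").
        for ($k = $base; ; $k += $base) {
            if ($in >= $len) {
                throw new InvalidArgumentException('Truncated Punycode input');
            }
            $c = ord($input[$in++]);
            if ($c >= 0x30 && $c <= 0x39) {
                $digit = $c - 22;        // '0'..'9' map to 26..35
            } elseif ($c >= 0x41 && $c <= 0x5A) {
                $digit = $c - 0x41;      // 'A'..'Z' map to 0..25
            } elseif ($c >= 0x61 && $c <= 0x7A) {
                $digit = $c - 0x61;      // 'a'..'z' map to 0..25
            } else {
                throw new InvalidArgumentException('Invalid Punycode digit');
            }
            $i += $digit * $w;
            $t = $k <= $bias ? $tMin : ($k >= $bias + $tMax ? $tMax : $k - $bias);
            if ($digit < $t) {
                break;
            }
            $w *= $base - $t;
        }

        // Bias adaptation, as specified in RFC 3492 section 6.1.
        $count = count($output) + 1;
        $delta = intdiv($i - $oldI, $oldI === 0 ? $damp : 2);
        $delta += intdiv($delta, $count);
        for ($k = 0; $delta > intdiv(($base - $tMin) * $tMax, 2); $k += $base) {
            $delta = intdiv($delta, $base - $tMin);
        }
        $bias = $k + intdiv(($base - $tMin + 1) * $delta, $delta + $skew);

        // The delta encodes both the code point and its insertion position.
        $n += intdiv($i, $count);
        $i %= $count;
        array_splice($output, $i, 0, [$n]);
        $i++;
    }

    return $output;
}

// "xn--google" carries the Punycode body "google".
$codePoints = punycode_decode_sketch('google');
echo implode(' ', array_map(
    static fn (int $cp): string => sprintf('U+%04X', $cp),
    $codePoints
)), "\n"; // U+456E U+4575 U+4576 U+4571
```

None of the decoded code points is an ASCII letter, which is why the raw "xn--" form is misleading. In real code the intl extension's idn_to_utf8() would normally be the better choice, since it also applies the IDNA validation rules this sketch omits.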
