
PHP UTF-8 cheatsheet

Nick Nettleton | 03 July 2006

When we started building DropSend, we decided to support all languages worldwide from the start. The interface is currently in English only, but the application can send, store, sort and process your data in whatever language you want. As a result, we have a good number of customers out east.

To support worldwide languages, you need to use UTF-8 encoding for your web pages, emails and application, rather than ISO 8859-1 or another common western encoding, since these don't support characters used in languages such as Japanese and Chinese.

Happily, UTF-8 is backward compatible with the core Latin character set (ASCII): plain ASCII text is already valid UTF-8, so you won't need to convert all your data to start using UTF-8. But there are a number of other issues to deal with. In particular, UTF-8 is a multibyte encoding, meaning one character can be represented by one or more bytes. This causes trouble for PHP, because the language parses and processes strings based on bytes, not characters, and makes mincemeat of multibyte strings - for example, by splitting characters 'in half', bodging up regular expressions, and rendering email unreadable.
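To make the byte-versus-character problem concrete, here is a minimal sketch. It assumes the mbstring extension is installed, and the sample string and variable names are purely illustrative, not taken from DropSend. It compares PHP's byte-based string functions with their multibyte counterparts on a three-character Japanese string.

```php
<?php
// The string "日本語" written as raw UTF-8 bytes, so the example does not
// depend on how this file itself is saved. Each character takes 3 bytes.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// strlen() counts bytes, not characters.
$byteLength = strlen($text);              // 9

// mb_strlen() counts characters when told the encoding.
$charLength = mb_strlen($text, 'UTF-8');  // 3

// substr() cuts on byte boundaries: taking 2 bytes slices the first
// character part-way through and leaves an invalid UTF-8 sequence.
$brokenPrefix = substr($text, 0, 2);
$isBrokenValid = mb_check_encoding($brokenPrefix, 'UTF-8'); // false

// mb_substr() cuts on character boundaries: the first 2 characters.
$cleanPrefix = mb_substr($text, 0, 2, 'UTF-8');             // "日本"
```

The general pattern, under these assumptions, is to reach for the `mb_*` function and pass the encoding explicitly wherever a string may contain non-ASCII text.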

There are a number of great articles online about UTF-8 and how it works - Joel Spolsky's comes to mind - but very few about how to actually get it working with PHP and iron out all the bugs. So, to save you the time we put in, here is a quick cheatsheet and info about a few common issues.
