PHP UTF-8 cheatsheet

Stopgap guide to going UTF-8

When we started building DropSend, we decided to support all languages worldwide from the start. The interface is currently in English only, but the application can send, store, sort and process your data whatever language you want. As a result, we have a good number of customers out east.

To support worldwide languages, you need to use UTF-8 encoding for your web pages, emails and application, rather than ISO 8859-1 or another common western encoding, since these don’t support characters used in languages such as Japanese and Chinese.

Happily, UTF-8 is transparent to the core Latin characterset, so you won’t need to convert all your data to start using UTF-8. But there are a number of other issues to deal with. In particular, because UTF-8 is a multibyte encoding, meaning one character can be represented by more one or more bytes. This causes trouble for PHP, because the language parses and processes strings based on bytes, not characters, and makes mincemeat multibyte strings – for example, by splitting characters ‘in half’, bodging up regular expressions, and rendering email unreadable.

There are a number of great articles online about UTF-8 and how it works – Joel Spolski’s comes to mind – but very few about how to actually get it working with PHP and iron out all the bugs. So, here to save you the time we put in, is a quick cheatsheet and info about a few common issues.

1. Update your database tables to use UTF-8

CREATE DATABASE db_name
	CHARACTER SET utf8
	DEFAULT CHARACTER SET utf8
	COLLATE utf8_general_ci
	DEFAULT COLLATE utf8_general_ci
	;

ALTER DATABASE db_name
	CHARACTER SET utf8
	DEFAULT CHARACTER SET utf8
	COLLATE utf8_general_ci
	DEFAULT COLLATE utf8_general_ci
	;

ALTER TABLE tbl_name
	DEFAULT CHARACTER SET utf8
	COLLATE utf8_general_ci
	;

2. Install the mbstring extension for PHP

Windows: download the dll if it’s not in your PHP extensions folder, and uncomment the relevant line in your php.ini file: extension=php_mbstring.dll
Linux: yum install php-mbstring

3. Configure mbstring

Do this in php.ini, httpd.conf or .htaccess. (Remember to prepend these with ‘php_value ‘ in httpd.conf or .htaccess.)

mbstring.language		= Neutral	; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding	= UTF-8		; Set default internal encoding to UTF-8
mbstring.encoding_translation	= On		;  HTTP input encoding translation is enabled
mbstring.http_input		= auto		; Set HTTP input character set dectection to auto
mbstring.http_output		= UTF-8		; Set HTTP output encoding to UTF-8
mbstring.detect_order		= auto		; Set default character encoding detection order to auto
mbstring.substitute_character	= none		; Do not print invalid characters
default_charset			= UTF-8		; Default character set for auto content type header

4. Deal with non-multibyte-safe functions in PHP

The fast-and-loose way to do this is with the following php configuration:

mbstring.func_overload = 7 ; All non-multibyte-safe functions are overloaded with the mbstring alternatives

But there are problems with this. php.net has a warning about this potentially affecting the whole server. And even if this isn’t an issue for you, mbstring can make a mess of binary strings.

So, a better route is to search your application code for the following functions, and replace them with mbstring’s ‘slot-in’ alternatives:

mail()		-> mb_send_mail()
strlen()	-> mb_strlen()	
strpos()	-> mb_strpos()
strrpos()	-> mb_strrpos()
substr()	-> mb_substr()
strtolower()	-> mb_strtolower()
strtoupper()	-> mb_strtoupper()
substr_count()	-> mb_substr_count()
ereg()		-> mb_ereg()
eregi()		-> mb_eregi()
ereg_replace()	-> mb_ereg_replace()
eregi_replace()	-> mb_eregi_replace()	
split()		-> mb_split()

5. Sort out HTML entities

The htmlentities() function doesn’t work automatically with multibyte strings. To save time, you’ll want to create a wrapper function and use this instead:

/**
 * Encodes HTML safely for UTF-8. Use instead of htmlentities.
 *
 * @param string $var
 * @return string
 */
function html_encode($var)
{
	return htmlentities($var, ENT_QUOTES, 'UTF-8') ;
}

6. Check content-type headers

Check through your code for any text-based content-type headers, and append the UTF-8 charset, so the browser knows what it’s working with:

header('Content-type: text/html; charset=UTF-8') ;

You should also repeat this at the top of HTML pages:

<meta http-equiv="Content-type" value="text/html; charset=UTF-8" />

7. Update email scripts

Email can be tricky. You’ll need to update the content-type for any emails and text-based mime parts to use UTF-8 encoding. You’ll also need to alter the way in which headers are encoded to use UTF-8. mbstring provides a function mb_encode_mimeheader() to handle this for you, but it does make a mess of address lists: you’ll need to encoding the name and address parts seperately, then compile them into an address list.

Be sure to encode the subject and other headers too – Korean speakers will tend to put Korean text for the subject.

9. Check binary files and strings

Finally, double check any binary files and strings handled by PHP, particularly uploads, downloads and encryption. In some cases it may be necessary to revert to ASCII just before a download or processing a binary string.

Latest News & Insights

Say connected – get Loft updates straight to your inbox.