The intent of this article is to tie together some things I’ve learned to do in order to get my web apps to “play nicely” with the UTF8 character set. Before we go any further, let me state that I do not claim to be an expert on this; the following is simply a collection of things I’ve discovered here and there on the web, and which together seem to help smooth out most of the bumps in the road of using UTF8.
So let’s start with the database itself. To get your varchar and text fields talking UTF8, you should assign both the character set and a corresponding collation. (See the MySQL manual section on character sets and collations to see the differences between the various collation types.) You can assign this stuff at the field level should you desire, but generally I just assign it at the table level:
CREATE TABLE `test`.`example_table` (
  `sample_field` VARCHAR(255) NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Now the next thing we need to do is tell the MySQL client being called from PHP that we are going to be speaking UTF8. I find it simplest to do this as part of the connection process and be done with it:
<?php
$dbconnx = mysql_connect('localhost', $dbuser, $dbpass);
// Tell the server that this connection sends and expects UTF-8
mysql_query("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'", $dbconnx);
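The mysql_* functions used above were removed in PHP 7, so here is a minimal sketch of the same idea with the mysqli extension. The helper name connectUtf8() is my own invention for illustration, and the extra collation query is only there to mirror the SET NAMES statement above; treat it as one possible approach rather than the definitive one.

```php
<?php
// Hypothetical helper (not part of any library): opens a mysqli
// connection and switches it to UTF-8.
function connectUtf8($host, $user, $pass)
{
    $db = new mysqli($host, $user, $pass);
    if ($db->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }

    // set_charset() changes the connection character set and also
    // informs the client library, so string escaping stays charset-aware.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not switch to utf8: ' . $db->error);
    }

    // Match the collation used in the SET NAMES statement above.
    $db->query("SET collation_connection = 'utf8_unicode_ci'");

    return $db;
}
```

You would then call it once at connection time, for example $db = connectUtf8('localhost', $dbuser, $dbpass);, and reuse $db for every query.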
Your [X]HTML output needs to let the client know what character set you are using. It may not be sufficient to simply specify it within a <meta> Content-Type tag or the “encoding” attribute of your <?xml declaration, as the web server may send its own default Content-Type header with another character set, and that header takes precedence. Therefore you may want to add a PHP header() call before any output is sent. (Passing true as the second parameter makes this header replace any existing Content-Type header.)
<?php header('Content-Type: text/html; charset=utf-8', true);
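If you send this header from more than one script, it may help to build the header line in one place. The following is just a sketch; contentTypeHeader() is a made-up helper name, and the headers_sent() guard is there because header() cannot change anything once output has started.

```php
<?php
// Hypothetical helper: builds the header line so the charset
// is defined in a single place.
function contentTypeHeader($mime = 'text/html', $charset = 'utf-8')
{
    return 'Content-Type: ' . $mime . '; charset=' . $charset;
}

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader(), true);
}
```

For XHTML served with its own MIME type you could call contentTypeHeader('application/xhtml+xml') instead.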
Any [X]HTML forms used to input data into your application should be informed that you are expecting UTF-8 text from them. This can be done via the FORM tag’s “accept-charset” attribute:
<form action="script.php" method="post" accept-charset="UTF-8">
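The accept-charset attribute is only a request to the browser, so you may still want to check on the server that what arrived really is UTF-8. Here is a small sketch of one way to do that; isValidUtf8() and the 'comment' field name are assumptions of mine for illustration. It relies on the PCRE /u modifier, which makes preg_match() fail on malformed UTF-8.

```php
<?php
// Hypothetical helper: true if $value is a well-formed UTF-8 string.
function isValidUtf8($value)
{
    // With the /u modifier PCRE validates the subject as UTF-8;
    // preg_match() returns false (not 0) when the subject is malformed.
    return is_string($value) && preg_match('//u', $value) === 1;
}

// 'comment' is an assumed field name for illustration.
$comment = isset($_POST['comment']) ? $_POST['comment'] : '';
if (!isValidUtf8($comment)) {
    $comment = '';   // or reject the request, depending on the application
}
```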
[Added 2008/10/04, edited 2009/04/24] Lastly, if you use the htmlentities() function to filter your output, be sure to set its optional 3rd argument to “UTF-8” so that it is aware of what character set is being used. If not, its output will likely be wrong for many entities.
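To illustrate the difference the third argument makes, here is a short sketch. The wrapper name h() is hypothetical, just a convenience so the charset is never forgotten, and the sample string is written as raw bytes so the example does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical wrapper so the charset argument is never forgotten.
function h($text)
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

$name = "caf\xC3\xA9 <b>";   // "café <b>" written as raw UTF-8 bytes

// Correct: the two bytes of "é" are treated as one character.
echo h($name), "\n";

// Wrong charset: each byte of "é" is converted as a separate Latin-1 character.
echo htmlentities($name, ENT_QUOTES, 'ISO-8859-1'), "\n";
```

The first line prints caf&eacute; &lt;b&gt; while the second prints caf&Atilde;&copy; &lt;b&gt;, which is the kind of garbled output you get when the function is told the wrong character set.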
If you have any other tips/tricks for dealing with UTF8 in PHP, add a comment here to share it with us.