UTF8 in PHP and MySQL

The intent of this article is to tie together some things I’ve learned to do in order to get my web apps to “play nicely” with the UTF8 character set. Before we go any further, let me state that I do not claim to be an expert on this; the following is simply a collection of things I’ve discovered here and there on the web, and which together seem to help smooth out most of the bumps in the road of using UTF8.

So let’s start with the database itself. To get your varchar and text fields talking UTF8, you should assign both the character set and a corresponding collation. (See the MySQL manual section on character sets and collations to see the differences between the various collation types.) You can assign this stuff at the field level should you desire, but generally I just assign it at the table level:

CREATE TABLE `test`.`example_table` (
  `sample_field` VARCHAR(255) NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Now the next thing we need to do is tell the MySQL client being called from PHP that we are going to be speaking UTF8. I find it simplest to do this as part of the connection process and be done with it:

// Open the connection, then tell MySQL that this client sends and expects UTF-8.
$dbconnx = mysql_connect('localhost', $dbuser, $dbpass);
mysql_query("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'", $dbconnx);
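The mysql_* functions were removed in PHP 7, so on a current install the same step has to go through mysqli or PDO. Below is a minimal sketch using PDO, where the charset named in the DSN plays the role of SET NAMES. The helper names build_utf8_dsn() and connect_utf8() are my own inventions for illustration, as is the choice of utf8mb4 as the default (MySQL's utf8 stores at most three bytes per character, while utf8mb4 covers the full Unicode range). Note that the DSN sets only the character set; the collation falls back to the server's default for that set.

```php
<?php
// Hypothetical helper (not from the article): builds a PDO DSN that names
// the connection character set, which stands in for SET NAMES.
function build_utf8_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical helper: opens the connection. Requires the pdo_mysql extension.
function connect_utf8(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(build_utf8_dsn($host, $dbname), $user, $pass, [
        // Throw exceptions on errors instead of failing silently.
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}
```

Usage would look something like $db = connect_utf8('localhost', 'test', $dbuser, $dbpass); with your own credentials and database name.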

Your [X]HTML output needs to let the client know what character set you are using. It may not be sufficient to simply specify it within a <meta> Content-Type tag or the “encoding” attribute of your <?xml declaration, as the web server may send its own default Content-Type header with another character set, and browsers give the HTTP header priority. Therefore you may want to add a PHP header function call. (Passing true as the 2nd parameter makes this header replace any existing Content-Type header. That is also the default behavior, so it is spelled out here only for clarity.)

header('Content-Type: text/html; charset=utf-8', true);
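One catch with header() is that it only works before any output has been sent. A small sketch of how that could be guarded is below; content_type_header() is a hypothetical helper of mine, not a PHP built-in, and it exists only so the charset is spelled the same way wherever the header is built.

```php
<?php
// Hypothetical helper: builds the header line in one place.
function content_type_header(string $mime = 'text/html', string $charset = 'utf-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mime, $charset);
}

// header() must run before any output; headers_sent() lets us avoid the
// "headers already sent" warning if something has already been echoed.
if (!headers_sent()) {
    header(content_type_header());
}
```

For a non-HTML response the same helper could be called with another type, e.g. content_type_header('text/plain').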

Any [X]HTML forms used to input data into your application should be informed that you are expecting UTF-8 text from them. This can be done via the FORM tag’s “accept-charset” attribute:

<form action="script.php" method="post" accept-charset="UTF-8">
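Keep in mind that accept-charset is only a request to the browser, and a script can be sent arbitrary bytes regardless. As a rough sketch, the receiving script could check that what arrived is well-formed UTF-8 before using it. The helper name is_valid_utf8() and the field name 'comment' are assumptions made up for this example.

```php
<?php
// Hypothetical helper: true when $value is a string of well-formed UTF-8.
// An empty pattern with the /u modifier makes PCRE validate the subject;
// preg_match() returns false (not 1) when the bytes are not valid UTF-8.
function is_valid_utf8($value): bool
{
    return is_string($value) && preg_match('//u', $value) === 1;
}

// 'comment' is an assumed field name, used only for illustration.
$comment = $_POST['comment'] ?? '';
if (!is_valid_utf8($comment)) {
    $comment = ''; // or reject the request, depending on the application
}
```

If the mbstring extension is available, mb_check_encoding($value, 'UTF-8') is another way to do the same check.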

[Added 2008/10/04, edited 2009/04/24] Lastly, if you use the htmlentities() function to filter your output, be sure to set its optional 3rd argument to “UTF-8” so that it is aware of what character set is being used. If not, its output will likely be wrong for many entities.
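To make that concrete, here is a small sketch that wraps htmlentities() so the character set is always passed. The wrapper name e() is hypothetical. PHP 5.4 and later already default to UTF-8, but passing it explicitly keeps the behavior the same on older installs. If the same bytes were instead treated as ISO-8859-1, each byte of a multi-byte character would be converted on its own, so “é” would come out as two unrelated entities.

```php
<?php
// Hypothetical helper: escapes text for HTML output, treating it as UTF-8.
function e(string $text): string
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

// "café <b>" written with explicit UTF-8 bytes, so the example does not
// depend on how this file itself is saved.
$name = "caf\xC3\xA9 <b>";
echo e($name), "\n";
```

ENT_QUOTES is used here so that both single and double quotes are escaped as well; ENT_COMPAT would leave single quotes alone.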

If you have any other tips/tricks for dealing with UTF8 in PHP, add a comment here to share it with us.


UTF8 in PHP and MySQL — 4 Comments

  1. Interesting :). Here’s my experience with encodings and stuff like that (I’m not an expert either):
    I built this application which had a PHP script that ran on the command line, connected to other servers, and stored data in a database. This data would contain “funny characters”. The problem was not with inserting the data, which worked flawlessly (even though the tables and fields are set to the latin1_swedish_ci collation; I think that’s the default one phpMyAdmin uses), but with reading the data. In phpMyAdmin I would see the data just as it should appear, but in my own application it appeared incorrectly. I tried everything: utf8_decode(), iconv(), etc. I was very frustrated and searched the web for hours, and came up with a simple solution:
    mysql_query("SET NAMES 'utf8'");
    Pretty similar to yours, indeed :P. I should probably convert those tables to utf8_unicode_ci, shouldn’t I?
    Off topic: can you make this text box a bit wider? I’m getting a claustrophobic kind of feeling typing here :P


    PS: Yes! I’m registered! :)

  2. Off topic: can you make this text box a bit wider? I’m getting a claustrophobic kind of feeling typing here

    How’s this?

  3. Regarding the 2008/10/04 addition: I think using htmlentities($string, ENT_COMPAT, "UTF-8"); should work fine and it would even be safer than htmlspecialchars() because it will convert more characters. :-)