I spent way too long this weekend on a problem that had such a simple solution. I guess the issue had a little to do with the fact that I use the CodeIgniter framework, which does so much of the hard work for you. It’s easy to get complacent.
I have been working with text files that contain multi-byte characters, and had previously ensured that my database and tables were set up for UTF-8 and that everything in CodeIgniter was correctly configured. Yet still I was getting invalid character errors on the database insert.
As the text files were of varying formats, including Excel’s Unicode CSV format, I had already ensured that reading a text file also included conversion to UTF-8. Thanks to the script on Practical Web Ltd, I was detecting the format of each file and converting it to UTF-8 on the fly. Yet still I was getting invalid character errors on the database insert.
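For anyone wondering what that kind of on-the-fly conversion looks like, here is a rough sketch. It is not the Practical Web Ltd script; `to_utf8()` is a name I made up for illustration, and the fallback to Windows-1252 for files without a byte order mark is purely an assumption that you should adjust to your own data.

```php
<?php
// Hypothetical helper: normalise the raw contents of a text file to UTF-8.
// It relies on the byte order mark (BOM) where one is present.
function to_utf8(string $raw): string
{
    // UTF-8 BOM: the data is already UTF-8, just strip the marker.
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        return substr($raw, 3);
    }
    // UTF-16 little-endian BOM.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    }
    // UTF-16 big-endian BOM.
    if (strncmp($raw, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }
    // No BOM: keep the data if it is already valid UTF-8.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Assumption: anything else is treated as Windows-1252.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}
```

Note that every mbstring call here names its encodings explicitly, so the helper does not depend on whatever default mbstring happens to be using.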
I even ran through my code line by line and checked for any string manipulation I was doing using non-safe string functions. Yet still I was getting invalid character errors on the database insert.
If I had any decent amount of hair left, I would certainly have pulled it all out by the time I figured out what was wrong. I only discovered the answer by accident when I decided to remove the string manipulation altogether. As soon as I did that, it worked a treat. Had I discovered a bug in the multibyte string functions? No.
I had not checked the default encoding of mbstring.
So please, make sure it is on your checklist of things to do when dealing with multi-byte strings. Either set the default encoding correctly or religiously pass the encoding parameter to the multi-byte string functions.
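Here is a minimal sketch of both options, plus the kind of damage the wrong encoding can do. The sample string is my own, written with byte escapes so it does not depend on how the file is saved. What `mb_internal_encoding()` reports on your server depends on your PHP version and php.ini, so I have not shown an expected value for it.

```php
<?php
// See what mbstring assumes when no encoding parameter is given.
echo mb_internal_encoding(), PHP_EOL;

// Option 1: set the default once, early on (e.g. in a bootstrap or config file).
mb_internal_encoding('UTF-8');

// Sample data: "Zoë Müller" as UTF-8 bytes (illustrative value only).
$name = "Zo\xC3\xAB M\xC3\xBCller";

// Option 2: pass the encoding explicitly on every call.
$short  = mb_substr($name, 0, 3, 'UTF-8');  // first three characters: "Zoë"
$length = mb_strlen($name, 'UTF-8');        // 10 characters

// What goes wrong with a single-byte encoding: the cut lands in the
// middle of "ë", leaving a string that is no longer valid UTF-8.
$broken  = mb_substr($name, 0, 3, 'ISO-8859-1');
$isValid = mb_check_encoding($broken, 'UTF-8'); // false
```

A string like `$broken` is exactly the sort of thing a UTF-8 database column will reject on insert.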
Even better, you could use the great checklist on nicknettleton.com (see below), which seems to cover everything.
I totally deserved the dunce hat.
Edit: Looks like the link on nicknettleton.com is no longer available (thanks @Les). A little digging around led me to the same PHP UTF-8 checklist on another site.