Tim Blair

Writing UTF-8 Text Files with ColdFusion (13/05/2004)

Recently I was having problems trying to write out UTF-8 encoded files using ColdFusion. I would write out data, correctly encoded, from the database, but when the file was read back in, all character encoding was lost - CF was reading the file as a single-byte encoding rather than as multi-byte UTF-8.

My guess was that one of two things was wrong: either the file wasn't being written correctly, or it was being read incorrectly. A few more trials showed that, although the <cffile> call to write out the data was being made with charset="UTF-8", the file wasn't being treated as UTF-8 when read back. The contents themselves were correctly encoded, but ColdFusion did not recognise the file as UTF-8... Time for Google.

After a stressful afternoon I found the problem, and it's not a problem with the code, the file system, or ColdFusion. In fact, it's a limitation of the underlying JVM. There are two parts to a UTF-8 file - firstly the data must be correctly encoded, and secondly the file should be marked with a "Byte Order Mark" (BOM), which allows whatever process is reading the file to spot that it's a UTF-8 (multi-byte) file and treat the contents accordingly.
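To make the BOM concrete: in UTF-8 it is the three bytes EF BB BF (the encoding of U+FEFF) at the very start of the file. The following is a minimal sketch of how a reader might check for it. The class and method names (BomCheck, hasUtf8Bom) are made up for illustration, and it assumes a modern JDK with StandardCharsets rather than the 1.4.2 JVM discussed here.

```java
import java.nio.charset.StandardCharsets;

final class BomCheck {
    // The UTF-8 encoding of U+FEFF, the byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true when the data starts with the three UTF-8 BOM bytes.
    static boolean hasUtf8Bom(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Correctly encoded UTF-8, but with no marker at the front.
        byte[] plain = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // The same text with U+FEFF prepended, which encodes to EF BB BF.
        byte[] marked = "\uFEFFcaf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println(hasUtf8Bom(plain));  // false
        System.out.println(hasUtf8Bom(marked)); // true
    }
}
```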

As shown in a list of supported encodings, the Sun 1.4.2 JVM does not support adding the BOM to UTF-8 files. Why? I don't know, but that's where the problem lies - because the BOM was missing, ColdFusion (or rather Java) simply did not know that it had to treat the file differently to a "standard" ANSI-encoded file.

Now that I'd worked out the problem, the question was how to get around it. I came up with two possibilities - the first is quick and very dirty, the second longer and more complicated but a better solution. The first idea is that instead of using UTF-8, you use UTF-16. Java supports the BOM on UTF-16 files so that works well, but of course that's not exactly a great solution!
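The difference between the two encoders is easy to see by dumping the bytes each one produces for a single character. This is only a sketch to illustrate the point: the class name EncoderBomDemo and the hex helper are made up for the example, and it uses StandardCharsets from a modern JDK.

```java
import java.nio.charset.StandardCharsets;

final class EncoderBomDemo {
    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Java's UTF-8 encoder writes the character only, with no BOM.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));  // 41
        // Java's "UTF-16" encoder writes a big-endian BOM (FE FF) first.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16))); // FE FF 00 41
    }
}
```

The UTF-16 output also shows why the workaround is unattractive: plain ASCII text takes two bytes per character.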

The second solution is to write the file not using <cffile> but instead using Java from CF, and manually writing the BOM to the beginning of the stream. I've not found any problems with this method - it generates a valid UTF-8 file that ColdFusion (Java) can understand and <cfinclude> absolutely fine! Hooray! Below is a small snippet which shows how to write a UTF-8 file with the correct BOM:
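The original snippet is not included in this copy of the article, so what follows is an illustrative sketch in plain Java, not the author's ColdFusion code. It shows the same idea: write the three BOM bytes to the raw stream first, then write the text through a UTF-8 writer. The class name Utf8BomWriter and its write method are hypothetical, and the code assumes a modern JDK (java.nio.file, try-with-resources) rather than 1.4.2.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8BomWriter {
    // The UTF-8 byte order mark: EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Writes text to target as UTF-8, preceded by the BOM.
    static void write(Path target, String text) throws IOException {
        try (OutputStream out = Files.newOutputStream(target);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            // The BOM goes straight to the raw stream, before any text.
            out.write(UTF8_BOM);
            // The writer encodes the text as UTF-8; it is flushed on close.
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("utf8-bom-demo", ".txt");
        write(target, "caf\u00e9");

        // Read the file back and show that it starts with the BOM.
        byte[] bytes = Files.readAllBytes(target);
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // EF BB BF
        Files.delete(target);
    }
}
```

From ColdFusion, the same java.io classes could be instantiated with createObject("java", ...) and driven in the same order: BOM bytes first, then the encoded text.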
