Summary

Headers must be written before any data is sent to the client. Unicode files may include a Byte-Order Mark (BOM) to help distinguish the big endian and little endian byte order. Unfortunately, the BOM isn't understood by PHP. Upon encountering the BOM, PHP assumes that it is dealing with data, by which time it's too late to modify headers. Solution? Save the file in UTF-8 encoding without a BOM.

Author: Gez Lemon

Yesterday, I stumbled across a problem when saving files in UTF-8 format. When I copied the script to the server, I received a warning stating that the headers could not be modified as they had already been sent. I was sure that the headers hadn't already been written, but as far as PHP was concerned, they had. Having argued unsuccessfully with machines in the past, I decided to believe it, but couldn't understand how they could have been written. Saving the same file in regular ANSI solved the header problem, but the characters no longer displayed correctly, as the file was no longer in the encoding the page declared.

Simon Carlisle emailed me a link to a document called Guide To Unicode; Part 3 of the article mentions the type of problem I encountered. The article mentions that if a UTF file is incorrectly declared as ISO-8859-1 by the HTTP response headers, a Byte Order Mark (BOM) will be interpreted as data by Apache, by which time the headers should have been written; hence the warnings I was receiving. What I couldn't understand is why the HTTP headers were incorrectly declaring ISO-8859-1.
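To check whether a particular file really does start with a BOM, a sketch along the following lines could be used. It assumes the file is UTF-8, where the BOM is the byte sequence EF BB BF, and PHP 7 or later for the type declarations. The name hasUtf8Bom is a hypothetical helper, not a built-in PHP function.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether the file at $path
// begins with the three-byte UTF-8 Byte-Order Mark, EF BB BF.
function hasUtf8Bom(string $path): bool
{
    if (!is_readable($path)) {
        return false;
    }

    // Open in binary mode so the bytes are read exactly as stored
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    $firstBytes = fread($handle, 3);
    fclose($handle);

    return $firstBytes === "\xEF\xBB\xBF";
}
```

PHP's own headers_sent() function can also fill in the file name and line number where output started, which in this situation should point at the very start of the affected script.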

The BOM exists to distinguish the big endian and little endian byte order for UTF-16; in a UTF-8 file, it is encoded as three bytes (EF BB BF). As there's no requirement for UTF-8 to distinguish between big endian and little endian byte order, there's no reason to include a BOM; particularly if it's being interpreted as data on the server. Stupidly, my editor of choice is Notepad, which doesn't have an option to save as UTF-8 without a BOM. In a desperate attempt, I wrote a simple script to remove the first three bytes from the UTF-8 file, to see if the BOM was definitely the problem in my case.

<?php
  $strInput = "colourcontrast-es.php";
  $strOutput = "new-es.php";

  // Open both files in binary mode so that no bytes are translated in transit
  if ($objInput = fopen($strInput, "rb"))
  {
    if ($objOutput = fopen($strOutput, "wb"))
    {
      $iByteCounter = 0;
      // Copy the file, ignoring the first three bytes
      while (($cByte = fgetc($objInput)) !== false)
        if ($iByteCounter++ > 2)
          fwrite($objOutput, $cByte);

      echo "<p>Removed Byte-Order Mark from $strInput into $strOutput.</p>\n";
      fclose($objOutput);
    }
    else
      echo "<p>Can't open the output file</p>\n";
    fclose($objInput);
  }
  else
    echo "<p>Can't open the input file</p>\n";
?>

I copied the resulting file to the server, and to my amazement, it worked. I'm pleased that I've at least found a solution to the problem, but it would be much easier to use an editor that provides an option to save UTF-8 without the BOM. Any suggestions?
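The script above drops the first three bytes unconditionally, so running it on a file that has no BOM would cut off the opening of the file. A slightly more cautious variant, sketched below, only removes the bytes when they match the UTF-8 BOM (EF BB BF). The function name stripUtf8Bom is hypothetical, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of PHP): returns $contents without a leading
// UTF-8 BOM. Contents that do not start with a BOM are returned unchanged.
function stripUtf8Bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF";

    if (strncmp($contents, $bom, strlen($bom)) === 0) {
        // The cast guards against older PHP versions, where substr() can
        // return false when nothing follows the BOM
        return (string) substr($contents, strlen($bom));
    }

    return $contents;
}

// Possible usage with the file names from the script above:
// file_put_contents("new-es.php", stripUtf8Bom(file_get_contents("colourcontrast-es.php")));
```

Reading the whole file into memory is fine for page-sized scripts; for very large files, the byte-by-byte copy above would be the safer starting point.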

Category: Programming.

Comments

  1. [utf-byte-order-mark.php#comment3]

    Thank you Pam and Hans. TextEdit and BabelPad both do the job well, but I think I prefer BabelPad as the interface is a lot more friendly.

    Both good finds though. Thank you both *smile*

    Posted by Gez on

  2. [utf-byte-order-mark.php#comment4]

    I totally agree, Gez. BabelPad is much more user-friendly *smile*

    Thanks, Hans!

    Posted by Pam on

  3. [utf-byte-order-mark.php#comment5]

    This morning, I was reading an article at a relatively new site Content With Style http://www.contentwithstyle.co.uk/ , it brought up some of the issues with the BOM.

    *UTF-8: Documents with a lot of character*
    http://tinyurl.com/d5oex

    Sorry about the tinyurl, however the link was too long for the comment and was being cropped.

    Good to see you back up and running, Gez.

    Posted by holly on

  4. [utf-byte-order-mark.php#comment6]

    Hey Gez, an old student of the ND here, probably won't remember me. I usually lurk but I thought I'd post my preference of text editor (supports UTF-8 and Unicode):

    http://www.vim.org/

    I've been writing PHP with it for ages now and have never encountered these BOM problems you've been having.

    Posted by Matthew on

  5. [utf-byte-order-mark.php#comment7]


    Hi Holly,

    This morning, I was reading an article at a relatively new site Content With Style http://www.contentwithstyle.co.uk/ it brought up some of the issues with the BOM.

    Cool, thank you for the heads up. The article mentions that some browsers have trouble with BOM, but I found it was Apache itself that was having problems with it. As soon as the BOM was removed, Apache worked fine. I've been using BabelPad to edit files, and that has an option to save without BOM, and seems to be working well.

    Sorry about the tinyurl, however the link was too long for the comment and was being cropped.

    Thank you for the heads up about this, too. I've fixed the URL cropping.

    Good to see you back up and running, Gez.

    Thank you *smile* I can't believe how stressful it was losing my website, but I'm enjoying putting it back together. Hopefully, I won't have to go through that again, as the host I've chosen is very good. If I do have to move, for whatever reason, it should be less stressful as there seems to be far more good Apache based hosts than IIS.

    Posted by Gez on

  6. [utf-byte-order-mark.php#comment8]

    Hi Matthew,

    Hey Gez, an old student of the ND here, probably won't remember me.

    I probably would remember you, but there were a lot of Matthews. I think one class had about 4 Matthews in alone, and there were Matthews in some of the other classes, too *smile*

    I usually lurk but I thought I'd post my preference of text editor (supports UTF-8 and Unicode):

    http://www.vim.org/

    Thank you, I'll check it out.

    Cheers,

    Posted by Gez on

  7. [utf-byte-order-mark.php#comment10]

    Thank you, Zcorpan, that is really cool. Syntax highlighting, partial support for regular expressions, multiple undo and redo, and UTF without the BOM *smile* This is definitely the best editor I've tried so far.

    The one criticism is that it doesn't seem to use the same encoding that the file is saved in. Each time I edit a file, I have to change the encoding, even though I saved it with that encoding using Notepad 2. I've set the default to UTF-8, but that doesn't seem to have made much difference. Very nice find, though - thank you for suggesting it *smile*

    Posted by Gez on

  8. [utf-byte-order-mark.php#comment12]

    You should see "Why are my UTF-8 files loaded as ANSI?"

    They mention that it would be treated as UTF if the BOM is included. That makes sense, but then we're back with the problem of Apache thinking that the content has started. I can live with putting a comment at the top of all files that contains an extended character. That seems to work well, and makes it perfect as an editor.

    Thanks again for the suggestion.

    Posted by Gez on

  9. [utf-byte-order-mark.php#comment14]

    I believe the BOM is 3 bytes long.

    Thank you, James. You're right, it is 3 bytes long. I mistakenly thought it was 2, but it didn't work until I removed 3 bytes. I forgot to update the text to 3 bytes, but I've updated it now.

    Many thanks.

    Posted by Gez on

  10. [utf-byte-order-mark.php#comment15]

    I can't believe no one has mentioned this before: my favourite editor, SciTE: http://www.scintilla.org/SciTE.html
    It supports numerous formats. Encodings are available in the file menu. The BOM format is called "UTF-8", and the non-BOM is called "UTF-8 cookie".
    As a related issue, I use
    <?php header('Content-Type: text/html; charset=utf-8'); ?>
    and
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    to bypass Apache settings and to help browsers auto-detect the encoding if the worst happens (the pages are forcibly served with the wrong encoding)

    Posted by DanTe on

  11. [utf-byte-order-mark.php#comment16]


    Did you ever consider the option of using the output buffer? Then the whole problem is actually obsolete I guess, 'cuz redirects and headers are definitely sent before ... which just leaves the problem that the page in Mac IE will go tits up cuz the BOM is outputted. But you can replace that in the ob_callback function.

    Posted by Pascal Opitz on

  12. [utf-byte-order-mark.php#comment17]

    Hi Pascal,

    I didn't consider the output buffer, as I'm new to PHP, but it does look to be a good option.

    Thank you

    Posted by Gez on

Comments are closed for this entry.
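
The output-buffer approach suggested in comment 11 might look something like the sketch below. It is only an illustration: removeLeadingBom is a hypothetical callback name, and the echoed string stands in for a page that was saved with a BOM. One caveat is that ob_start() has to run before the BOM is output, so this helps when the BOM sits in an included file, or when output buffering is enabled in php.ini; a BOM at the very top of the entry script is emitted before any PHP code can run.

```php
<?php
// Hypothetical callback (not part of PHP): strips a leading UTF-8 BOM
// (EF BB BF) from the buffered output before it is sent to the client.
function removeLeadingBom(string $buffer): string
{
    $bom = "\xEF\xBB\xBF";

    if (strncmp($buffer, $bom, strlen($bom)) === 0) {
        return (string) substr($buffer, strlen($bom));
    }

    return $buffer;
}

// Start buffering; the callback runs when the buffer is flushed
ob_start('removeLeadingBom');

// Stand-in for a page (or included file) that was saved with a BOM
echo "\xEF\xBB\xBF<p>Contraste de color</p>\n";

// Headers can still be set here, because nothing has left the buffer yet
header('Content-Type: text/html; charset=utf-8');

// Pass the buffer through the callback and send it
ob_end_flush();
```

Saving the file without a BOM remains the simpler fix; the buffer mainly helps when the files can't easily be re-saved.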