Sep 16

Item 547207

You host a web shop that sells classic albums. But the database that provides the content for the shop is a classic too and still uses ISO-8859-1 encoding.

The data is stored on a different server from the shop's front-end server. The front-end server retrieves XML data by sending HTTP requests to the data server. The data server uses PHP to generate content; an example of what it might send in response to a request is given below:

<?php
header("Content-type: text/xml");
echo "<?xml version='1.0' encoding='ISO-8859-1'?>";
?>
<album>
        <title>Al Verte las Flores Lloran</title>
        <artist>Camarón de la Isla and Paco de Lucía</artist>
        <tracks>
                <track no="1">
                        <title>Al verte las flores lloran</title>
                        <palo>Bulerías</palo>
                </track>
                <track no="2">
                        <title>En una piedra me acosté</title>
                        <palo>Fandangos</palo>
                </track>
                <track no="3">
                        <title>Anda y no presumas más</title>
                        <palo>Bulerías por soleá</palo>
                </track>
        </tracks>
</album>

One day a new system administrator is appointed who decides that it is time to get up to date. So he or she edits the php.ini file on the data server, adds the line default_charset = utf-8 to it and restarts the server.

What happens?

A: Nothing because the encoding is given in the file.
B: Nothing, but the data sent by the server will now be valid UTF-8 encoded data.
C: All the strange characters will be displayed like "CamarÃ³n" or "acostÃ©".
D: All the strange characters will be displayed like "Camar�n" or "acost�".
E: The front-end server will not be able to parse the data.

Answer

The change in the ini setting causes PHP to append ;charset=UTF-8 to every Content-type header in which the charset is not explicitly set.
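To make that rule concrete, here is a minimal sketch that models it in plain PHP. It is a simplified model, not PHP's actual implementation: the function name applyDefaultCharset is made up for this illustration, and the restriction to text/* types is my understanding of PHP's behaviour rather than something stated above.

```php
<?php
// Simplified model (an assumption, not PHP's real code) of what the
// default_charset setting does to a Content-type value passed to header().
function applyDefaultCharset(string $contentType, string $defaultCharset): string
{
    // Assumption: only text/* types without an explicit charset are touched.
    $isText     = stripos(ltrim($contentType), 'text/') === 0;
    $hasCharset = stripos($contentType, 'charset=') !== false;

    if ($defaultCharset !== '' && $isText && !$hasCharset) {
        return $contentType . ';charset=' . $defaultCharset;
    }
    return $contentType;
}

// The header from the example script: the charset gets appended.
echo applyDefaultCharset('text/xml', 'utf-8'), "\n";

// A header that names its charset explicitly is left alone.
echo applyDefaultCharset('text/xml; charset=ISO-8859-1', 'utf-8'), "\n";
```

The second call hints at the way out: if the script had sent its charset explicitly in the header, the new ini setting would not have changed anything.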

Although the encoding was declared in the file on line 3 (in the XML declaration), it was not set in the Content-type header. Since the header prevails over the declaration in the content, and the header will now be sent as Content-Type: text/xml; charset=UTF-8, the client will assume that the data sent is UTF-8 encoded. But the actual content of the file is still ISO-8859-1 encoded; PHP does not change that automatically.

So answers A and B are false and we have a problem. I'm sure you've seen the examples of malformed strings given in answers C and D. You typically see things like "CamarÃ³n" or "acostÃ©" when UTF-8 data is displayed in a single-byte encoding scheme such as ISO-8859-1: multi-byte UTF-8 characters are displayed as sequences of single-byte characters. And "Camar�n" or "acost�" is typically what you see when data in a single-byte encoding is displayed as UTF-8: bytes with a value higher than 127 that do not form a valid UTF-8 sequence are displayed as the Unicode replacement character (U+FFFD).
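Both mix-ups can be reproduced at the byte level. The sketch below uses only core PHP functions; latin1ToUtf8 is a hypothetical helper written for this illustration, and the byte strings are spelled out with escapes so that the result does not depend on the encoding of the source file.

```php
<?php
// Hypothetical helper: re-encode a string of ISO-8859-1 bytes as UTF-8.
// Every byte above 0x7F becomes a two-byte UTF-8 sequence.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        $out .= $code < 0x80
            ? $byte
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}

$utf8   = "Camar\xC3\xB3n"; // "Camarón" as UTF-8 bytes
$latin1 = "Camar\xF3n";     // "Camarón" as ISO-8859-1 bytes

// UTF-8 bytes misread as ISO-8859-1: each byte becomes its own character.
echo latin1ToUtf8($utf8), "\n"; // CamarÃ³n

// ISO-8859-1 bytes misread as UTF-8: the lone byte 0xF3 is not a valid
// UTF-8 sequence, so a UTF-8 aware function rejects the string.
var_dump(preg_match('//u', $latin1) === 1); // bool(false)
var_dump(preg_match('//u', $utf8) === 1);   // bool(true)
```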

So answer C is incorrect too: in our case the data is still ISO-8859-1 encoded but it will be interpreted as UTF-8.

The correct answer is D or E, depending on how forgiving the XML parser of the front-end server is. If it was written by someone with lots of empathy, the invalid UTF-8 byte sequences may be replaced by the replacement character, as in answer D. If it was written by a hardliner, you are sure to get a parse error.
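The difference between the two kinds of parser can be sketched with two small functions. Both are hypothetical stand-ins written for this illustration (strictDecode and lenientDecode are not part of any real XML library), and they only look at the character encoding, whereas a real parser obviously does much more.

```php
<?php
// Hypothetical "hardliner": refuses input that is not valid UTF-8.
function strictDecode(string $bytes): string
{
    if (preg_match('//u', $bytes) !== 1) {
        throw new UnexpectedValueException('Input is not valid UTF-8');
    }
    return $bytes;
}

// Hypothetical "empathic" consumer: keeps well-formed UTF-8 sequences and
// replaces every other byte with the replacement character U+FFFD.
function lenientDecode(string $bytes): string
{
    $valid = '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    return (string) preg_replace_callback(
        '/(' . $valid . ')|./s',
        fn (array $m): string => ($m[1] ?? '') !== '' ? $m[0] : "\u{FFFD}",
        $bytes
    );
}

// ISO-8859-1 bytes that the header wrongly announces as UTF-8.
$body = "<artist>Camar\xF3n de la Isla</artist>";

echo lenientDecode($body), "\n"; // answer D: <artist>Camar�n de la Isla</artist>

try {
    strictDecode($body);
} catch (UnexpectedValueException $e) {
    echo 'Parse error: ', $e->getMessage(), "\n"; // answer E
}
```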