7th February 2010

Many applications I work on need to read and parse third-party XML streams. These streams often originate as user-entered data and so end up containing invalid characters. Of course, this should never happen, but it does, and when it does you need to be ready for it.

It surprised me to learn that not all valid UTF-8 characters are valid in well-formed XML. Between that and remote applications letting user-entered non-UTF-8 data into the stream, I wrote the function below to clean up the data.

It is self-contained: it lets through the byte sequences that are both valid UTF-8 and valid XML, and catches everything else. That includes non-printable control characters (all except tab, line feed and carriage return), which are replaced with a placeholder, and extended ASCII bytes, which it attempts to transform to UTF-8. It also picks out the Unicode ranges that the W3C recommends avoiding, so that they can be handled separately if you wish.
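To make the "valid UTF-8 but not valid XML" point concrete before the function itself, here is a minimal sketch of the character ranges that the XML 1.0 recommendation allows, written as a code-point check. It is only an illustration and is separate from the function in this post; the name xml_char_allowed is made up for this example.

```php
<?php
// Hypothetical helper, not part of clean_utf8_xml_string(): reports whether
// a Unicode code point is permitted by the XML 1.0 "Char" production.
function xml_char_allowed($codepoint)
{
    return $codepoint === 0x09
        || $codepoint === 0x0A
        || $codepoint === 0x0D
        || ($codepoint >= 0x20 && $codepoint <= 0xD7FF)
        || ($codepoint >= 0xE000 && $codepoint <= 0xFFFD)
        || ($codepoint >= 0x10000 && $codepoint <= 0x10FFFF);
}

// U+0008 (backspace) is a perfectly valid one-byte UTF-8 character,
// yet it is not allowed anywhere in an XML 1.0 document.
var_dump(xml_char_allowed(0x08));   // bool(false)
var_dump(xml_char_allowed(0x0A));   // bool(true)
var_dump(xml_char_allowed(0xFFFE)); // bool(false)
```

The function in this post works on raw bytes rather than decoded code points, which is why its patterns look rather different from these ranges.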

<?php
/**
 * Name: clean_utf8_xml_string
 * Purpose: to remove or transform bytes or characters in a UTF-8 stream that
 * will cause problems when parsed as XML. Not every UTF-8 character is a valid
 * XML character.
 * Author: Jason Judge
 * Licence: GPL V3
 * Created: 2010-01-07
 *
 * Takes a UTF-8 string and replaces any character that is not valid in an XML document.
 *
 * Note this does not require PCRE unicode libraries, which are often a problem on hosted
 * servers.
 *
 * Inspiration taken from http://stackoverflow.com/questions/1401317/remove-non-uft8-characters-from-string
 */

function clean_utf8_xml_string($matches)
{
    // This part handles the callback: preg_replace_callback() passes the match groups as an array.
    if (is_array($matches)) {
        if (isset($matches[1]) && $matches[1] !== '') {
            // Compatibility characters.
            // Return as-is for now, but could map it to another character.
            return $matches[1];
        } elseif (isset($matches[2]) && $matches[2] !== '') {
            // Valid UTF-8 for XML.
            return $matches[2];
        } elseif (isset($matches[3]) && $matches[3] !== '') {
            // Invalid single-byte characters.
            // Instead of removing these, we can assume they are another character set and map them.
            // Assume they are ISO8859-1 for now, but this could be parameterized.
            return iconv('ISO-8859-1', 'UTF-8', $matches[3]);
        } elseif (isset($matches[4]) && $matches[4] !== '') {
            // Control characters - no mappings - so return a replacement character.
            // You may wish to return something different, or nothing at all.
            return '?';
        }
    }

    // This part handles the first instance: called directly with the string to clean.
    if (is_string($matches)) {
        return preg_replace_callback('/'
            // Ranges recommended to avoid - "compatibility characters" - written as UTF-8 byte
            // sequences: U+007F-U+0084, U+0086-U+009F, U+FDD0-U+FDEF and the last two
            // code points (U+nFFFE, U+nFFFF) of planes 1 to 16.
            // See http://www.w3.org/TR/REC-xml/ for the character ranges.
            . '(\x7F|\xC2[\x80-\x84\x86-\x9F]|\xEF\xB7[\x90-\xAF]'
            . '|\xF0[\x9F\xAF\xBF]\xBF[\xBE\xBF]|[\xF1-\xF3][\x8F\x9F\xAF\xBF]\xBF[\xBE\xBF]|\xF4\x8F\xBF[\xBE\xBF])'

            // Valid UTF-8 that is also valid XML: tab, LF, CR, printable ASCII and well-formed
            // multi-byte sequences, excluding overlong forms, surrogates, U+FFFE and U+FFFF.
            . '|([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]'
            . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
            . '|\xEF[\x80-\xBE][\x80-\xBF]|\xEF\xBF[\x80-\xBD]'
            . '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})'

            // Invalid single-byte characters which are likely to be extended ASCII and may be convertible to UTF-8 equivalents.
            . '|([\x80-\xFF])'

            // Fall-through - whatever is left, which should be single-byte control characters.
            . '|(.)'

            // If this is used as a static method, then replace __FUNCTION__ with __METHOD__
            // (which already expands to "ClassName::methodName").
            . '/', __FUNCTION__, $matches
        );
    }
}

?>
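The code comment about assuming ISO-8859-1 deserves a quick illustration, because the choice of source character set changes what a stray byte turns into. The sketch below is not part of the function above: the helper name stray_byte_to_utf8 and the use of Windows-1252 as an alternative are assumptions made for this example, and it relies on the local iconv build recognising that charset name.

```php
<?php
// Hypothetical helper: convert one stray high byte to UTF-8, with the
// source character set passed in rather than hard-coded.
function stray_byte_to_utf8($byte, $charset = 'ISO-8859-1')
{
    $converted = @iconv($charset, 'UTF-8', $byte);

    // iconv() returns false when it cannot convert the byte from $charset;
    // fall back to the same '?' replacement used for control characters.
    return ($converted === false) ? '?' : $converted;
}

// Byte 0x93 is a C1 control character in ISO-8859-1 but a left double
// quotation mark in Windows-1252.
echo bin2hex(stray_byte_to_utf8("\x93")), "\n";                 // c293
echo bin2hex(stray_byte_to_utf8("\x93", 'Windows-1252')), "\n"; // e2809c
```

If most of your bad input comes from Windows desktops, the second mapping may give more readable results, but that is a judgement call about your data rather than something the function can detect.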

The function is downloadable here:

clean_utf8_xml_string.php

Responses

  1. leon says: 25th May 2010 at 11:39 am

    This is superb thanks for sharing this, it’s made my day.

    I had been hit by this problem using simple_xml with some third-party XML, which it refused to parse due to the invalid characters. Just using iconv or utf8_encode did not do the job, and like you I was also surprised to learn about the well-formed UTF-8 problem. Other solutions I found removed too many characters, but yours worked a treat.

    Keep up the blogging and thanks again.

  2. Jason says: 26th May 2010 at 11:25 am

    Glad you found it useful Leon. If you find any bugs, just post them here and I’ll get them included.

    Now I’ve come back I realise the code is not formatting properly in the post, but the downloadable version should still be okay.

    – Jason

  3. teo says: 14th August 2010 at 9:19 pm

    Thank you so much for this!
    It really helped with preparing some strings for importing into Magento using its SOAP API.
    Really feel like the poster above! :)
