Cleaning UTF-8 Streams for XML Parsing
Many applications I work on need to read and parse third-party XML streams. These streams often originate as user-entered data and end up containing invalid characters. Of course, this should never happen, but it does, and when it does, you need to be ready for it.
It surprised me to learn that not every valid UTF-8 character is valid in well-formed XML. Given that fact, and given remote applications that let user-entered non-UTF-8 data into the stream, I wrote the following function to clean up the data.
It is self-contained: it passes valid UTF-8 ranges through and handles anything outside them. That covers non-printable control characters (other than tab, line feed and carriage return), which are replaced with a placeholder; extended ASCII, which it attempts to transform to UTF-8; and the Unicode ranges that the W3C XML recommendation discourages, which are matched separately and, for now, passed through unchanged.
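To make the XML side of this concrete: the XML 1.0 recommendation defines a Char production listing the code points a document may contain. Here is a minimal sketch of that rule, working on code points rather than bytes. The name is_xml10_char is my own illustration; it is separate from the cleaning function below and is not called by it.

```php
<?php
/**
 * Hypothetical helper (illustration only): reports whether a Unicode code
 * point is permitted by the XML 1.0 "Char" production:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 */
function is_xml10_char($codePoint)
{
    return $codePoint === 0x09                                   // tab
        || $codePoint === 0x0A                                   // line feed
        || $codePoint === 0x0D                                   // carriage return
        || ($codePoint >= 0x20 && $codePoint <= 0xD7FF)          // up to the surrogate block
        || ($codePoint >= 0xE000 && $codePoint <= 0xFFFD)        // excludes U+FFFE and U+FFFF
        || ($codePoint >= 0x10000 && $codePoint <= 0x10FFFF);    // supplementary planes
}
```

For example, U+000B (vertical tab) is a perfectly valid one-byte UTF-8 character, yet is_xml10_char(0x0B) is false, while is_xml10_char(0x0A) is true.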
[sourcecode language="php"]<?php
/**
 * Name: clean_utf8_xml_string
 * Purpose: to remove or transform bytes or characters in a UTF-8 stream that
 * will cause problems when parsed as XML. Not every UTF-8 character is a valid
 * XML character.
 * Author: Jason Judge
 * Licence: GPL V3
 * Created: 2010-01-07
 *
 * Takes a UTF-8 string and replaces any character that is not valid in an XML document.
 *
 * Note this does not require PCRE unicode libraries, which are often a problem on hosted
 * servers.
 *
 * Inspiration taken from http://stackoverflow.com/questions/1401317/remove-non-uft8-characters-from-string
 */
function clean_utf8_xml_string($matches)
{
    // This part handles the callback.
    if (is_array($matches)) {
        if (isset($matches[1]) && $matches[1] !== '') {
            // Compatibility characters.
            // Return as-is for now, but could map it to another character.
            return $matches[1];
        } elseif (isset($matches[2]) && $matches[2] !== '') {
            // Valid UTF-8 for XML
            return $matches[2];
        } elseif (isset($matches[3]) && $matches[3] !== '') {
            // Invalid single-byte characters.
            // Instead of removing these, we can assume they are another character set and map them.
            // Assume they are ISO8859-1 for now, but this could be parameterized.
            return iconv('ISO-8859-1', 'UTF-8', $matches[3]);
        } elseif (isset($matches[4]) && $matches[4] !== '') {
            // Control characters - no mappings - so return a replacement character.
            // You may wish to return something different, or nothing at all.
            return '?';
        }
    }

    // This part handles the first instance.
    // Note: without the /u modifier, the \xNN escapes below match single bytes.
    if (is_string($matches)) {
        return preg_replace_callback('/'
            // Ranges recommended to avoid - "compatibility characters".
            // See http://www.w3.org/TR/REC-xml/ for the character ranges.
            . '([\x7F-\x84]|[\x86-\x9F]|[\xFD][\xD0-\xEF]|[\x1F\x2F\x3F\x4F\x5F\x6F\x7F\x8F\x9F\xAF\xBF\xCF\xDF\xEF\xFF\x10][\xFF][\xFE\xFF])'
            // Broad valid UTF-8 multi-byte ranges.
            . '|([\x09\x0A\x0D]|[\x20-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})'
            // Invalid single-byte characters which are likely to be extended ASCII and may be convertible to UTF-8 equivalents.
            . '|([\x80-\xBF]|[\xC0-\xFF])'
            // Fall-through - whatever is left, which should be single-byte control characters.
            . '|(.)'
            // If this is used as a static method, then replace __FUNCTION__ with __METHOD__
            // (which already expands to ClassName::methodName).
            . '/', __FUNCTION__, $matches
        );
    }
}
?>
[/sourcecode]
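The third branch leans on iconv() to treat a stray high byte as ISO-8859-1. If iconv is not available on your host, that particular mapping is simple enough to write by hand. This is only a sketch under the same ISO-8859-1 assumption; the name latin1_byte_to_utf8 is made up for illustration and the helper is not part of the function above.

```php
<?php
/**
 * Hypothetical helper (illustration only): converts exactly one byte,
 * assumed to be ISO-8859-1, into its UTF-8 encoding without using iconv.
 */
function latin1_byte_to_utf8($byte)
{
    $code = ord($byte);

    // ASCII bytes are encoded identically in UTF-8.
    if ($code < 0x80) {
        return $byte;
    }

    // ISO-8859-1 bytes 0x80-0xFF are code points U+0080-U+00FF, which
    // take two bytes in UTF-8: 110000xx 10xxxxxx.
    return chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
}
```

With this, latin1_byte_to_utf8("\xE9") gives "\xC3\xA9", the UTF-8 encoding of é. A different source character set, such as Windows-1252, would need its own lookup table for the 0x80-0x9F range, so this is not a general replacement for iconv().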
It is downloadable here: