7th February 2010

Many applications I work on need to read and parse third-party XML streams. These streams often originate with user-entered data and so end up containing invalid characters. Of course, this should never happen, but it does, and when it does you need to be ready for it.

It surprised me to learn that not all valid UTF-8 characters are valid in well-formed XML. Given that fact, and given that remote applications let user-entered non-UTF-8 data into the stream, I wrote the following function to clean up the data.

It is self-contained: it allows through the valid UTF-8 ranges and blocks characters that fall outside them. Those include non-printable control characters (except tab, line feed and carriage return), extended ASCII (which it attempts to transform to UTF-8) and other Unicode ranges that the W3C XML specification excludes or discourages.
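For reference, the character rule that the function approximates can be written down by code point. This is only a sketch of the XML 1.0 "Char" production; the helper name is_xml10_char is made up for illustration and is not part of the function below, which works on raw bytes instead.

```php
<?php
// Hypothetical helper (not part of the original function): returns true when
// a Unicode code point is allowed by the XML 1.0 "Char" production.
function is_xml10_char($cp)
{
    return $cp === 0x09 || $cp === 0x0A || $cp === 0x0D   // tab, LF, CR
        || ($cp >= 0x20 && $cp <= 0xD7FF)                 // up to the surrogate block
        || ($cp >= 0xE000 && $cp <= 0xFFFD)               // excludes U+FFFE and U+FFFF
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);           // supplementary planes
}

var_dump(is_xml10_char(0x0A)); // line feed: allowed
var_dump(is_xml10_char(0x0B)); // vertical tab: not allowed
```

Note that the surrogate block (U+D800 to U+DFFF) and the two code points U+FFFE and U+FFFF are excluded even though a lenient UTF-8 decoder may accept their byte sequences.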

<?php
/**
* Name: clean_utf8_xml_string
* Purpose: to remove or transform bytes or characters in a UTF-8 stream that
* will cause problems when parsed as XML. Not every UTF-8 character is a valid
* XML character.
* Author: Jason Judge
* Licence: GPL V3
* Created: 2010-01-07
*
* Takes a UTF-8 string and replaces any character that is not valid in an XML document.
*
* Note this does not require PCRE unicode libraries, which are often a problem on hosted
* servers.
*
* Inspiration taken from http://stackoverflow.com/questions/1401317/remove-non-uft8-characters-from-string
*/

function clean_utf8_xml_string($matches)
{
    // This part handles the callback.
    if (is_array($matches)) {
        if (isset($matches[1]) && $matches[1] !== '') {
            // Compatibility characters.
            // Return as-is for now, but could map it to another character.
            return $matches[1];
        } elseif (isset($matches[2]) && $matches[2] !== '') {
            // Valid UTF-8 for XML
            return $matches[2];
        } elseif (isset($matches[3]) && $matches[3] !== '') {
            // Invalid single-byte characters.
            // Instead of removing these, we can assume they are another character set and map them.
            // Assume they are ISO8859-1 for now, but this could be parameterized.
            return iconv('ISO-8859-1', 'UTF-8', $matches[3]);
        } elseif (isset($matches[4]) && $matches[4] !== '') {
            // Control characters - no mappings - so return a replacement character.
            // You may wish to return something different, or nothing at all.
            return '?';
        }
    }

    // This part handles the first instance.
    if (is_string($matches)) {
        return preg_replace_callback('/'
            // Ranges recommended to avoid - "compatibility characters".
            // See http://www.w3.org/TR/REC-xml/ for the character ranges.
            // The pattern has no /u modifier, so it matches bytes; the ranges are
            // therefore written as UTF-8 byte sequences: U+007F, U+0080-U+0084,
            // U+0086-U+009F, U+FDD0-U+FDEF and U+nFFFE/U+nFFFF in planes 1 to 16.
            . '(\x7F|\xC2[\x80-\x84\x86-\x9F]|\xEF\xB7[\x90-\xAF]'
            . '|\xF0[\x9F\xAF\xBF]\xBF[\xBE\xBF]|[\xF1-\xF3][\x8F\x9F\xAF\xBF]\xBF[\xBE\xBF]|\xF4\x8F\xBF[\xBE\xBF])'

            // Broad valid UTF-8 multi-byte ranges.
            . '|([\x09\x0A\x0D]|[\x20-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})'

            // Invalid single-byte characters which are likely to be extended ASCII and may be convertible to UTF-8 equivalents.
            . '|([\x80-\xBF]|[\xC0-\xFF])'

            // Fall-through - whatever is left, which should be single-byte control characters.
            . '|(.)'

            // If this is used as a static method, then replace __FUNCTION__ with __METHOD__,
            // which already expands to 'ClassName::methodName'.
            . '/', __FUNCTION__, $matches
        );
    }
}

?>

It is downloadable here:

clean_utf8_xml_string.php
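The third branch of the function assumes that stray single bytes are ISO-8859-1, and its comment notes that this could be parameterized. A minimal sketch of that idea follows; the helper name legacy_bytes_to_utf8 and its $from parameter are hypothetical, and it assumes the local iconv build knows the 'CP1252' (Windows-1252) charset name.

```php
<?php
// Hypothetical helper (not part of the original function): converts bytes from
// an assumed single-byte legacy encoding to UTF-8.
function legacy_bytes_to_utf8($bytes, $from = 'ISO-8859-1')
{
    // iconv() returns false (and raises a notice) when a byte has no mapping
    // in the source encoding, so the notice is suppressed and checked here.
    $converted = @iconv($from, 'UTF-8', $bytes);

    // Fall back to the same '?' placeholder the main function uses.
    return $converted === false ? '?' : $converted;
}

// 0xE9 is e-acute in ISO-8859-1; its UTF-8 form is the bytes c3 a9.
echo bin2hex(legacy_bytes_to_utf8("\xE9")), "\n";

// 0x80 is the euro sign in Windows-1252; its UTF-8 form is the bytes e2 82 ac.
echo bin2hex(legacy_bytes_to_utf8("\x80", 'CP1252')), "\n";
```

Windows-1252 is often a better guess than ISO-8859-1 for text pasted from desktop applications, because bytes 0x80 to 0x9F then map to printable characters such as curly quotes rather than to control characters.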


6th February 2010

I came across this neat Windows application today. It is basically a tool for creating customised Windows installation disks.

I’ve seen similar tools for slipstreaming service packs into disks, but this one goes a step further and makes it dead easy to include drivers and various other Windows tweaks. It is all very straightforward to use:

  1. Point it at the original Windows disk.
  2. Tell it what you want to change (point it at service packs, drivers, etc.)
  3. Tell it to generate an ISO.
  4. Press GO.

The application will then spit out an ISO image that you burn to CD-ROM. I’m sure there are ways to install it over the network without burning a CD too.

I found it while looking for a way to reinstall XP onto a Dell machine with a SATA device that the default Windows disk did not recognise.

nlite – http://www.nliteos.com/ – a great application for your toolbox.
