cut-and-paste technique for complex regexps

07 March 2008 // php. stuff.

You often need to transform some text while leaving specific parts of it untouched. The classic example is replacing something in HTML without touching the tags.

 

Consider the following:

<span class='foo'>the pic of my class mate</span>

We need to replace "class" with "school". The naive

$html = preg_replace('/class/', 'school', $html);

will obviously break the markup, because it also rewrites the class attribute. We could use a more sophisticated expression with lookaround groups. Although that works for a simple case like this one, it can grow very complicated or, given that lookbehinds must be fixed in length, become impossible at some point. For example, how about "remove newlines everywhere except in tags and inside <script> or <pre>"?
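For the simple case, the lookaround version might look like the sketch below. The lookahead pattern is my own illustration, not taken from any library: it refuses a match when the next angle bracket to the right is a closing ">", which is a rough test for "we are inside a tag".

```php
<?php
$html = "<span class='foo'>the pic of my class mate</span>";

// Naive replacement: also rewrites the attribute name inside the tag.
$naive = preg_replace('/class/', 'school', $html);

// Lookahead variant: skip a match if only non-bracket characters
// separate it from a closing ">", i.e. it sits inside a tag.
$guarded = preg_replace('/class(?![^<>]*>)/', 'school', $html);

echo $naive, "\n";
echo $guarded, "\n";
```

This already assumes that no attribute value contains a ">" character, which hints at how quickly such expressions become fragile.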

The alternative method I often use is to replace things step by step: first, remove everything we want to ignore (leaving some 'invisible' markers in the text so that things can be inserted back); then apply the transformation to the rest of the text; and, finally, restore what was removed.
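The three steps can be sketched without any class, applied to the class/school example. This is a minimal illustration that assumes PHP 5.3+ closures and that the byte "\x01" never occurs in the input; the variable names and the simplified tag pattern are made up for the example.

```php
<?php
$html = "<span class='foo'>the pic of my class mate</span>";

// Step 1: cut every tag out, leaving a numbered marker in its place.
// Assumption: "\x01" does not occur in the input text.
$removed = array();
$text = preg_replace_callback('~<[^>]*>~', function ($m) use (&$removed) {
    $removed[] = $m[0];
    return "\x01" . (count($removed) - 1) . "\x01";
}, $html);

// Step 2: transform what is left; the tags are no longer there to break.
$text = str_replace('class', 'school', $text);

// Step 3: put the removed fragments back in place of their markers.
$result = preg_replace_callback("~\x01(\\d+)\x01~", function ($m) use ($removed) {
    return $removed[(int) $m[1]];
}, $text);

echo $result, "\n";
```

Plain digits inside the marker are good enough here, but they would be damaged by a transformation that touches digits.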

Here's a small class that illustrates this technique:

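class Clipboard
{
    // Fragments removed by cut(), in the order they were found.
    private $_buf = array();

    // Converts a positive integer to a string of control characters
    // (digits 0-9 become \x10-\x19), or such a string back to an integer.
    private function _tr($n) {
        static $dc = "0123456789";
        static $sc = "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19";
        // Indexes start at 1, so intval() is 0 only for an encoded string.
        return intval($n) ? strtr((string) $n, $dc, $sc) :
            intval(strtr($n, $sc, $dc));
    }

    // Replaces every match of $regexp in $subject with a marker and stores
    // the matched text. Also serves as its own preg_replace_callback handler,
    // in which case $subject is the array of matches.
    public function cut($subject, $regexp = null) {
        if(is_array($subject)) {
            $this->_buf[] = $subject[0];
            return "\x01" . $this->_tr(count($this->_buf)) . "\x01";
        }
        return preg_replace_callback($regexp,
            array($this, 'cut'), $subject);
    }

    // Replaces every marker in $subject with the fragment stored for it.
    public function paste($subject) {
        if(is_array($subject))
            return $this->_buf[$this->_tr($subject[1]) - 1];
        return preg_replace_callback("~\x01([\x10-\x19]+)\x01~",
            array($this, 'paste'), $subject);
    }
}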
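The marker format is worth a closer look. The class writes the fragment index with control characters instead of digits, presumably so that a transformation touching digits cannot corrupt a marker. The sketch below mirrors both directions of _tr() with two stand-alone helpers; encode_index and decode_index are hypothetical names used only for this illustration.

```php
<?php
// Hypothetical helpers mirroring the two directions of _tr().
function encode_index($n) {
    return strtr((string) $n, "0123456789",
        "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19");
}

function decode_index($s) {
    return (int) strtr($s, "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19",
        "0123456789");
}

// Marker for fragment number 12: \x01, then \x11 \x12, then \x01.
$marker = "\x01" . encode_index(12) . "\x01";
echo bin2hex($marker), "\n";

// A transformation that rewrites digits leaves the marker alone.
$text = preg_replace('/\d+/', '#', "room 12 " . $marker);
echo $text === "room # " . $marker ? "marker intact" : "marker damaged", "\n";
```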

For example, let's uppercase all text nodes in an HTML document. First, put together a regexp that represents an HTML tag.

$html_tag = <<<REGEXP
    ~
        </?\w+
            (
                "[^"]*" |
                '[^']*' |
                [^"'>]+
            )*
        >
    ~sx
REGEXP;
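A quick way to check what this expression matches is to run it through preg_match_all. The sample input below is made up for the illustration; it contains a ">" inside a quoted attribute value, which the quoted-string alternatives are meant to handle.

```php
<?php
$html_tag = <<<REGEXP
    ~
        </?\w+
            (
                "[^"]*" |
                '[^']*' |
                [^"'>]+
            )*
        >
    ~sx
REGEXP;

// Made-up sample: the first attribute value contains a ">" character.
$sample = '<a title="1 > 0" href=\'#\'>link</a>';

preg_match_all($html_tag, $sample, $m);

// $m[0] holds the full matches, one per tag found.
print_r($m[0]);
```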

Instantiate a new Clipboard and cut (remove) all tags:

$c = new Clipboard;
$html = $c->cut($html, $html_tag);

Make the rest uppercase and insert the tags back:

$html = strtoupper($html);
echo $c->paste($html);

For less verbosity, the calls can also be nested:

echo $c->paste(strtoupper($c->cut($html, $html_tag)));
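The same approach handles the newline problem mentioned earlier, where a lookaround solution gets out of hand. The sketch below is self-contained, so it uses a trimmed copy of the class (MiniClipboard is a hypothetical name, and closures with $this need PHP 5.4+). The pattern for protected regions is a simplified assumption, not a full HTML parser.

```php
<?php
// Trimmed copy of the Clipboard idea so the example runs on its own.
class MiniClipboard
{
    private $buf = array();

    public function cut($text, $regexp) {
        return preg_replace_callback($regexp, function ($m) {
            $this->buf[] = $m[0];
            // Index written with control characters \x10-\x19.
            $id = strtr((string) count($this->buf), "0123456789",
                "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19");
            return "\x01" . $id . "\x01";
        }, $text);
    }

    public function paste($text) {
        return preg_replace_callback("~\x01([\x10-\x19]+)\x01~", function ($m) {
            $n = (int) strtr($m[1],
                "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19", "0123456789");
            return $this->buf[$n - 1];
        }, $text);
    }
}

// Whole <script>/<pre> elements first, then any other single tag.
$protected = '~<(script|pre)\b.*?</\1\s*>|<[^>]*>~is';

$html = "<p>one\ntwo</p>\n<pre>keep\nthis</pre>\n<a\nhref='#'>x</a>\n";

$c = new MiniClipboard;
$out = $c->paste(str_replace("\n", '', $c->cut($html, $protected)));

echo $out, "\n";
```

Newlines in text nodes disappear, while the newline inside <pre> and the one inside the <a> tag survive, because both were out of the text when str_replace ran.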

This technique (along with other regexp tricks) is extensively used in makrell.

 