cut-and-paste technique for complex regexps
It's often necessary to transform some text while leaving specific parts of it alone. The classic example is replacing something in HTML without touching the tags.
Consider the following:
<span class='foo'>the pic of my class mate</span>
We need to replace "class" with "school". The naive
$html = preg_replace('/class/', 'school', $html);
will obviously break the formatting, because it also rewrites the class attribute inside the tag. We need a more sophisticated expression with lookaround groups. Although that works for a simple case like this one, it can grow very complicated or, given that lookbehinds are fixed in length, become impossible at some point. For example, how about "remove newlines everywhere except in tags and inside <script> or <pre>"?
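To make the comparison concrete, here is a minimal sketch of the naive replacement next to a lookahead version. The lookahead pattern is only an illustration of the lookaround approach, and it assumes that attribute values never contain < or >.

```php
<?php
$html = "<span class='foo'>the pic of my class mate</span>";

// Naive version: also rewrites the attribute name inside the tag.
$naive = preg_replace('/class/', 'school', $html);

// Lookahead version (illustrative): skip a match when the next angle
// bracket after it is ">", i.e. when we are still inside a tag.
// Assumption: attribute values never contain "<" or ">".
$guarded = preg_replace('/class(?![^<>]*>)/', 'school', $html);

echo $naive, "\n";   // <span school='foo'>the pic of my school mate</span>
echo $guarded, "\n"; // <span class='foo'>the pic of my school mate</span>
```

This is fine for one word in one kind of context, but every extra exception (quoted attribute values, script blocks, comments) has to be squeezed into the same expression.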
The alternative method I often use is to replace things step by step: first, remove everything we want to ignore (leaving 'invisible' markers in the text so that things can be inserted back), then apply the transformation to the rest of the text and, finally, restore what was removed.
Here's a small class that illustrates this technique:
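class Clipboard
{
    // Fragments removed by cut(), in the order they were found.
    var $_buf = array();

    // Encodes a positive number as control characters \x10-\x19, or decodes
    // such a string back to a number. An encoded string has intval() 0,
    // which is how the direction is detected; indexes start at 1.
    function _tr($n) {
        static $dc = "0123456789";
        static $sc = "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19";
        return intval($n) ? strtr($n, $dc, $sc) :
            intval(strtr($n, $sc, $dc));
    }

    // Replaces each match of $regexp with a marker \x01<index>\x01 and
    // stores the matched text. The method is also its own callback, in
    // which case $subject is the array of matches.
    function cut($subject, $regexp = null) {
        if(is_array($subject)) {
            $this->_buf[] = $subject[0];
            return "\x01" . $this->_tr(count($this->_buf)) . "\x01";
        }
        return preg_replace_callback($regexp,
            array($this, 'cut'), $subject);
    }

    // Replaces each marker with the stored text it stands for.
    function paste($subject) {
        if(is_array($subject))
            return $this->_buf[$this->_tr($subject[1]) - 1];
        return preg_replace_callback("~\x01([\x10-\x19]+)\x01~",
            array($this, 'paste'), $subject);
    }
}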
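The marker format is easier to see in isolation. This sketch repeats the digit mapping from _tr() outside the class (the variable names are made up for the example) and checks that a transformation such as strtoupper() leaves a marker unchanged. Control characters are a reasonable choice here because they are unlikely to occur in ordinary text or to be touched by typical text transformations.

```php
<?php
$digits  = "0123456789";
$control = "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19";

// Encode index 12 the way cut() does: each digit becomes one control byte.
$encoded = strtr("12", $digits, $control);
$marker  = "\x01" . $encoded . "\x01";

// Decode it the way paste() does.
$decoded = intval(strtr($encoded, $control, $digits));

echo bin2hex($marker), "\n";               // 01111201
echo $decoded, "\n";                       // 12
var_dump(strtoupper($marker) === $marker); // bool(true)
```

Note that the scheme relies on the input not already containing \x01 bytes, and on the transformation not altering them.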
For example, let's uppercase all text nodes in some HTML document. First, put together a regexp that represents an HTML tag.
$html_tag = <<<REGEXP
~
</?\w+
(
"[^"]*" |
'[^']*' |
[^"'>]+
)*
>
~sx
REGEXP;
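Before cutting anything, it can help to check what the pattern actually matches. The following sketch repeats the pattern so that it runs on its own and applies it to the sample markup from the beginning of the article.

```php
<?php
// Same pattern as above, repeated so the snippet is self-contained.
$html_tag = <<<REGEXP
~
</?\w+
(
"[^"]*" |
'[^']*' |
[^"'>]+
)*
>
~sx
REGEXP;

$html = "<span class='foo'>the pic of my class mate</span>";

// $m[0] holds the full matches, i.e. the tags that would be cut out.
preg_match_all($html_tag, $html, $m);
print_r($m[0]);
// Array
// (
//     [0] => <span class='foo'>
//     [1] => </span>
// )
```

Only opening and closing tags are matched; comments and doctype declarations are not covered by this pattern.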
Instantiate a new Clipboard and cut (remove) all tags:
$c = new Clipboard;
$html = $c->cut($html, $html_tag);
Make the rest uppercase and insert the tags back:
$html = strtoupper($html);
echo $c->paste($html);
For less verbosity, the calls can also be nested:
echo $c->paste(strtoupper($c->cut($html, $html_tag)));
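The harder case mentioned at the start, removing newlines everywhere except in tags and inside <script> or <pre>, can be sketched the same way. To keep the snippet runnable on its own, it uses two hypothetical helper functions, cut_out() and paste_back(), instead of the Clipboard class, and a deliberately simplified pattern for the protected regions. Both are assumptions made for illustration.

```php
<?php
// Hypothetical helper: replace every match with a marker and remember it.
function cut_out(string $text, string $regexp, array &$buf): string {
    return preg_replace_callback($regexp, function ($m) use (&$buf) {
        $buf[] = $m[0];
        // Marker: \x01, the 1-based index spelled with \x10-\x19, \x01.
        return "\x01" . strtr((string) count($buf), "0123456789",
            "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19") . "\x01";
    }, $text);
}

// Hypothetical helper: put the remembered fragments back.
function paste_back(string $text, array $buf): string {
    return preg_replace_callback("~\x01([\x10-\x19]+)\x01~",
        function ($m) use ($buf) {
            $i = (int) strtr($m[1],
                "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19", "0123456789");
            return $buf[$i - 1];
        }, $text);
}

// Simplified pattern (assumption): a whole <script> or <pre> element,
// or else any single tag.
$protected = '~<(script|pre)\b[^>]*>.*?</\1>|<[^>]+>~is';

$html = "<p\nclass='a'>one\ntwo</p>\n<pre>keep\nthis</pre>\n";
$buf  = array();

$out = paste_back(str_replace("\n", "", cut_out($html, $protected, $buf)), $buf);

var_dump($out === "<p\nclass='a'>onetwo</p><pre>keep\nthis</pre>"); // bool(true)
```

The transformation itself stays a plain str_replace(); all the knowledge about what to skip lives in the one pattern passed to the cut step.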
This technique (along with other regexp tricks) is extensively used in makrell.
Peter Goodman :
In a templating engine that I've worked on (that is WACT-like) that doesn't use macros, I've found getting at tags and the data between them using preg_split to be quite handy. (Although you are surely aware of this.)
Is what you're doing right now essentially delimiting the text you want to modify with x01? Or, are you tokenizing what you want to ignore/modify into $_buf?