You want to make your complex regular expressions understandable and maintainable.
You have four techniques at your disposal: comments outside the pattern, comments inside the pattern with the /x modifier, comments inside the replacement part of s///, and alternate delimiters.
The piece of sample code in Example 6.1 uses all four techniques. The initial comment describes the overall intent of the regular expression. For relatively simple patterns, this may be all that is needed. More complex patterns, as in the example, will require more documentation.
    #!/usr/bin/perl -p
    # resname - change all "foo.bar.com" style names in the input stream
    # into "foo.bar.com [204.148.40.9]" (or whatever) instead

    use Socket;                 # load inet_ntoa
    s{                          #
        (                       # capture the hostname in $1
            (?:                 # these parens for grouping only
                (?! [-_] )      # lookahead for neither underscore nor dash
                [\w-] +         # hostname component
                \.              # and the domain dot
            ) +                 # now repeat that whole thing a bunch of times
            [A-Za-z]            # next must be a letter
            [\w-] +             # now trailing domain part
        )                       # end of $1 capture
    }{                          # replace with this:
        "$1 " .                 # the original bit, plus a space
        ( ($addr = gethostbyname($1))       # if we get an addr
            ? "[" . inet_ntoa($addr) . "]"  #     format it
            : "[???]"                       # else mark dubious
        )
    }gex;                       # /g for global
                                # /e for execute
                                # /x for nice formatting
For aesthetics, the example uses alternate delimiters. When you split your match or substitution over multiple lines, it helps readability to have matching braces. Another common reason to use alternate delimiters is when your pattern or replacement contains slashes, as in s/\/\//\/..\//g. Alternate delimiters make such patterns easier to read, as in s!//!/../!g or s{//}{/../}g.
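As a minimal sketch, the three spellings of that substitution can be run side by side to confirm they do the same thing. The variable names and the sample path are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; any string containing "//" would do.
my $path = "usr//local//bin";

# The same substitution written three ways: replace each "//" with "/../".
(my $slashes = $path) =~ s/\/\//\/..\//g;   # default delimiters, slashes escaped
(my $bangs   = $path) =~ s!//!/../!g;       # "!" as the delimiter
(my $braces  = $path) =~ s{//}{/../}g;      # paired braces as delimiters

print "$slashes\n$bangs\n$braces\n";        # prints usr/../local/../bin three times
```

Only the first form needs backslashes, which is the whole argument for switching delimiters.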
The /x modifier makes Perl ignore most whitespace in the pattern (it still counts in a bracketed character class) and treat # characters and their following text as comments. Although useful, this can prove troublesome if you want literal whitespace or # characters in your pattern. If you do want these characters, you'll have to quote them with a backslash, as in the escaped pound signs here:
    s/                  # replace
      \#                #   a pound sign
      (\w+)             #   the variable name
      \#                #   another pound sign
    /${$1}/xg;          # with the value of the global variable
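Literal whitespace needs the same care as the pound sign. The following sketch, which uses a made-up input line and variable names, shows the two usual ways to match a space under /x: a backslash-escaped space, and a space inside a bracketed character class.

```perl
use strict;
use warnings;

# Hypothetical input line, made up for this sketch.
my $line = "issue #42 fixed";

my ($number) = $line =~ m{
    issue       # the literal word
    \           # one literal space, escaped with a backslash
    \#          # a literal pound sign
    (\d+)       # capture the digits
    [ ]         # a space in a character class still counts
    fixed       # the literal word
}x;

print "matched issue number $number\n" if defined $number;
```

Without the backslash or the brackets, the spaces would be ignored and the pattern would look for "issue#42fixed" instead.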
Remember that comments should explain the text, not just restate the code. Using "$i++ # add one to i" is apt to lose marks in your programming course or get you talked about by your coworkers.
The final technique is /e, which evaluates the replacement portion as a full Perl expression, not just as a (double-quote interpolated) string. The result of running this code is used as the replacement string. Because it is evaluated as code, you can put comments in it. This slows your code down somewhat, but not as much as you'd think. (Writing a benchmark of your own is a good idea here, because it helps you develop a feel for the efficiency of different constructs.) The cost is modest because the right-hand side of the substitution is syntax-checked and compiled at compile time along with the rest of your program. This may be overkill in the case of a simple string replacement, but it is marvelous for more complex cases.
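Here is a small sketch of a commented replacement under /e. The input sentence and the Fahrenheit-to-Celsius conversion are invented for illustration; the point is only that the right-hand side is ordinary Perl code and can carry ordinary comments.

```perl
use strict;
use warnings;

# Hypothetical input; the temperatures are made up for this sketch.
my $text = "It was 212F at noon and 32F at night.";

$text =~ s{
    (\d+) F                     # a whole number followed by F
}{
    my $celsius = ($1 - 32)     # shift to the Celsius zero point
                  * 5 / 9;      # then rescale the degrees
    sprintf "%.0fC", $celsius;  # the last value becomes the replacement
}gex;

print "$text\n";    # prints: It was 100C at noon and 0C at night.
```

As in the pattern, a comment inside the replacement must not contain the closing delimiter, or Perl will end the substitution early.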
Doubling up the /e to make /ee (or even more, like /eee!) is like the eval "STRING" construct. This allows you to use lexical variables instead of globals in the previous replacement example.
    s/                  # replace
      \#                #   a pound sign
      (\w+)             #   the variable name
      \#                #   another pound sign
    /'$' . $1/xeeg;     # with the value of *any* variable
After a /ee substitution, you can test the $@ variable. It contains any error messages resulting from running your code, because this is real run-time code generation, unlike /e.
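A minimal sketch of that check, assuming a lexical variable $name and a template string that are both invented for this example:

```perl
use strict;
use warnings;

# Hypothetical lexical variable and template, made up for this sketch.
my $name     = "world";
my $template = "Hello, #name#!";

$template =~ s/
    \#          # a pound sign
    (\w+)       # the variable name
    \#          # another pound sign
/'$' . $1/xeeg; # first pass builds the variable's name, second pass evals it

# The second evaluation happens at run time, so any failure lands in $@.
if ($@) {
    warn "substitution failed: $@";
} else {
    print "$template\n";    # prints: Hello, world!
}
```

With /g, each successful evaluation resets $@, so this test reflects only the last replacement that was run.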
See also: the /x modifier in perlre(1) and the "Pattern Matching" section of Chapter 2 of Programming Perl; the "Comments Within a Regular Expression" section of Chapter 7 of Mastering Regular Expressions.