You want your function or program to take one or more regular expressions as arguments, but doing so seems to run slower than using literals.
To overcome this bottleneck, if you have only one pattern whose value won't change during the entire run of a program, store it in a string and use
/$pattern/o
.
while ($line = <>) { if ($line =~ /$pattern/o) { # do something } }
If you have more than one pattern, however, that won't work. Use one of the three techniques outlined in the Discussion for a speed-up of an order of magnitude or so.
When Perl compiles a program, it converts patterns into an internal form. This conversion occurs at compile time for patterns without variables in them, but at run time for those that do contain variables. That means that interpolating variables into patterns, as in
/$pattern/
, can slow your program down. This is particularly noticeable when
$pattern
changes often.
The
/o
modifier is a promise from the script's author that the values of any variables interpolated into that pattern will not change - or that if they do, Perl should disregard any such changes. Given such a promise, Perl need only interpolate the variable and compile the pattern the first time it encounters the match. But if the interpolated variable were to change, Perl wouldn't notice. Make sure to use it only on unchanging variables, or else wrong answers will result.
Using
/o
on patterns without interpolated variables does not speed anything up. The
/o
modifier is also of no help when you have an unknown number of regular expressions and need to check one or more strings against all of these patterns. Nor is it of any use when the interpolated variable is a function argument, since each call of the function gives the variable a new value.
Example 6.4
is an example of the slow but straightforward technique for matching many patterns against many lines. The array
@popstates
contains the standard two-letter abbreviations for some of the places in the heartland of North America where we normally refer to soft drinks as
pop
(
soda
to us means either plain soda water or else handmade delicacies from the soda fountain at the corner drugstore, preferably with ice cream). The goal is to print out any line of input that contains any of those places, matching them at word boundaries only. It doesn't use
/o
because the variable that holds the pattern keeps changing.
#!/usr/bin/perl # popgrep1 - grep for abbreviations of places that say "pop" # version 1: slow but obvious way @popstates = qw(CO ON MI WI MN); LINE: while (defined($line = <>)) { for $state (@popstates) { if ($line =~ /\b$state\b/) { print; next LINE; } } }
Such a direct, obvious, brute-force approach is also horribly slow because it has to recompile all patterns with each line of input. Three different ways of addressing this are described in this section. One builds a string of Perl code and
eval
s it; one caches the internal representations of regular expressions in closures; and one uses the Regexp module from CPAN to hold compiled regular expressions.
The traditional way to get Perl to speed up a multiple match is to build up a string containing the code and
eval
"$code"
it.
Example 6.5
contains a version that uses this technique.
#!/usr/bin/perl # popgrep2 - grep for abbreviations of places that say "pop" # version 2: eval strings; fast but hard to quote @popstates = qw(CO ON MI WI MN); $code = 'while (defined($line = <>)) {'; for $state (@popstates) { $code .= "\tif (\$line =~ /\\b$state\\b/) { print \$line; next; }\n"; } $code .= '}'; print "CODE IS\n----\n$code\n----\n" if 0; # turn on to debug eval $code; die if $@;
The
popgrep2
program builds strings like this:
while (defined($line = <>)) { if ($line =~ /\bCO\b/) { print $line; next; } if ($line =~ /\bON\b/) { print $line; next; } if ($line =~ /\bMI\b/) { print $line; next; } if ($line =~ /\bWI\b/) { print $line; next; } if ($line =~ /\bMN\b/) { print $line; next; } }
As you see, those end up looking like constant strings to
eval
. We put the entire loop and pattern match in the
eval
text, too, which makes it run faster.
The worst thing about this
eval
"STRING"
approach is that it's difficult to get the quoting and escaping right. The
dequote
function from
Recipe 1.11
can make it easier to read, but escaping variables whose use is delayed will still be an issue. Also, none of the strings can contain a slash, since that's what we're using as a delimiter for the
m//
operator.
A solution to these problems is a subtle technique first developed by Jeffrey
Friedl. The key here is building an anonymous subroutine that caches the compiled patterns in the closure it creates. To do this, we
eval
a string containing the definition of an anonymous subroutine to match any of the supplied patterns. Perl compiles the pattern once, when the subroutine is defined. The string is evaluated to give you comparatively quick matching ability. An explanation of the algorithm can be found at the end of the section "Regex Compilation, the /o Modifier, and Efficiency" in Chapter 7 of
Mastering Regular Expressions
.
Example 6.6 is a version of our pop grepper that uses that technique.
#!/usr/bin/perl # popgrep3 - grep for abbreviations of places that say "pop" # version 3: use build_match_func algorithm @popstates = qw(CO ON MI WI MN); $expr = join('||', map { "m/\\b\$popstates[$_]\\b/o" } 0..$#popstates); $match_any = eval "sub { $expr }"; die if $@; while (<>) { print if &$match_any; }
The string that gets evaluated ends up looking like this, modulo formatting:
sub { m/\b$popstates[0]\b/o || m/\b$popstates[1]\b/o || m/\b$popstates[2]\b/o || m/\b$popstates[3]\b/o || m/\b$popstates[4]\b/o }
The reference to the
@popstates
array is locked up inside the closure. Each one is different, so the
/o
is safe here.
Example 6.7 is a generalized form of this technique showing how to create functions that return true if any of the patterns match or if all match.
#!/usr/bin/perl # grepauth - print lines that mention both Tom and Nat $multimatch = build_match_all(q/Tom/, q/Nat/); while (<>) { print if &$multimatch; } exit; sub build_match_any { build_match_func('||', @_) } sub build_match_all { build_match_func('&&', @_) } sub build_match_func { my $condition = shift; my @pattern = @_; # must be lexical variable, not dynamic one my $expr = join $condition => map { "m/\$pattern[$_]/o" } (0..$#pattern); my $match_func = eval "sub { local \$_ = shift if \@_; $expr }"; die if $@; # propagate $@; this shouldn't happen! return $match_func; }
Using
eval
"STRING"
on interpolated strings as we did in
popgrep2
is a hack that happens to work. Using lexical variables that get bound up in a closure as in
popgrep3
and the
build_match_*
functions is deep enough magic that even Perl wizards stare at it a while before they believe in it. Of course, it still works whether they believe in it or not.
What you really need is some way to get Perl to compile each pattern once and let you directly refer to the compiled form later on. Such functionality is directly supported in the 5.005 release in the form of a
qr//
regular-expression quoting operator. For prior releases, that's exactly what the experimental Regexp module from CPAN was designed for. Objects created by this module represent compiled regular expression patterns. Using the
match
method on these objects matches the pattern against the string argument. Methods in the class exist for extracting backreferences, determining where pattern matched, and passing flags corresponding to modifiers like
/i
.
Example 6.8 is a version of our program that demonstrates a simple use of this module.
#!/usr/bin/perl # popgrep4 - grep for abbreviations of places that say "pop" # version 4: use Regexp module use Regexp; @popstates = qw(CO ON MI WI MN); @poppats = map { Regexp->new( '\b' . $_ . '\b') } @popstates; while (defined($line = <>)) { for $patobj (@poppats) { print $line if $patobj->match($line); } }
You might wonder about the comparative speeds of these approaches. When run against the 22,000 line text file (the Jargon File, to be exact), version 1 ran in 7.92 seconds, version 2 in merely 0.53 seconds, version 3 in 0.79 seconds, and version 4 in 1.74 seconds. The last technique is a lot easier to understand than the others, although it does run slightly slower than they do. It's also more flexible.
Interpolation is explained in the "Scalar Value Constructors" section of
perldata
(1), and in the
"String literals"
section of
Chapter 2
of
Programming Perl
; the
/o
modifier in
perlre
(1) and the
"Pattern Matching"
section of
Chapter 2
of
Programming Perl
; the "Regex Compilation, the /o Modifier, and Efficiency" section of Chapter 7 of
Mastering Regular Expressions
; the documentation with the CPAN module Regexp
Copyright © 2001 O'Reilly & Associates. All rights reserved.