You want to use regular expressions on a string containing more than one line, but the special characters . (any character but newline), ^ (start of string), and $ (end of string) don't seem to work for you. This might happen if you're reading in multiline records or the whole file at once.
Use /m, /s, or both as pattern modifiers. /s lets . match newline (normally it doesn't). If the string had more than one line in it, then /foo.*bar/s could match a "foo" on one line and a "bar" on a following line. This doesn't affect dots in character classes like [#%.], since they are regular periods anyway.
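As a minimal sketch of the /s behavior just described (the two-line record in $record is made up for illustration), the same pattern can be tried with and without the modifier:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line record, for illustration only.
my $record = "foo on line one\nbar on line two\n";

# Without /s the dot stops at the newline, so the match cannot reach "bar".
my $without_s = ($record =~ /foo.*bar/)  ? 1 : 0;

# With /s the dot also matches newline, so the match spans both lines.
my $with_s    = ($record =~ /foo.*bar/s) ? 1 : 0;

# A dot inside a character class is a literal period, with or without /s.
# "a\nb" contains no '#', '%', or '.', so this does not match.
my $class_hit = ("a\nb" =~ /[#%.]/s) ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
print "class hit:  $class_hit\n";
```

Only the second test succeeds: the first fails because "bar" is on the next line, and the third fails because the class never treats the dot as a wildcard.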
The /m modifier lets ^ and $ match next to a newline. /^=head[1-7]$/m would match that pattern not just at the beginning of the record, but anywhere right after a newline as well.
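Here is a small sketch of the /m behavior, assuming a made-up record in $pod whose =head lines are not at the start of the string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record with two =head lines, neither at the very start.
my $pod = "intro text\n=head1\nbody\n=head2\n";

# Without /m, ^ matches only at the start of the string, so this fails.
my $plain = ($pod =~ /^=head[1-7]$/) ? 1 : 0;

# With /m, ^ and $ also match just after and just before each newline.
# In list context, /g returns every captured line.
my @heads = $pod =~ /^(=head[1-7])$/mg;

print "without /m: $plain\n";
print "with /m:    @heads\n";
```

Without /m nothing matches, because the record begins with "intro text"; with /m both header lines are found.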
A common, brute-force approach to parsing documents where newlines are not significant is to read the file one paragraph at a time (or sometimes even the entire file as one string) and then extract tokens one by one. To match across newlines, you need to make . match a newline; it ordinarily does not. In cases where newlines are important and you've read more than one line into a string, you'll probably prefer to have ^ and $ match beginning- and end-of-line, not just beginning- and end-of-string.
The difference between /m and /s is important: /m makes ^ and $ match next to a newline, while /s makes . match newlines. You can even use them together; they're not mutually exclusive options.
Example 6.2 creates a filter to strip HTML tags out of each file in @ARGV and send the results to STDOUT. First we undefine the record separator so each read operation fetches one entire file. (There could be more than one file, because @ARGV may have several arguments in it. In that case, each read gets one whole file.) Then we strip out instances of beginning and ending angle brackets, plus anything in between them. We can't use just .* for two reasons: first, it would match closing angle brackets, and second, the dot wouldn't cross newline boundaries. Using .*? in conjunction with /s solves these problems, at least in this case.
#!/usr/bin/perl
# killtags - very bad html tag killer
undef $/;           # each read is whole file
while (<>) {        # get one whole file at a time
    s/<.*?>//gs;    # strip tags (terribly)
    print;          # print file to STDOUT
}
Because the closing delimiter is just a single character, it would be much faster to use s/<[^>]*>//gs, but that's still a naïve approach: it doesn't correctly handle tags inside HTML comments or angle brackets in quotes (<IMG SRC="here.gif" ALT="<<Ooh la la!>>">). Recipe 20.6 explains how to avoid these problems.
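To make the limitation concrete, here is a rough sketch that runs both substitutions on a made-up snippet in $html and then on the quoted-bracket tag mentioned above. The variable names are ours, not part of the original example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: one tag is split across two lines.
my $html = qq{<B>bold</B>\n<A\nHREF="x.html">link</A>\n};

# Minimal match with /s: the dot may cross the newline inside <A ...>.
(my $lazy = $html) =~ s/<.*?>//gs;

# Negated class: [^>] already matches newline, so the result is the same.
(my $negated = $html) =~ s/<[^>]*>//gs;

# Both approaches stop at the first '>' they see, even inside quotes.
my $leftover = q{<IMG SRC="here.gif" ALT="<<Ooh la la!>>">};
$leftover =~ s/<[^>]*>//gs;

print $lazy;
print $negated;
print "left behind: $leftover\n";
```

Both versions clean up the simple input identically, but the quoted angle brackets leave stray characters behind, which is why neither is a real HTML parser.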
Example 6.3 takes a plain text document and looks for lines at the start of paragraphs that look like "Chapter 20: Better Living Through Chemisery". It wraps these with an appropriate HTML level one header. Because the pattern is relatively complex, we use the /x modifier so we can embed whitespace and comments.
#!/usr/bin/perl
# headerfy: change certain chapter headers to html
$/ = '';
while ( <> ) {              # fetch a paragraph
    s{
        \A                  # start of record
        (                   # capture in $1
            Chapter         # text string
            \s+             # mandatory whitespace
            \d+             # decimal number
            \s*             # optional whitespace
            :               # a real colon
            . *             # anything not a newline till end of line
        )
    }{<H1>$1</H1>}gx;
    print;
}
Here it is as a one-liner from the command line if those extended comments just get in the way of understanding:
% perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile
This problem is interesting because we need to be able to specify both start-of-record and end-of-line in the same pattern. We could normally use ^ for start-of-record, but we need $ to indicate not only end-of-record but end-of-line as well. Adding the /m modifier to get that changes both ^ and $, so instead of using ^ to match beginning-of-record, we use \A. (The code shown gets by without $ or /m because . stops at the newline anyway. We're not using it here, but in case you're interested, the version of $ that always matches at end-of-record, or just before a final newline there, even in the presence of /m is \Z.)
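The anchors can be compared side by side in a short sketch. The record in $para is invented for illustration, and \z (lowercase) is included as the stricter relative of \Z: it matches only at the very end of the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line record ending in a newline.
my $para = "Chapter 1: Intro\nbody text\n";

# Under /m, ^ matches at the start of every line ...
my @line_starts = $para =~ /^(\w+)/mg;

# ... while \A still matches only at the start of the record.
my @rec_starts  = $para =~ /\A(\w+)/mg;

# Under /m, $ matches before every newline ...
my @line_ends   = $para =~ /(\w+)$/mg;

# ... while \Z matches only at the end of the record (before a final newline).
my ($rec_end)   = $para =~ /(\w+)\Z/m;

# \z allows no trailing newline, so a word character cannot precede it here.
my $very_end    = ($para =~ /\w\z/m) ? 1 : 0;

print "line starts: @line_starts\n";
print "rec starts:  @rec_starts\n";
print "line ends:   @line_ends\n";
print "rec end:     $rec_end\n";
print "\\w before \\z: $very_end\n";
```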
The following example demonstrates using both /s and /m together. We use both because we want ^ to match the beginning of any line in the paragraph and also want dot to be able to match a newline. (Because they are unrelated, using them together is simply the sum of the parts. If you have the questionable habit of using "single line" as a mnemonic for /s and "multiple line" for /m, then you may think you can't use them together.) The predefined variable $. represents the current record number of the filehandle most recently read. The predefined variable $ARGV holds the name of the file automatically opened by implicit <ARGV> processing.
$/ = '';            # paragraph read mode for readline access
while (<ARGV>) {
    while (m#^START(.*?)^END#smg) {     # /s makes . span line boundaries
                                        # /m makes ^ match near newlines
                                        # /g resumes after each match so the loop ends
        print "chunk $. in $ARGV has <<$1>>\n";
    }
}
If you've already committed to using the /m modifier, you can use \A and \Z to get the old meanings of ^ and $ respectively. But what if you've used the /s modifier and want to get the original meaning of . ? You can use [^\n]. If you don't care to use /s but want the notion of matching any character, you could construct a character class that matches any one byte, such as [\000-\377] or even [\d\D]. You can't use [.\n] because . is not special in a character class.
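A brief sketch of these substitutes, using a made-up two-line string in $text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line string with no trailing newline.
my $text = "one\ntwo";

# Under /s, [^\n] gives back the usual "anything but newline" meaning.
my ($first_line) = $text =~ /([^\n]*)/s;

# Under /s, the dot itself runs across the newline.
my ($all_with_s) = $text =~ /(.*)/s;

# Without /s, [\d\D] (a digit or a non-digit) still matches any character.
my ($all_class)  = $text =~ /([\d\D]*)/;

# [.\n] matches only a literal period or a newline, not "any character".
my $x_matches      = ("x" =~ /^[.\n]$/) ? 1 : 0;
my $period_matches = ("." =~ /^[.\n]$/) ? 1 : 0;

print "first line: $first_line\n";
print "x matches [.\\n]: $x_matches\n";
print ". matches [.\\n]: $period_matches\n";
```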
See also: the $/ variable in perlvar(1) and in the "Special Variables" section of Chapter 2 of Programming Perl; the /s and /m modifiers in perlre(1) and "the fine print" section of Chapter 2 of Programming Perl; the "String Anchors" section of Mastering Regular Expressions; we talk more about the special variable $/ in Chapter 8.
Copyright © 2001 O'Reilly & Associates. All rights reserved.