You want to pick out words from a string.
Think long and hard about what you want a word to be and what separates one word from the next, then write a regular expression that embodies your decisions. For example:
/\S+/               # as many non-whitespace bytes as possible
/[A-Za-z'-]+/       # as many letters, apostrophes, and hyphens
Because words vary between applications, languages, and input streams, Perl does not have built-in definitions of words. You must make them from character classes and quantifiers yourself, as the patterns above do. The second pattern is an attempt to recognize "shepherd's" and "sheep-shearing" each as single words.
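As a minimal sketch of how the two patterns differ in practice, the following applies both to the same sentence. The sample text and the variable names are illustrative assumptions, not part of the recipe.

```perl
use strict;
use warnings;

# Sample sentence chosen for illustration only.
my $text = "The shepherd's sheep-shearing went well, mostly.";

# Runs of non-whitespace: punctuation stays attached ("well," and "mostly.").
my @chunks = $text =~ /\S+/g;

# Runs of letters, apostrophes, and hyphens: trailing punctuation is dropped.
my @words = $text =~ /[A-Za-z'-]+/g;

print join("|", @chunks), "\n";
print join("|", @words), "\n";
```

In list context, a /g match with no capturing parentheses returns every matched substring, which is why each array ends up holding one element per word.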
Most approaches will have limitations because of the vagaries of written human languages. For instance, although the second pattern successfully identifies "spank'd" and "counter-clockwise" as words, it will also pull the "rd" out of "23rd Psalm". If you want to be more precise when you pull words out of a string, you can specify what surrounds the word. Normally, this should be a word boundary, not whitespace:
/\b([A-Za-z]+)\b/   # usually best
/\s([A-Za-z]+)\s/   # fails at ends or w/ punctuation
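To see why the boundary version is usually the better choice, here is a small sketch that runs both patterns over the same input. The sample strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample input chosen for illustration only.
my $text = "Stop, thief! Come back";

# Word boundaries work at the ends of the string and next to punctuation.
my @bounded = $text =~ /\b([A-Za-z]+)\b/g;

# Requiring whitespace on both sides finds only "Come" here.
my @spaced = $text =~ /\s([A-Za-z]+)\s/g;

# The boundary version also skips the "rd" in "23rd", because there is
# no boundary between "3" and "r" (both are \w characters).
my @psalm = "23rd Psalm" =~ /\b([A-Za-z]+)\b/g;

print "bounded: @bounded\n";
print "spaced:  @spaced\n";
print "psalm:   @psalm\n";
```

The whitespace version misses "Stop" and "back" because nothing precedes or follows them, and misses "thief" because it is followed by punctuation rather than a space.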
Although Perl provides \w, which matches a character that is part of a valid Perl identifier, Perl identifiers are rarely what you think of as words: a run of \w characters is a string of alphanumerics and underscores, with no colons or quotes. Because it's defined in terms of \w, \b may surprise you if you expect to match an English word boundary (or, even worse, a Swahili word boundary).
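One way to see the surprise is to split a phrase on \b-delimited runs of \w. This is only a sketch, and the sample phrase is an assumption reusing the words from earlier in the recipe.

```perl
use strict;
use warnings;

# Sample phrase chosen for illustration only.
my $phrase = "the shepherd's sheep-shearing";

# \b sees a boundary on either side of the apostrophe and the hyphen,
# so each English word containing one is split in two.
my @pieces = $phrase =~ /\b(\w+)\b/g;

print join("|", @pieces), "\n";
```

Neither the apostrophe nor the hyphen is a \w character, so "shepherd's" comes back as "shepherd" and "s", and "sheep-shearing" as "sheep" and "shearing".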
\b and \B can still be useful. For example, /\Bis\B/ matches the string "is" only within a word, not at the edges. And while "thistle" would be found, "vis-à-vis" wouldn't.
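A short sketch of the \B behavior follows. The list of candidate words and the %inside hash are illustrative assumptions.

```perl
use strict;
use warnings;

# Record, for each candidate, whether "is" occurs strictly inside a word.
my %inside;
for my $candidate (qw(thistle this is island)) {
    $inside{$candidate} = $candidate =~ /\Bis\B/ ? 1 : 0;
}

for my $candidate (sort keys %inside) {
    my $verdict = $inside{$candidate} ? "matches" : "does not match";
    print "$candidate: $verdict\n";
}
```

Only "thistle" matches. In "this" the "is" touches the end of the word, in "island" it touches the start, and in "is" it touches both, so \B fails on at least one side in each case.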
See also: the treatment of \b, \w, and \s in perlre(1) and in the "Regular expression bestiary" section of Chapter 2 of Programming Perl; the words-related patterns in Recipe 6.23.