Recipe 1.5. Processing a String One Character at a Time (Perl Cookbook)

1.5. Processing a String One Character at a Time

Problem

You want to process a string one character at a time.

Solution

Use split with a null pattern to break up the string into individual characters, or use unpack if you just want their ASCII values:

    @array = split(//, $string);
    @array = unpack("C*", $string);

Or extract each character in turn with a loop:

    while (/(.)/g) { # . is never a newline here
        # do something with $1
    }
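The comment in the loop above matters in practice: without the /s modifier, the dot skips newlines, so a multi-line string yields fewer matches than it has characters. The following is a minimal sketch of the difference; chars_visited is a hypothetical helper name, not something from the recipe.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the characters that /(.)/g visits.
# Pass a true second argument to add /s so that "." also matches "\n".
sub chars_visited {
    my ($string, $include_newlines) = @_;
    my @chars;
    if ($include_newlines) {
        push @chars, $1 while $string =~ /(.)/gs;
    } else {
        push @chars, $1 while $string =~ /(.)/g;
    }
    return @chars;
}

my $text = "ab\ncd";
print scalar(chars_visited($text)), "\n";      # 4: the newline is skipped
print scalar(chars_visited($text, 1)), "\n";   # 5: /s lets . match the newline
```

If the newline needs processing too, adding /s (or splitting with a null pattern) is probably the simplest fix.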

Discussion

As we said before, Perl's fundamental unit is the string, not the character. Needing to process anything a character at a time is rare. Usually some kind of higher-level Perl operation, like pattern matching, solves the problem more easily. See, for example, Recipe 7.7, where a set of substitutions is used to find command-line arguments.

Splitting on a pattern that matches the empty string returns a list of the individual characters in the string. This is a convenient feature when done intentionally, but it's easy to do unintentionally. For instance, /X*/ matches the empty string. Odds are you will find others when you don't mean to.
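To illustrate how this happens by accident, here is a small sketch. The record string is made-up data, and the assumption is that the author meant to split on runs of X but wrote * where + was intended.

```perl
use strict;
use warnings;

my $record = "fooXbarXXbaz";   # made-up data: fields separated by runs of X

# /X*/ can match the empty string, so split also cuts between every character.
my @oops   = split /X*/, $record;

# /X+/ needs at least one X, so only real separators split the string.
my @fields = split /X+/, $record;

print join("|", @oops), "\n";     # f|o|o|b|a|r|b|a|z
print join("|", @fields), "\n";   # foo|bar|baz
```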

Here's an example that prints the characters used in the string "an apple a day", sorted in ascending ASCII order:

    %seen = ();
    $string = "an apple a day";
    foreach $byte (split //, $string) {
        $seen{$byte}++;
    }
    print "unique chars are: ", sort(keys %seen), "\n";



unique chars are:  adelnpy

These split and unpack solutions give you an array of characters to work with. If you don't want an array, you can use a pattern match with the /g flag in a while loop, extracting one character at a time:

    %seen = ();
    $string = "an apple a day";
    while ($string =~ /(.)/g) {
        $seen{$1}++;
    }
    print "unique chars are: ", sort(keys %seen), "\n";



unique chars are:  adelnpy

In general, if you find yourself doing character-by-character processing, there's probably a better way to go about it. Instead of using index and substr or split and unpack, it might be easier to use a pattern. Instead of computing a 32-bit checksum by hand, as in the next example, you can have the unpack function compute it far more efficiently.

The following example calculates the checksum of $string with a foreach loop. There are better checksums; this just happens to be the basis of a traditional and computationally easy checksum. See the Digest::MD5 module (the successor to the older MD5 module on CPAN) if you want a sounder checksum.

    $sum = 0;
    foreach $ascval (unpack("C*", $string)) {
        $sum += $ascval;
    }
    print "sum is $sum\n";
    # prints "sum is 1248" if $string was "an apple a day"

This does the same thing, but much faster:

$sum = unpack("%32C*", $string);
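As a quick check that the two approaches agree, the sketch below wraps each in a subroutine and runs both on the same string. The names checksum_loop and checksum_unpack are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Sum the byte values one at a time, as in the loop above.
sub checksum_loop {
    my ($string) = @_;
    my $sum = 0;
    $sum += $_ for unpack("C*", $string);
    return $sum;
}

# Let unpack's % prefix do the summing; the result is kept to 32 bits.
sub checksum_unpack {
    my ($string) = @_;
    return unpack("%32C*", $string);
}

my $string = "an apple a day";
print checksum_loop($string), "\n";     # 1248
print checksum_unpack($string), "\n";   # 1248
```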

This lets us emulate the SysV checksum program:

    #!/usr/bin/perl
    # sum - compute 16-bit checksum of all input files
    $checksum = 0;
    while (<>) { $checksum += unpack("%16C*", $_) }
    $checksum %= (2 ** 16) - 1;
    print "$checksum\n";

Here's an example of its use:

% perl sum /etc/termcap 



1510

If you have the GNU version of sum, you'll need to call it with the --sysv option to get the same answer on the same file.

% sum --sysv /etc/termcap 



1510 851 /etc/termcap

Another tiny program that processes its input one character at a time is slowcat, shown in Example 1.1. The idea here is to pause after each character is printed so you can scroll text before an audience slowly enough that they can read it.

Example 1.1: slowcat

    #!/usr/bin/perl
    # slowcat - emulate a   s l o w   line printer
    # usage: slowcat [-DELAY] [files ...]
    $DELAY = ($ARGV[0] =~ /^-([.\d]+)/) ? (shift, $1) : 1;
    $| = 1;
    while (<>) {
        for (split(//)) {
            print;
            select(undef,undef,undef, 0.005 * $DELAY);
        }
    }
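The $DELAY line in slowcat is dense: it tests the first argument, removes it with shift if it looks like -NUMBER, and relies on the comma operator to yield $1. The sketch below spells out the same idea step by step. The parse_delay helper and the sample argument list are assumptions made for illustration, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring slowcat's option handling: if the first
# argument looks like -NUMBER, remove it from the list and return NUMBER;
# otherwise leave the list alone and return the default of 1.
sub parse_delay {
    my ($args) = @_;                       # reference to an argument list
    if (@$args && $args->[0] =~ /^-([.\d]+)/) {
        my $delay = $1;                    # save $1 before doing anything else
        shift @$args;
        return $delay;
    }
    return 1;
}

my @args = ("-2.5", "notes.txt");          # made-up command line
my $delay = parse_delay(\@args);
print "delay=$delay files=@args\n";        # delay=2.5 files=notes.txt

# Four-argument select with three undefs just sleeps for the given
# (possibly fractional) number of seconds.
select(undef, undef, undef, 0.005 * $delay);
```

The four-argument select is used because the built-in sleep only takes whole seconds, while the pause between characters here is a few milliseconds.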
