Contents:
Introduction
Reading Lines with Continuation Characters
Counting Lines (or Paragraphs or Records) in a File
Processing Every Word in a File
Reading a File Backwards by Line or Paragraph
Trailing a Growing File
Picking a Random Line from a File
Randomizing All Lines
Reading a Particular Line in a File
Processing Variable-Length Text Fields
Removing the Last Line of a File
Processing Binary Files
Using Random-Access I/O
Updating a Random-Access File
Reading a String from a Binary File
Reading Fixed-Length Records
Reading Configuration Files
Testing a File for Trustworthiness
Program: tailwtmp
Program: tctee
Program: laston
The most brilliant decision in all of Unix was the choice of a single character for the newline sequence.
- Mike O'Dell, only half jokingly
Before the Unix Revolution, every kind of data source and destination was inherently different. Getting two programs merely to understand each other required heavy wizardry and the occasional sacrifice of a virgin stack of punch cards to an itinerant mainframe repairman. This computational Tower of Babel made programmers dream of quitting the field to take up a less painful hobby, like autoflagellation.
These days, such cruel and unusual programming is largely behind us. Modern operating systems work hard to provide the illusion that I/O devices, network connections, process control information, other programs, the system console, and even users' terminals are all abstract streams of bytes called files. This lets you easily write programs that don't care where their input came from or where their output goes.
Because programs read and write via byte streams of simple text, every program can communicate with every other program. It is difficult to overstate the power and elegance of this approach. No longer dependent upon troglodyte gnomes with secret tomes of JCL (or COM) incantations, users can now create custom tools from smaller ones by using simple command-line I/O redirection, pipelines, and backticks.
Treating files as unstructured byte streams necessarily governs what you can do with them. You can read and write sequential, fixed-size blocks of data at any location in the file, increasing its size if you write past the current end. Perl uses the standard C I/O library to implement reading and writing of variable-length records like lines, paragraphs, and words.
What can't you do to an unstructured file? Because you can't insert or delete bytes anywhere but at end of file, you can't change the length of, insert, or delete records. An exception is the last record, which you can delete by truncating the file to the end of the previous record. For other modifications, you need to use a temporary file or work with a copy of the file in memory. If you need to do this a lot, a database system may be a better solution than a raw file (see Chapter 14, Database Access).
The most common files are text files, and the most common operations on text files are reading and writing lines. Use <FH> (or the internal function implementing it, readline) to read lines, and use print to write them. These functions can also be used to read or write any record that has a specific record separator. Lines are simply records that end in "\n".
The <FH> operator returns undef on error or when end of the file is reached, so use it in loops like this:
    while (defined ($line = <DATAFILE>)) {
        chomp $line;
        $size = length $line;
        print "$size\n";                # output size of line
    }
Because this is a common operation and that's a lot to type, Perl gives it a shorthand notation. This shorthand reads lines into $_ instead of $line. Many other string operations use $_ as a default value to operate on, so this is more useful than it may appear at first:
    while (<DATAFILE>) {
        chomp;
        print length, "\n";             # output size of line
    }
Call <FH> in scalar context to read the next line. Call it in list context to read all remaining lines:
@lines = <DATAFILE>;
Each time <FH> reads a record from a filehandle, it increments the special variable $. (the "current input record number"). This variable is only reset when close is called explicitly, which means that it's not reset when you reopen an already opened filehandle.
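The two calling contexts and the record counter can be seen together in a small sketch. It assumes an in-memory filehandle (opened on a reference to a string) in place of a real file, and the variable names are made up for illustration:

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for a real file here; this is an
# assumption made so the sketch runs anywhere without a data file.
my $text = "alpha\nbeta\ngamma\ndelta\n";
open(my $fh, '<', \$text) or die "Couldn't open in-memory file: $!\n";

my $first       = <$fh>;    # scalar context: reads one line
my $after_first = $.;       # record number after the first read
my @rest        = <$fh>;    # list context: reads all remaining lines
my $after_rest  = $.;       # record number after reaching end of file
close($fh);                 # an explicit close resets $.

chomp($first, @rest);
print "first: $first (record $after_first)\n";
print "rest: @rest (last record $after_rest)\n";
```

The values of $. are copied into ordinary variables before the close, because the explicit close is what resets the counter.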
Another special variable is $/, the input record separator. It is set to "\n", the default end-of-line marker. You can set it to any string you like, for instance "\0" to read null-terminated records. Read paragraphs by setting $/ to the empty string, "". This is almost like setting $/ to "\n\n", in that blank lines function as record separators, but "" treats two or more consecutive empty lines as a single record separator, whereas "\n\n" returns empty records when more than two consecutive empty lines are read. Undefine $/ to read the rest of the file as one scalar:
    undef $/;
    $whole_file = <FILE>;               # 'slurp' mode
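The difference between paragraph mode and a literal "\n\n" separator is easiest to see by counting records. The following is a minimal sketch, assuming an in-memory filehandle as input; records_with is a hypothetical helper written for this illustration, not a built-in:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the text): return every record read from
# $text when the input record separator is set to $separator.
sub records_with {
    my ($text, $separator) = @_;
    local $/ = $separator;      # the old value of $/ returns when the sub exits
    open(my $fh, '<', \$text) or die "Couldn't open in-memory file: $!\n";
    my @records = <$fh>;
    close($fh);
    return @records;
}

# The line "one", then three empty lines, then the line "two".
my $text = "one\n\n\n\ntwo\n";

my @paragraphs = records_with($text, "");       # paragraph mode
my @doubles    = records_with($text, "\n\n");   # literal two-newline separator

printf "paragraph mode: %d records\n", scalar @paragraphs;
printf "two-newline separator: %d records\n", scalar @doubles;
```

Using local on $/ confines the change to the helper, so code elsewhere that expects line-at-a-time reads is not affected.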
The -0 option to Perl lets you set $/ from the command line:
% perl -040 -e '$word = <>; print "First word is $word\n";'
The digits after -0 are the octal value of the single character that $/ is to be set to. If you specify an illegal value (e.g., with -0777) Perl will set $/ to undef. If you specify -00, Perl will set $/ to "". The limit of a single octal value means you can't set $/ to a multibyte string, for instance, "%%\n" to read fortune files. Instead, you must use a BEGIN block:
% perl -ne 'BEGIN { $/="%%\n" } chomp; print if /Unix/i' fortune.dat
Use print to write a line or any other data. The print function writes its arguments one after another and doesn't automatically add a line or record terminator by default.
    print HANDLE "One", "two", "three"; # "Onetwothree"
    print "Baa baa black sheep.\n";     # Sent to default output handle
There is no comma between the filehandle and the data to print. If you put a comma in there, Perl gives the error message "No comma allowed after filehandle". The default output handle is STDOUT. Change it with the select function. (See the introduction to Chapter 7, File Access.)
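Here is a short sketch of print with an explicit handle, with a selected default handle, and with the output record separator $\ set. It assumes an in-memory filehandle so the output can be inspected afterward; the variable names are illustrative only:

```perl
use strict;
use warnings;

# Output is captured in a string through an in-memory filehandle; this
# is an assumption made so the result can be inspected afterward.
my $captured = '';
open(my $out, '>', \$captured) or die "Couldn't open in-memory file: $!\n";

print {$out} "One", "two", "three";     # arguments run together: "Onetwothree"

my $previous = select($out);            # make $out the default output handle
print "\n";                             # goes to $out, not STDOUT
{
    local $\ = "\n";                    # output record separator, this block only
    print "Baa baa black sheep.";       # a newline is appended automatically
}
select($previous);                      # restore the old default handle
close($out) or die "Couldn't close: $!\n";

print $captured;
```

Saving the return value of select and restoring it afterward keeps the change of default handle from leaking into the rest of the program.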
All systems use the virtual "\n" to represent a line terminator, called a newline. There is no such thing as a newline character. It is an illusion that the operating system, device drivers, C libraries, and Perl all conspire to preserve. Sometimes, this changes the number of characters in the strings you read and write. The conspiracy is revealed in Recipe 8.11.
Use the read function to read a fixed-length record. It takes three arguments: a filehandle, a scalar variable, and the number of bytes to read. It returns undef if an error occurred or else the number of bytes read. To write a fixed-length record, just use print.
    $rv = read(HANDLE, $buffer, 4096);
    defined($rv) or die "Couldn't read from HANDLE : $!\n";
    # $rv is the number of bytes read (0 at end of file),
    # $buffer holds the data read
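Because read returns 0 at end of file and undef on error, a loop over fixed-length records needs to tell the two apart. The following is one possible sketch; read_records is a hypothetical helper, and the in-memory filehandle and 4-byte record size are assumptions chosen to keep the example small:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the text): split everything left on $fh
# into records of $size bytes each.
sub read_records {
    my ($fh, $size) = @_;
    my @records;
    while (1) {
        my $buffer;
        my $rv = read($fh, $buffer, $size);
        die "Couldn't read: $!\n" unless defined $rv;   # undef means error
        last if $rv == 0;                               # 0 means end of file
        push @records, $buffer;                         # final record may be short
    }
    return @records;
}

my $data = "AAAABBBBCC";
open(my $fh, '<', \$data) or die "Couldn't open in-memory file: $!\n";
my @records = read_records($fh, 4);
close($fh);

print join("|", @records), "\n";
```

The last record comes back shorter than requested when the data is not an exact multiple of the record size, so callers that need strict record lengths should check length($buffer) as well.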
The truncate function changes the length of a file, which can be specified as a filehandle or as a filename. It returns true if the file was successfully truncated, false otherwise:
    truncate(HANDLE, $length)
        or die "Couldn't truncate: $!\n";
    truncate("/tmp/$$.pid", $length)
        or die "Couldn't truncate: $!\n";
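As a rough illustration, the sketch below writes two lines to a scratch file and then truncates it to the length of the first. The scratch file comes from File::Temp, and the explicit flush before truncating is a precaution assumed here, not something the text requires:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Handle;                 # provides the flush method on older perls

# A scratch file created just for this sketch; it is removed at exit.
my ($fh, $filename) = tempfile(UNLINK => 1);
binmode($fh);                   # keep byte counts exact on every platform

my $keep = "first line\n";
print {$fh} $keep, "second line\n";
$fh->flush;                     # push buffered data to the file first

my $before = -s $filename;
truncate($fh, length($keep)) or die "Couldn't truncate: $!\n";
my $after = -s $filename;
close($fh) or die "Couldn't close: $!\n";

print "size went from $before to $after bytes\n";
```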
Each filehandle keeps track of where it is in the file. Reads and writes occur from this point, unless you've specified the O_APPEND flag (see Recipe 7.1). Fetch the file position for a filehandle with tell, and set it with seek. Because the stdio library rewrites data to preserve the illusion that "\n" is the line terminator, you cannot portably seek to offsets calculated by counting characters. Instead, only seek to offsets returned by tell.
    $pos = tell(DATAFILE);
    print "I'm $pos bytes from the start of DATAFILE.\n";
The seek function takes three arguments: the filehandle, the offset (in bytes) to go to, and a numeric argument indicating how to interpret the offset. 0 indicates an offset from the start of the file (the kind of value returned by tell); 1, an offset from the current location (a negative number means move backwards in the file, a positive number means move forward); and 2, an offset from end of file.
    seek(LOGFILE, 0, 2)         or die "Couldn't seek to the end: $!\n";
    seek(DATAFILE, $pos, 0)     or die "Couldn't seek to $pos: $!\n";
    seek(OUT, -20, 1)           or die "Couldn't seek back 20 bytes: $!\n";
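The advice to seek only to offsets returned by tell can be sketched as follows. The example assumes an in-memory filehandle and uses the SEEK_SET and SEEK_END constants from the standard Fcntl module in place of the literal 0 and 2:

```perl
use strict;
use warnings;
use Fcntl qw(:seek);            # SEEK_SET (0), SEEK_CUR (1), SEEK_END (2)

# An in-memory filehandle stands in for a real file (an assumption).
my $text = "line one\nline two\nline three\n";
open(my $fh, '<', \$text) or die "Couldn't open in-memory file: $!\n";

my $first  = <$fh>;
my $mark   = tell($fh);         # offset of the start of the second line
my $second = <$fh>;

seek($fh, $mark, SEEK_SET) or die "Couldn't seek to $mark: $!\n";
my $again = <$fh>;              # reads the second line once more

seek($fh, 0, SEEK_END) or die "Couldn't seek to the end: $!\n";
my $size = tell($fh);           # at end of file, the position is the size
close($fh);

print "mark at $mark, size $size\n";
```

The named constants do the same job as the numbers in the text; they only make the intent of each call easier to read.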
So far we've been describing buffered I/O. That is, <FH>, print, read, seek, and tell are all operations that use buffers for speed. Perl also provides unbuffered I/O operations: sysread, syswrite, and sysseek, all discussed in Chapter 7.
The sysread and syswrite functions are different from their <FH> and print counterparts. They both take a filehandle to act on, a scalar variable to either read into or write out from, and the number of bytes to read or write. They can also take an optional fourth argument, the offset in the scalar variable to start reading or writing at:
    $written = syswrite(DATAFILE, $mystring, length($mystring));
    die "syswrite failed: $!\n" unless $written == length($mystring);
    $read = sysread(INFILE, $block, 256, 5);
    warn "only read $read bytes, not 256" if 256 != $read;
The syswrite call sends the contents of $mystring to DATAFILE. The sysread call reads 256 bytes from INFILE and stores them 5 characters into $block, leaving its first 5 characters intact. Both sysread and syswrite return the number of bytes transferred, which could be different than the amount of data you were attempting to transfer. Maybe the file didn't have all the data you thought it did, so you got a short read. Maybe the filesystem that the file lives on filled up. Maybe your process was interrupted part of the way through the write. Stdio takes care of finishing the transfer in cases of interruption, but if you use the sysread and syswrite calls, you must do it yourself. See Recipe 9.3 for an example of this.
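One way to finish a partial transfer yourself is to loop until every byte has gone out. The sketch below is not the code from Recipe 9.3; write_all is a hypothetical helper, and retrying when the call is interrupted (EINTR) is one reasonable policy among several:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the text): call syswrite repeatedly until
# all of $data has been written, resuming after each partial write.
sub write_all {
    my ($fh, $data) = @_;
    my $total  = length $data;
    my $offset = 0;
    while ($offset < $total) {
        my $written = syswrite($fh, $data, $total - $offset, $offset);
        if (!defined $written) {
            next if $!{EINTR};              # interrupted before writing: retry
            die "syswrite failed: $!\n";
        }
        $offset += $written;                # skip past what was written
    }
    return $offset;
}

# A scratch file created just for this sketch; it is removed at exit.
my ($fh, $filename) = tempfile(UNLINK => 1);
binmode($fh);
my $message = "x" x 10_000;
my $sent = write_all($fh, $message);
close($fh) or die "Couldn't close: $!\n";

print "wrote $sent bytes\n";
```

The fourth argument to syswrite is what makes the loop cheap: each pass starts from the offset already written, so no copy of the remaining data has to be made.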
The sysseek function doubles as an unbuffered replacement for both seek and tell. It takes the same arguments as seek, but it returns either the new position if successful or undef on error. To find the current position within the file:
    $pos = sysseek(HANDLE, 0, 1);       # don't change position
    die "Couldn't sysseek: $!\n" unless defined $pos;
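The defined test matters because a successful sysseek to position zero returns the string "0 but true", which is true in Boolean context and zero in numeric context. The following short sketch shows both cases, assuming a scratch file from File::Temp:

```perl
use strict;
use warnings;
use Fcntl qw(:seek);            # SEEK_SET (0), SEEK_CUR (1), SEEK_END (2)
use File::Temp qw(tempfile);

# A scratch file created just for this sketch; it is removed at exit.
my ($fh, $filename) = tempfile(UNLINK => 1);
defined(syswrite($fh, "0123456789")) or die "syswrite failed: $!\n";

my $here  = sysseek($fh, 0, SEEK_CUR);  # current position, left unchanged
my $start = sysseek($fh, 0, SEEK_SET);  # rewind to the beginning
die "Couldn't sysseek: $!\n" unless defined $here && defined $start;
close($fh) or die "Couldn't close: $!\n";

print "was at $here, now at ", $start + 0, "\n";
```

Adding 0 to the result turns "0 but true" into a plain 0 for display, and Perl exempts that particular string from the usual non-numeric warning.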
These are the basic operations available to you. The art and craft of programming lies in using these basic operations to solve complex problems like finding the number of lines in a file, reversing the order of lines in a file, randomly selecting a line from a file, building an index for a file, and so on.