
Chapter 12. Searching the Web Server

Contents:

Searching One by One
Searching One by One, Take Two
Inverted Index Search

Allowing users to search for specific information on your web site is a very important and useful feature, and one that can save them from potential frustration trying to locate particular documents. The concept behind creating a search application is rather trivial: accept a query from the user, check it against a set of documents, and return those that match the specified query. Unfortunately, there are several issues that complicate the matter, the most significant of which is dealing with large document repositories. In such cases, it's not practical to search through each and every document in a linear fashion, much like searching for a needle in a haystack. The solution is to reduce the amount of data we need to search by doing some of the work in advance.

This chapter will teach you how to implement different types of search engines, ranging from the trivial, which search documents on the fly, to the most complex, which are capable of intelligent searches.

12.1. Searching One by One

The very first example that we will look at is rather trivial in that it does not perform the actual search, but passes the query to the fgrep command and processes the results.

Before we go any further, here's the HTML form that we will use to get the information from the user:

<HTML>
<HEAD>
    <TITLE>Simple 'Mindless' Search</TITLE>
</HEAD>
<BODY>
<H1>Are you ready to search?</H1>
<P>
<FORM ACTION="/cgi/grep_search1.cgi" METHOD="GET">
<INPUT TYPE="text" NAME="query" SIZE="20">
<INPUT TYPE="submit" VALUE="GO!">
</FORM>
</BODY>
</HTML>

As we mentioned above, the program is quite simple. It creates a pipe to the fgrep command and passes it the query, as well as options to perform case-insensitive searches and to return the matching filenames without any text. The program beautifies the output from fgrep by converting it to an HTML document and returns it to the browser.

fgrep returns the list of matched files in the following format:

/usr/local/apache/htdocs/how_to_script.html
/usr/local/apache/htdocs/i_need_perl.html
.
.

The program converts this to the following HTML list:

<LI><A HREF="/how_to_script.html" >how_to_script.html</A></LI>
<LI><A HREF="/i_need_perl.html">i_need_perl.html</A></LI>
.
.

Let's look at the program now, as shown in Example 12-1.

Example 12-1. grep_search1.cgi

#!/usr/bin/perl -wT
# WARNING: This code has significant limitations; see description

use strict;
use CGI;
use CGIBook::Error;

# Make the environment safe to call fgrep
BEGIN {
    $ENV{PATH} = "/bin:/usr/bin";
    delete @ENV{ qw( IFS CDPATH ENV BASH_ENV ) };
}

my $FGREP         = "/usr/local/bin/fgrep";
my $DOCUMENT_ROOT = $ENV{DOCUMENT_ROOT};
my $VIRTUAL_PATH  = "";

my $q       = new CGI;
my $query   = $q->param( "query" );

# Strip out everything except word characters and spaces,
# then untaint the result with a regular expression match
$query =~ s/[^\w ]//g;
$query =~ /([\w ]+)/;
$query = $1;

unless ( defined $query and $query ne "" ) {
    error( $q, "Please specify a valid query!" );
}

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Search with fgrep" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;


sub search {
    my( $q, $query ) = @_;
    local *PIPE;
    my $matches = "";
    
    open PIPE, "$FGREP -il '$query' $DOCUMENT_ROOT/* |"
        or die "Cannot open fgrep: $!";
    
    while ( <PIPE> ) {
        chomp;
        s|.*/||;
        $matches .= $q->li(
                        $q->a( { href => "$VIRTUAL_PATH/$_" }, $_ )
                    );
    }
    close PIPE;
    return $matches;
}

We initialize three globals -- $FGREP, $DOCUMENT_ROOT, and $VIRTUAL_PATH -- which store the path to the fgrep binary, the search directory, and the virtual path to that directory, respectively. If you do not want the program to search the web server's top-level document directory, you should change $DOCUMENT_ROOT to reflect the full path of the directory where you want to enable searches. If you do make such a change, you will also need to modify $VIRTUAL_PATH to reflect the URL path to the directory.
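
For example, if you wanted to limit searches to an articles subdirectory of the document root (the directory name here is only an illustration), the two variables might be set like this:

my $DOCUMENT_ROOT = "$ENV{DOCUMENT_ROOT}/articles";
my $VIRTUAL_PATH  = "/articles";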

Because Perl will pass our fgrep command through a shell, we need to make sure that the query we send it is not going to cause any security problems. Let's decide to allow only "words" (represented in Perl as "a-z", "A-Z", "0-9", and "_") and spaces in the search. We proceed to strip out all characters other than words and spaces and pass the result through an additional regular expression to untaint it. We need to do this extra step because, although we know the substitution really did make the data safe, a substitution is not sufficient to untaint the data for Perl. We could have skipped the substitution and just performed the regular expression match, but it means that if someone entered an invalid character, only that part of their query before the illegal character would be included in the search. By doing the substitution first, we can strip out illegal characters and perform a search on everything else.
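
To see why the order matters, here is a short sketch with a made-up query containing an illegal character:

# Suppose the user entered this query (an illustrative value only)
my $query = "perl & cgi";

# Match alone: $1 captures only "perl ", and everything after the "&" is lost
# $query =~ /([\w ]+)/;    # $1 is "perl "

# Substitution first, then the untainting match: the whole query survives
$query =~ s/[^\w ]//g;     # $query is now "perl  cgi"
$query =~ /([\w ]+)/;      # $1 is "perl  cgi", and it is untainted
$query = $1;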

After all this, if the query is not provided or is empty, we call our familiar error subroutine to notify the user of the error. We test whether it is defined first to avoid a warning for using an undefined variable.

We open a PIPE to the fgrep command for reading, which is the purpose of the trailing "|". Notice how the syntax is not much different from opening a file. If the pipe succeeds, we can go ahead and read the results from the pipe.
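
For comparison, here is an ordinary file open next to the pipe open used in the program (the filename is only an illustration); the trailing "|" is the only thing that tells Perl to run a command and read its output instead of reading a file:

open FILE, "$DOCUMENT_ROOT/index.html"               # read from a file
    or die "Cannot open file: $!";

open PIPE, "$FGREP -il '$query' $DOCUMENT_ROOT/* |"  # read a command's output
    or die "Cannot open fgrep: $!";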

The -il options force fgrep to perform case-insensitive searches and to return only the filenames (not the matched lines). We make sure to quote the query string in case the user is searching for a multiple-word query.
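
For instance, with a query of "perl cgi" (an illustrative value) and the document root shown in the sample output above, the command handed to the shell would look like this:

/usr/local/bin/fgrep -il 'perl cgi' /usr/local/apache/htdocs/*

Without the single quotes, the shell would split the query into two separate arguments, and fgrep would treat the second word as the name of a file to search.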

Finally, the last argument to fgrep is a list of all the files that it should search. The shell expands (globs) the wildcard character into a list of all the files in the specified directory. This can cause problems if the directory contains a large number of files, as some shells have internal glob limits. We will fix this problem in the next section.

The while loop iterates through the results, setting $_ to the current record each time through the loop. We strip the end-of-line character(s) and the directory information so we are left with just the filename. Then we create a list item containing a hypertext link to the item.
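
For instance, one line of fgrep output from the sample results above is transformed in two small steps:

$_ = "/usr/local/apache/htdocs/how_to_script.html\n";
chomp;        # remove the trailing newline
s|.*/||;      # strip everything up to the last "/", leaving "how_to_script.html"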

Finally, we print out our results.

How would you rate this application? It's a simple search engine and it works well on a small collection of files, but it suffers from a few problems: it starts a new shell and fgrep process for every request, it relies on the shell to expand the wildcard into the list of files (which, as mentioned above, can fail for directories containing many files), and it reads through every file from start to finish on each query, which does not scale to a large collection of documents.

So, let's try again and create a better search engine.


