
12.2. Searching One by One, Take Two

The search engine we will create in this section is much improved. It no longer depends on fgrep to carry out the search, which means we no longer have to use the shell and thus will not run into its internal glob limit.

In addition, this application returns the matched content and highlights the query, which makes it much more useful.

How does it work? It uses Perl's own functions to build a list of all the text files in the specified directory, and then iterates over each file, searching for lines that match the query. The matching lines are stored in memory and later converted to HTML.

Example 12-2 contains the new program.

Example 12-2. grep_search2.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;
use CGIBook::Error;

my $DOCUMENT_ROOT = $ENV{DOCUMENT_ROOT};
my $VIRTUAL_PATH  = "";

my $q           = new CGI;
my $query       = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a valid query!" );
}

$query = quotemeta( $query );
my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;


sub search {
    my( $q, $query ) = @_;
    my( %matches, @files, @sorted_paths, $results );
    
    local( *DIR, *FILE );
    
    opendir DIR, $DOCUMENT_ROOT or
        error( $q, "Cannot access search dir!" );
        
    @files = grep { -T "$DOCUMENT_ROOT/$_" } readdir DIR;
    closedir DIR;
    
    foreach my $file ( @files ) {
        my $full_path = "$DOCUMENT_ROOT/$file";
        
        open FILE, $full_path or
            error( $q, "Cannot process $file!" );
        
        while ( <FILE> ) {
            if ( /$query/io ) {
                $_ = html_escape( $_ );
                s|($query)|<B>$1</B>|gio;
                push @{ $matches{$full_path}{content} }, $_;
                $matches{$full_path}{file} = $file;
                $matches{$full_path}{num_matches}++;
            }
        }
        close FILE;
    }
    
    @sorted_paths = sort {
                        $matches{$b}{num_matches} <=>
                        $matches{$a}{num_matches} ||
                        $a cmp $b
                    } keys %matches;
    
    foreach my $full_path ( @sorted_paths ) {
        my $file        = $matches{$full_path}{file};
        my $num_matches = $matches{$full_path}{num_matches};
        
        my $link = $q->a( { -href => "$VIRTUAL_PATH/$file" }, $file );
        my $content = join $q->br, @{ $matches{$full_path}{content} };
        
        $results .= $q->p( $q->b( $link ) . " ($num_matches matches)" .
                           $q->br . $content
                    );
    }
    
    return $results;
}


sub html_escape {
    my( $text ) = @_;
    
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    
    return $text;
}

This program starts out like our previous example. Since we are searching for the query without exposing it to the shell, we no longer have to strip out any characters from the query. Instead we escape any characters that may be interpreted in a regular expression by calling Perl's quotemeta function.
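
For example, here is a small standalone sketch of what quotemeta does to a query containing regex metacharacters (the sample string is our own):

#!/usr/bin/perl -w

use strict;

# A query a user might enter, containing regex metacharacters
my $query = "C++ (2nd ed.)";

# quotemeta backslash-escapes every non-word character so the query
# matches literally when interpolated into a regex
my $safe_query = quotemeta $query;

print "$safe_query\n";    # prints: C\+\+\ \(2nd\ ed\.\)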

The opendir function opens the specified directory and returns a handle that we can use to get a list of all the files in that directory. It's a waste of time to search through binary files, such as sounds and images, so we use Perl's grep function (not to be confused with the Unix grep and fgrep applications) to filter them out.

In this context, the grep function iterates over a list of filenames returned by readdir -- setting $_ for each element -- and evaluates the expression specified within the braces, returning only the elements for which the expression is true.

We are using readdir in list context so that we can pass the list of all files in the directory to grep for processing. But there is a problem with this approach. The readdir function returns just the name of each file, not the full path, which means that we have to construct a full path before we can pass it to the -T operator. We use the $DOCUMENT_ROOT variable to create the full path to the file.

The -T operator returns true if the file is a text file. After grep finishes processing all the files, @files will contain a list of all the text files.
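
To make the filtering step more concrete, here is a short sketch that does the same job as the grep call in Example 12-2, written as an explicit loop (the directory path is hypothetical):

#!/usr/bin/perl -w

use strict;

# A hypothetical document root; adjust it for your own server
my $dir = "/usr/local/apache/htdocs";

opendir DIR, $dir or die "Cannot open $dir: $!";

# Keep only the entries that are text files; readdir returns bare
# filenames, so we prepend the directory to build a full path
my @files;
foreach my $name ( readdir DIR ) {
    push @files, $name if -T "$dir/$name";
}
closedir DIR;

print "$_\n" foreach @files;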

We iterate through the @files array, setting $file to the current value each time through the loop. We proceed to open the file, making sure to return an error if we cannot open it, and iterate through it one line at a time.

The %matches hash is keyed by the full path of each matching file. Each entry holds three elements: file, which stores the name of the file (we need the filename for output purposes); num_matches, which stores the number of matching lines; and a content array, which holds all the lines containing matches.
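
As a rough illustration, once the search loop has run, %matches might hold something like the following (the path and lines are invented):

# Hypothetical contents of %matches after searching for "plum"
my %matches = (
    "/usr/local/apache/htdocs/recipes.html" => {
        file        => "recipes.html",
        num_matches => 2,
        content     => [
            "A Bavarian <B>plum</B> cake is best in autumn.",
            "Dust the <B>plum</B>s generously with sugar.",
        ],
    },
);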

We use a simple case-insensitive regex to search for the query. The o option compiles the regex only once, which greatly improves the speed of the search. Note that this will cause problems for scripts running under mod_perl or FastCGI, which we'll discuss later in Chapter 17, "Efficiency and Optimization".

If the line contains a match, we escape characters that could be mistaken for HTML markup, bold the matched text, push the line onto that file's content array, and increment that file's match counter.
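
Here is that escape-and-highlight step applied to a single made-up line in isolation:

#!/usr/bin/perl -w

use strict;

my $query = quotemeta "plum";
my $line  = 'A <i>Bavarian</i> plum cake & a Plum tart';

# Escape characters that could be mistaken for HTML markup ...
$line =~ s/&/&amp;/g;
$line =~ s/</&lt;/g;
$line =~ s/>/&gt;/g;

# ... then wrap each match in <B> tags, preserving its original case
$line =~ s|($query)|<B>$1</B>|gi;

print "$line\n";
# A &lt;i&gt;Bavarian&lt;/i&gt; <B>plum</B> cake &amp; a <B>Plum</B> tart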

After we have finished looking through the files, we sort the results by the number of matches found, in decreasing order, and then alphabetically by path for files that have the same number of matches.
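
The same sorting idiom can be seen on its own in this small sketch, which orders invented sample data by count in descending order and breaks ties alphabetically:

#!/usr/bin/perl -w

use strict;

# Invented sample data: path => number of matching lines
my %count = (
    "/docs/c.html" => 7,
    "/docs/a.html" => 3,
    "/docs/b.html" => 3,
);

# Sort by count, highest first; break ties alphabetically by path
my @sorted = sort {
                 $count{$b} <=> $count{$a} ||
                 $a cmp $b
             } keys %count;

print "$_ ($count{$_} matches)\n" foreach @sorted;

# Prints:
#   /docs/c.html (7 matches)
#   /docs/a.html (3 matches)
#   /docs/b.html (3 matches)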

To generate our results, we walk through our sorted list. For each file, we create a link and display the number of matches and all the lines that matched the query. Since the content exists as individual elements in an array, we join all the elements together into one large string delimited by an HTML break tag.

Now, let us improve on this application a bit by allowing users to specify regular expression searches. We will not present the entire application, since it is very similar to the one we have just covered.

12.2.1. Regex-Based Search Engine

By allowing users to specify regular expressions in their search, we make the search engine much more powerful. For example, a user who wants to search for the recipe for Zwetschgendatschi (a Bavarian plum cake) from your online collection, but is not sure of the exact spelling, could simply enter Zwet.+?chi to find it.

In order to implement this functionality, we have to add several pieces to the search engine.

First, we need to modify the HTML file to provide an option for the user to turn the functionality on or off:

Regex Searching: 
    <INPUT TYPE="radio" NAME="regex" VALUE="on">On
    <INPUT TYPE="radio" NAME="regex" VALUE="off" CHECKED>Off

Then, we need to check for this value in the application and act accordingly. Here is the beginning of the new search script:

#!/usr/bin/perl -wT

use strict;
use CGI;
use CGIBook::Error;

my $q     = new CGI;
my $regex = $q->param( "regex" );
my $query = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a query!" );
}

if ( defined $regex and $regex eq "on" ) {
    eval { /$query/o };
    error( $q, "Invalid Regex") if $@;
}
else {
    $query = quotemeta $query;
}

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Regex Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;
.
.

The rest of the code remains the same. What we are doing differently here is checking if the user chose the "regex" option and if so, evaluating the user-specified regex at runtime using the eval function. We can check to see whether the regex is invalid by looking at the value stored in $@. Perl sets this variable if there is an error in the evaluated code. If the regex is valid, we can go ahead and use it directly, without quoting the specified metacharacters. If the "regex" option was not requested, we perform the search as before.
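
On its own, the validation step looks something like this sketch (the sample patterns are our own):

#!/usr/bin/perl -w

use strict;

# Two hypothetical user-supplied patterns: one valid, one not
foreach my $pattern ( 'Zwet.+?chi', 'broken(pattern' ) {
    # Compiling the pattern inside eval traps any regex syntax error
    eval { "" =~ /$pattern/ };
    
    if ( $@ ) {
        print "Invalid regex: $pattern\n";
    }
    else {
        print "Valid regex:   $pattern\n";
    }
}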

As you can see, both of these applications are much improved over the first one, but neither one of them is perfect. Since both of them are based on a linear search algorithm, the search process will be slow when dealing with directories that contain many files. They also search only one directory. They could be modified to recurse down through subdirectories, but that would decrease the performance even more. In the next section, we will look at an index-based approach that calls for creating a dictionary of relevant words in advance, and then searching it rather than the actual files.


