The search system is based on a simple search index that uses 2 tables:
The search index is constructed by finding all content that is unique to a URL (e.g. the content objects that are page-specific), stripping out all markup, and counting the incidence of each remaining word in the text. For web pages, we also index words in filenames, titles, descriptions, and keywords.
Indexing of a web site consists of:
Make a search object:
my $s = new ExSite::Search;
You must first index your site(s) before you can perform any searches:
$s->index_site($section);
$section can be a section ID or a section datahash.
To generate a search form:
my $form_html = $s->search_form($term,$title,$width);
The parameters are all optional. $term is a term to prepopulate the search field with. $title is a title/heading. $width is the size of the search field (in characters).
To perform a search on the terms in a search string:
my $results_html = $s->do_search($searchstring);
To get just the list of search hits:
my $results_html = $s->display_results( $s->search($searchstring) );
The Search plug-in provides a simple interface to these functions.
The search system breaks each block of content down to a stream of plain text. All tags and non-text content (such as scripts and CSS) are removed, to leave just the human-readable words and text on the page. Then we strip out all punctuation and other non-word characters to leave just alphanumeric text and whitespace. We convert the text to lower case, and break it out into individual terms, splitting on whitespace. This has a few consequences that may be important for the developer to understand, such as:
Each term is then counted, and the count is multiplied by a weight factor for that content block. The resulting score determines how significant a hit on that term is for that URL.
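The normalisation and scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not the ExSite::Search implementation: tokenize() and score_terms() are hypothetical names, the weight value is made up, and the sketch assumes that tags are replaced by a space while punctuation is simply deleted.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce an HTML fragment to a list of lower-case terms.
sub tokenize {
    my ($html) = @_;
    my $text = $html;
    # Drop script and style blocks entirely, then any remaining tags.
    # Assumption: a removed tag leaves a space so neighbouring words stay apart.
    $text =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>/ /gis;
    $text =~ s/<[^>]*>/ /g;
    # Assumption: punctuation is deleted, so "Surf's" becomes "Surfs".
    $text =~ s/[^A-Za-z0-9\s]//g;
    # Lower-case, split on whitespace, discard empty fields.
    return grep { length } split /\s+/, lc $text;
}

# Hypothetical helper: weighted term counts for one content block.
# Adding the weight once per occurrence equals count * weight.
sub score_terms {
    my ($html, $weight) = @_;
    my %score;
    $score{$_} += $weight for tokenize($html);
    return \%score;
}

my $scores = score_terms(q{<h1>Surf Report</h1><p>Surf's up: great surf today!</p>}, 2);
printf "%s => %d\n", $_, $scores->{$_} for sort keys %$scores;
```

With a weight of 2, "surf" occurs twice and scores 4, while "surfs" (from "Surf's") is a separate term that scores 2.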
Search terms can optionally be prefixed with a + or - character, which changes the search rules:
You can combine these for some extra logical control over your searches. For example:
Certain terms can be ignored entirely by the search index. These skipwords are simply not inserted into the index, no matter how often or where they appear. They are ignored in search queries, and attempts to search for just these terms will find nothing.
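As a rough sketch of this behaviour, the same filter can be applied both to terms being indexed and to terms in a query. The helper name remove_skipwords() and the four-word list are illustrative assumptions, not part of ExSite.

```perl
use strict;
use warnings;

# Illustrative skipword list; the real list comes from the configuration.
my %skip = map { ($_ => 1) } qw(the a of and);

# Hypothetical helper: drop skipwords from a list of terms.
sub remove_skipwords {
    my ($skip, @terms) = @_;
    return grep { !$skip->{$_} } @terms;
}

# Terms headed for the index: skipwords never get stored.
my @indexed = remove_skipwords(\%skip, qw(the history of surfing));
print "indexed: @indexed\n";

# A query made up only of skipwords has nothing left to look up.
my @query = remove_skipwords(\%skip, qw(the of));
print "query has no usable terms; nothing will be found\n" unless @query;
```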
There are two ways to define the list of skipwords. Method 1 is to simply list them in the configuration parameter $config{search}{skipwords}. You can add to this list using the configuration file notation:
search.skipwords += foo
search.skipwords += bar
Method 2 is to keep the skipwords in a file. If the search.skipwords parameter is not an array of words, but is just a scalar string, that string is understood to be the name of a file containing the skipwords, one per line. For example:
search.skipwords = skipwords.txt
This file will be sought in the conf subdirectory of cgi-bin.
A fairly comprehensive sample file is included with ExSite, containing
over 500 words that by themselves carry little meaning and therefore
do not help to distinguish one search topic from another. This file
may be edited or replaced as needed.
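The two configuration styles could be handled along these lines. This is only a sketch: load_skipwords() is a hypothetical helper, the directory is passed in explicitly rather than derived from the ExSite configuration, and lower-casing the words is an assumption made so that they compare equal to lower-cased index terms.

```perl
use strict;
use warnings;

# Hypothetical helper: accept either an array reference of words or a
# scalar naming a file with one word per line; return a lookup hash ref.
sub load_skipwords {
    my ($setting, $conf_dir) = @_;
    my @words;
    if (ref $setting eq 'ARRAY') {
        @words = @$setting;                  # copy of the listed words
    }
    elsif (defined $setting && length $setting) {
        my $path = "$conf_dir/$setting";     # scalar: treat as a file name
        open my $fh, '<', $path or die "Cannot read $path: $!";
        @words = <$fh>;
        close $fh;
    }
    s/^\s+|\s+$//g for @words;               # trim newlines and stray spaces
    my %lookup = map { (lc($_) => 1) } grep { length } @words;
    return \%lookup;
}

# List form, as built up by "search.skipwords += foo" style settings.
my $skip = load_skipwords([qw(foo bar)], '.');
print join(', ', sort keys %$skip), "\n";

# File form would look like this (the directory argument is an assumption):
# my $skip_from_file = load_skipwords('skipwords.txt', 'cgi-bin/conf');
```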
You cannot search for partial words. For example, "surf" does not match "surfing".
Quotes are ignored, and any words in a quoted phrase are searched for individually.
Searches for negative numbers, e.g. "-99", will be understood to mean "exclude '99' from the search results".
The indexer does not index alt text on images.
The indexer does not index any plug-ins that have not been configured as a service.
Only English skipwords are provided.
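The limitations on partial words, quotes, and negative numbers can be illustrated with a sketch of how a query string might be broken into terms. This assumes queries are normalised the same way as indexed text; parse_query() and its bucket names are hypothetical, and the sketch only records a + prefix without modelling what it does.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query into plain, "+"-prefixed and
# "-"-prefixed terms, normalising each term like indexed text.
sub parse_query {
    my ($query) = @_;
    my %terms = (plain => [], plus => [], minus => []);
    for my $word (split ' ', $query) {
        my $bucket = 'plain';
        if ($word =~ s/^([+-])//) {
            $bucket = $1 eq '+' ? 'plus' : 'minus';
        }
        # Quotes and other punctuation are dropped, so a quoted phrase
        # ends up as ordinary individual terms.
        (my $term = lc $word) =~ s/[^a-z0-9]//g;
        push @{ $terms{$bucket} }, $term if length $term;
    }
    return \%terms;
}

my $q = parse_query(q{"big wave" surfing -99});
print "plain: @{ $q->{plain} }\n";   # the quoted words are searched separately
print "minus: @{ $q->{minus} }\n";   # "-99" is read as "exclude 99"

# Lookups are exact matches on whole terms, so "surf" does not find "surfing".
my %index = (surfing => 1);
print "surf found: ", (exists $index{surf} ? "yes" : "no"), "\n";
```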