ExSite comes with a built-in Search framework that provides simple but flexible searching capabilities. It allows plug-ins to hook into the search system and provide their own content for searching. This document describes how to access the internals of the search system from your own code.
It can be costly to trawl diverse sections of the database for search terms in real time, so the ExSite search framework indexes its content in advance, allowing the user to perform efficient searches against the search index.
The search index has two tables:
The search index is normally assembled by automated tools, but in principle it can be hand-edited, for instance to remove certain URLs or words from the index, or to raise/lower the importance of some words.
We try to include only page-specific content in the search index. General-purpose content (especially words and content from templates, menus, etc.) is skipped, so that common terms do not generate search hits for every page in a site.
The basic search procedure is:
You can perform more complicated searches by adding the + and - prefixes to your search terms. "+word" means "word" must appear in the results. "-word" means pages containing "word" will be excluded from the results. So:
"disk drive" means search for pages that contain either word, and weight them accordingly.
"+disk +drive" means search for pages that contain both words.
"-disk drive" means search for pages that contain "drive" but not "disk".
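The prefix rules above can be illustrated with a small stand-alone sketch. This is not ExSite's actual query parser; parse_query() and score_page() are hypothetical helpers, and the word => weight hash used to represent a page is an assumption made only for this example.

```perl
use strict;
use warnings;

# Split a query string into required (+), excluded (-), and optional terms.
# Hypothetical helper, not part of ExSite::Search.
sub parse_query {
    my ($query) = @_;
    my %terms = (required => [], excluded => [], optional => []);
    foreach my $token (split ' ', lc($query)) {
        if    ($token =~ /^\+(.+)$/) { push @{$terms{required}}, $1; }
        elsif ($token =~ /^-(.+)$/)  { push @{$terms{excluded}}, $1; }
        else                         { push @{$terms{optional}}, $token; }
    }
    return \%terms;
}

# Score one page against a parsed query.
# $weights is a hash ref of word => weight for that page (assumed layout).
# Returns 0 if the page is rejected, otherwise the summed weight of matches.
sub score_page {
    my ($terms, $weights) = @_;
    # any excluded word present rejects the page
    foreach my $word (@{$terms->{excluded}}) {
        return 0 if exists $weights->{$word};
    }
    # any required word missing rejects the page
    foreach my $word (@{$terms->{required}}) {
        return 0 unless exists $weights->{$word};
    }
    # remaining matches contribute their weight to the score
    my $score = 0;
    foreach my $word (@{$terms->{required}}, @{$terms->{optional}}) {
        $score += $weights->{$word} // 0;
    }
    return $score;
}
```

With a page represented as { disk => 2, drive => 3 }, the query "disk drive" scores both words, while "-disk drive" rejects the page because "disk" is present.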
You can use the Search plug-in to build search indexes for all of your sites. It does not index your site automatically, but only on request. You should periodically rebuild your search index to ensure that the index is up to date.
You can also use the ExSite::Search library to access the search index tools programmatically. Here is a quick recipe to illustrate general usage:
# make a search object
my $search = ExSite::Search->new;
# update the index for a whole website
$search->index_site($section_id);
# update the index for a particular web page
# $page can be a page hash, page ID, or an ExSite::Page object
$search->index_page($page);
# commit the search index to the database
$search->update;
The basic search index indexes the default view of each page in the site. If the page contains an embedded plug-in that has self-referential links to dig for more content (eg. news archives, event calendars, etc.), only the "front page" of the plug-in will be indexed. Content that is hidden deeper inside the page will not be indexed. However, plug-ins can be instructed to index "deep" content (see below).
Any words in the file /cgi/conf/skipwords.txt will be excluded from the search index. You can customize the contents of this file, or use a different file of words (one per line) in /conf with the config setting:
search.skipwords = custom_skipwords.txt
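To make the effect of a skipword file concrete, here is a minimal stand-alone sketch of the idea. It does not reproduce how ExSite loads or applies its skipwords; load_skipwords() and filter_words() are hypothetical helpers, and the only detail taken from this document is the one-word-per-line file format.

```perl
use strict;
use warnings;

# Load a skipword list: one word per line, blank lines ignored.
# Hypothetical helper, not part of ExSite.
sub load_skipwords {
    my ($path) = @_;
    my %skip;
    open(my $fh, '<', $path) or die "Cannot open skipword file: $!";
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace and the newline
        next if $line eq '';
        $skip{lc($line)} = 1;
    }
    close($fh);
    return \%skip;
}

# Return the lower-cased words of $text that are not skipwords.
sub filter_words {
    my ($text, $skip) = @_;
    return grep { !$skip->{$_} } map { lc($_) } ($text =~ /(\w+)/g);
}
```

For example, with a skipword file containing "the" and "and", the text "The disk and the drive" reduces to the words "disk" and "drive".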
Using the ExSite::Search library, you can modify the search index in a number of ways.
Initialize a new URL in the index before adding any terms under it (only the first three parameters are required):
$search->add_url(
$url, # the URL we are indexing
$section_id, # the website section this URL belongs to
$title, # the URL title
$description, # the URL description
$privacy, # "public" or "members-only"
);
Index an arbitrary block of plain text:
$search->index($url,$weight,$text);
$text is a string of plain text to scan and index; $url is the URL at which this text can be viewed; and $weight is the importance of words found in this text (an integer value; 1 is the default, 10 is very high). If your text is HTML, strip it down to plain text before indexing it; ExSite::Misc::html_to_plaintext() may be useful here.
You can add individual words to the index. This may be useful in some cases where the actual word does not appear significantly in the actual page (eg. when the content is primarily images).
$search->index_term($url,$word); # weight is 1
$search->index_term($url,$word,$weight); # explicit weight
To clear old entries from the index, do this:
$search->clear_site($section_id);
This is done automatically whenever you re-index a site. Lastly, don't forget to update the search database when you are done. ($threshold is optional and defaults to 1; terms whose weight is below the threshold will be discarded from the index.)
$search->update($threshold);
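The interaction between weights and the threshold can be pictured with a toy in-memory index. This is only a sketch: it assumes that weights for the same word under the same URL accumulate additively, which this document does not state, and toy_index() and toy_prune() are hypothetical stand-ins rather than ExSite::Search methods.

```perl
use strict;
use warnings;

# Toy in-memory index: $index->{$url}{$word} = accumulated weight.
# Assumption: repeated words simply add their weights together.
sub toy_index {
    my ($index, $url, $weight, $text) = @_;
    foreach my $word ($text =~ /(\w+)/g) {
        $index->{$url}{lc($word)} += $weight;
    }
    return;
}

# Drop every term whose accumulated weight is below $threshold (default 1).
sub toy_prune {
    my ($index, $threshold) = @_;
    $threshold = 1 unless defined $threshold;
    foreach my $url (keys %$index) {
        foreach my $word (keys %{$index->{$url}}) {
            delete $index->{$url}{$word}
                if $index->{$url}{$word} < $threshold;
        }
    }
    return;
}
```

Indexing a title at weight 5 and a description at weight 1, then pruning with a threshold of 2, would keep only the words that appeared in the title under these assumptions.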
By default, the search system indexes content on the default view of any given page. With some pages (in particular, those with plug-in generated content), you may need to surf deep into the plug-in to view all of its content (eg. old articles, comment archives, past events, etc.). The plug-in can offer to add such specialized items to the search index. Use the following procedure:
Say you have an event calendar plug-in that tracks some information about various events on different dates, and you want all future events to be searchable through the site's main search tool.
Have the plug-in reply to the "Search" ioctl request with the indexing method, eg.
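sub ioctl {
my $this = shift;
# use a lexical variable rather than overwriting the global $_
my $request = shift;
# offer our indexing method when the search system asks for it
if ($request =~ /Search/) {
return \&search_index;
}
# no handler for any other request
return;
}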
Then define the search_index() method to update the search index, as required for this plug-in:
sub search_index {
# $this is our EventCalendar plug-in object; the other parameters are
# objects that are given to us by the search system.
my ($this,$search,$site,$page) = @_;
# we can write some status messages to $out and return these to the
# search system
my $out;
# get a dynamic URL representing the service page
my $url = $page->get_url_dynamic();
# loop over all of the events we want to add to the index
foreach my $event ($this->get_upcoming_events()) {
# add a note to the output
$out .= "==> Indexing '".$event->{name}."'...\n";
# setup the required URL parameters to visit this special content
# (details vary between plug-ins, but it usually involves adding
# some extra parameters to the basic dynamic page URL)
my $id = $event->{id};
my $event_url = $url."&event_id=${id}";
# add this new URL to the index
# Give it a special title and description.
$search->add_url($event_url,
$site->id(),
$event->{name},
$event->{description},
"public",
);
# index the event name with a high weight
my $text = &ExSite::Misc::html_to_plaintext($event->{name});
$search->index($event_url,5,$text);
# index the event description with a normal/low weight
$text = &ExSite::Misc::html_to_plaintext($event->{description});
$search->index($event_url,1,$text);
}
# return our status messages to the search system
return $out;
}
This custom plug-in indexing tool will be automatically invoked whenever the main search system indexes the web site. Remember, you must have a service page set up to show the special content in, and the plug-in must be defined as a service, or ExSite will not attempt to index its special content.
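The hand-off between the search system and a plug-in can be sketched as follows. This is an assumption about the general calling pattern, based only on the ioctl() and search_index() signatures shown above; it is not ExSite's actual dispatch code. MockPlugin and run_plugin_indexer() are hypothetical names used for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a plug-in; not a real ExSite module.
package MockPlugin;

sub new { return bless {}, shift; }

# Reply to the "Search" request with a reference to our indexing method.
sub ioctl {
    my ($this, $request) = @_;
    return \&search_index if $request eq 'Search';
    return;
}

# Same parameter order as the search_index() example above.
sub search_index {
    my ($this, $search, $site, $page) = @_;
    return "==> indexed by ".ref($this)."\n";
}

package main;

# Ask a plug-in for its indexer and run it if one is offered.
# Returns the plug-in's status messages, or '' if it has no indexer.
sub run_plugin_indexer {
    my ($plugin, $search, $site, $page) = @_;
    my $handler = $plugin->ioctl('Search');
    return '' unless ref($handler) eq 'CODE';
    # the handler is a plain code reference, so pass the plug-in explicitly
    return $handler->($plugin, $search, $site, $page);
}
```

Because ioctl() returns a plain code reference rather than a method name, the caller has to pass the plug-in object as the first argument, which is why search_index() receives $this followed by the search, site, and page objects.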
The Search plug-in module provides a simple interface to all of the features described in this document.
To build search indexes, launch the Search plug-in from the administrator webtop, and select the appropriate task.
To include a search tool in your web site, embed the search plug-in into a page. No special options are required. If the search tool is placed into the body of the page, no further work is required (other than actually building the search index, if you have not already done so).
Sometimes you want to have a search tool embedded in the frame/wrapper of a page (ie. in the template), so that it is available on all pages. However, you don't want the search results to appear in the same spot in the template; you want the results to show in the body of the page. To do this, set up a service page for the Search plug-in, and include the search tool in the body of this page. You should also remove the search form in the page wrapper for this page. The result is that all searches will be redirected to this page, and all search-related content will appear in the body.
By default, search indexes are only rebuilt when you request it. To have your site automatically reindexed, set the following system configuration parameter:
search.reindex_on_publish = 1
This will update the search index whenever you publish a page or a whole section. If publishing a page, only the content on that page will be reindexed. If publishing a whole section, all pages as well as plug-in content will be reindexed.