Search System

ExSite comes with a built-in Search framework that provides simple but flexible searching capabilities. It allows plug-ins to hook into the search system and provide their own content for searching. This document describes how to access the internals of the search system from your own code.

The Search Index

It can be costly to trawl diverse sections of the database for search terms in real time, so the ExSite search framework indexes its content in advance, allowing the user to perform efficient searches against the search index.

The search index has two tables:

  • searchurl - every searchable URL. Each URL has a title and description (for presenting search results), and is tied to a specific section. The URL can also be flagged as public or members-only.
  • searchterm - every word found at each searchable URL. Each term has a weight, which is an integer value indicating the importance of this term on this page. The weight will be proportional to how often the term appears on the page, and where in the page it appears (eg. a term in the title is more important than one buried in the body).
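
As a rough illustration of how a term's weight could combine frequency and position, the sketch below counts the words in a page title and body, and multiplies title hits by a larger factor. The factors (5 for the title, 1 for the body), the simple tokenizer, and the function name term_weights are assumptions made for illustration; they are not ExSite's actual values or code.

```perl
use strict;
use warnings;

# Hypothetical weighting factors; ExSite's real values may differ.
my %FACTOR = (title => 5, body => 1);

# Return a hash ref of term => integer weight for one page.
sub term_weights {
    my (%part) = @_;    # e.g. (title => "...", body => "...")
    my %weight;
    foreach my $where (keys %part) {
        my $factor = $FACTOR{$where} // 1;
        # naive tokenizer: lowercase runs of letters and digits
        foreach my $word (lc($part{$where}) =~ /[a-z0-9]+/g) {
            $weight{$word} += $factor;
        }
    }
    return \%weight;
}

my $w = term_weights(
    title => "Disk Drive Repair",
    body  => "How to repair a noisy drive at home",
);

# "drive" appears once in the title (5) and once in the body (1), so 6;
# "disk" appears only in the title, so 5; "noisy" only in the body, so 1.
printf "%s => %d\n", $_, $w->{$_} foreach sort keys %$w;
```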

The search index is normally assembled by automated tools, but in principle it can be hand-edited, for instance to remove certain URLs or words from the index, or to raise/lower the importance of some words.

We try only to include page-specific content in the search index. General-purpose content (especially words/content from templates, menus, etc.) is skipped, to avoid certain terms generating search hits for every page in a site.

Search Procedure

The basic search procedure is:

  1. the user enters a number of keywords to search for
  2. we look up each of the keywords in the searchterm table, and obtain a set of URLs referencing those keywords
  3. we total the weights of our keywords at each of those URLs to obtain a score for the page. The total score is divided by the number of keywords, to mitigate results that score high on only some keywords. For example, in a search for "disk drive", one page might score 0 for "disk" and 5 for "drive". The score for that page would be 2.5 ((0 + 5) ÷ 2), whereas a page that scored 3 for "disk" and 3 for "drive" would score 3 overall, and rank higher.
  4. we sort the pages by their score, and filter by the URLs' privacy settings
  5. we report the top N results (where N defaults to 25)
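
The scoring steps above can be sketched in plain Perl against an in-memory stand-in for the searchterm table. The data layout, the function name rank_pages, and the tie-breaking by URL are assumptions for illustration; the privacy filter from step 4 is left out to keep the example short.

```perl
use strict;
use warnings;

# Toy stand-in for the searchterm table: term => { url => weight }.
my %searchterm = (
    disk  => { "/b.html" => 3 },
    drive => { "/a.html" => 5, "/b.html" => 3 },
);

# Score each URL as (sum of keyword weights) / (number of keywords),
# then return up to $limit [url, score] pairs, best first.
sub rank_pages {
    my ($index, $keywords, $limit) = @_;
    $limit //= 25;
    return () unless @$keywords;
    my %total;
    foreach my $term (@$keywords) {
        my $hits = $index->{ lc($term) } or next;
        $total{$_} += $hits->{$_} foreach keys %$hits;
    }
    my @ranked = sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] }
                 map  { [ $_, $total{$_} / @$keywords ] } keys %total;
    splice(@ranked, $limit) if @ranked > $limit;
    return @ranked;
}

# /b.html scores (3 + 3) / 2 = 3 and ranks above /a.html at (0 + 5) / 2 = 2.5
my @results = rank_pages(\%searchterm, [qw(disk drive)]);
printf "%-8s %.1f\n", @$_ foreach @results;
```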

You can perform more complicated searches by adding + and - prefixes to your search terms. "+word" means that "word" must appear on every page in the results. "-word" means that pages containing "word" are excluded from the results. So:

"disk drive" means search for pages that contain either word, and weight them accordingly.

"+disk +drive" means search for pages that contain both words.

"-disk drive" means search for pages that contain "drive" but not "disk".

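One way to picture the prefix rules is as a small parser plus an eligibility check, sketched below. The function names parse_query and page_matches, and the representation of a page as a hash of term weights, are hypothetical; this is not how ExSite::Search is necessarily implemented.

```perl
use strict;
use warnings;

# Split a query string into required (+), excluded (-) and optional terms.
sub parse_query {
    my ($query) = @_;
    my %q = (required => [], excluded => [], optional => []);
    foreach my $token (split ' ', lc($query)) {
        if    ($token =~ /^\+(.+)/) { push @{ $q{required} }, $1 }
        elsif ($token =~ /^-(.+)/)  { push @{ $q{excluded} }, $1 }
        else                        { push @{ $q{optional} }, $token }
    }
    return \%q;
}

# Decide whether a page (a hash ref of term => weight) may appear in results.
sub page_matches {
    my ($q, $terms) = @_;
    # any excluded term disqualifies the page
    return 0 if grep { exists $terms->{$_} } @{ $q->{excluded} };
    # every required term must be present
    return 0 if grep { !exists $terms->{$_} } @{ $q->{required} };
    return 1 if @{ $q->{required} };
    # with no required terms, at least one optional term must be present
    return scalar(grep { exists $terms->{$_} } @{ $q->{optional} }) ? 1 : 0;
}

my $q = parse_query('-disk drive');
print page_matches($q, { drive => 3 })            ? "match\n" : "no match\n";
print page_matches($q, { drive => 3, disk => 1 }) ? "match\n" : "no match\n";
```
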
Building the Search Index

You can use the Search plug-in to build search indexes for all of your sites. It does not index your site automatically, but only on request. You should periodically rebuild your search index to ensure that the index is up to date.

You can also use the ExSite::Search library to access the search index tools programmatically. Here is a quick recipe to illustrate general usage:

# make a search object
my $search = new ExSite::Search;

# update the index for a whole website
$search->index_site($section_id);

# update the index for a particular web page
# $page can be a page hash, page ID, or an ExSite::Page object
$search->index_page($page);

# commit the search index to the database
$search->update;

The basic search index indexes the default view of each page in the site. If the page contains an embedded plug-in that has self-referential links to dig for more content (eg. news archives, event calendars, etc.), only the "front page" of the plug-in will be indexed. Content that is hidden deeper inside the page will not be indexed. However, plug-ins can be instructed to index "deep" content (see below).

Any words in the file /cgi/conf/skipwords.txt will be excluded from the search index. You can customize the contents of this file, or use a different file of words (one per line) in /conf with the config setting:

search.skipwords = custom_skipwords.txt
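
To show what a skipwords list does, here is a minimal sketch that reads a one-word-per-line list and filters candidate terms against it. The function names load_skipwords and filter_terms are hypothetical, and the list is read from an in-memory string so the example runs on its own; ExSite's own loader may behave differently (for instance in how it treats case or blank lines).

```perl
use strict;
use warnings;

# Read a skipwords list (one word per line) from an open filehandle.
sub load_skipwords {
    my ($fh) = @_;
    my %skip;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace, including the newline
        next unless length $line;
        $skip{ lc($line) } = 1;
    }
    return \%skip;
}

# Drop skipwords from a list of candidate terms.
sub filter_terms {
    my ($skip, @words) = @_;
    return grep { !$skip->{ lc($_) } } @words;
}

# Demo with an in-memory file; a real list would be opened from disk.
my $data = "the\nand\nof\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $skip = load_skipwords($fh);
close $fh;

my @kept = filter_terms($skip, qw(The history of disk drives));
print "@kept\n";    # prints "history disk drives"
```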

Customizing the Search Index

Using the ExSite::Search library, you can modify the search index in a number of ways.

Initialize a new URL in the index before adding any terms under it (only the first three parameters are required):

$search->add_url(
    $url,         # the URL we are indexing
    $section_id,  # the website section this URL belongs to
    $title,       # the URL title
    $description, # the URL description
    $privacy,     # "public" or "members-only"
);

Index an arbitrary block of plain text:

$search->index($url,$weight,$text);

$text is a string of plain text to scan and index; $url is the URL at which this text can be viewed; and $weight is the importance of words found in this text (an integer value; 1 is the default, 10 is very high). If your text is HTML, strip it down to plain text before indexing it; ExSite::Misc::html_to_plaintext() may be useful here.

You can also add individual words to the index. This is useful when an important word does not appear prominently in the text of the page itself (eg. when the content is primarily images).

$search->index_term($url,$word);          # weight is 1
$search->index_term($url,$word,$weight); # explicit weight

To clear old entries from the index, do this:

$search->clear_site($section_id);

This is done automatically whenever you re-index a site. Lastly, don't forget to update the search database when you are done. ($threshold is optional and defaults to 1; terms whose weight is below the threshold will be discarded from the index.)

$search->update($threshold);
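
Putting these calls together, a manual indexing pass for an image-heavy page might look like the sketch below. It uses only the methods described in this section (add_url, index, index_term, update); the helper name index_gallery_page, its argument names, and the chosen weights are hypothetical. The helper accepts any object that provides those four methods, so in real use you would pass it an ExSite::Search object.

```perl
use strict;
use warnings;

# Hypothetical helper: index one image-heavy page by hand.
# $search is expected to be an ExSite::Search object (or anything
# that provides add_url, index, index_term and update).
sub index_gallery_page {
    my ($search, %arg) = @_;

    # register the URL first, before adding any terms under it
    $search->add_url($arg{url}, $arg{section_id}, $arg{title},
                     $arg{description}, "public");

    # title words get a high weight, the caption text a normal one
    $search->index($arg{url}, 5, $arg{title});
    $search->index($arg{url}, 1, $arg{caption});

    # add keywords that never appear in the page text itself
    $search->index_term($arg{url}, $_, 3) foreach @{ $arg{keywords} || [] };

    # commit the changes to the database (default threshold)
    $search->update;
    return;
}
```

A call would pass a search object along with the page details, for example the URL, section ID, title, description, caption text, and a list of extra keywords.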

Indexing "Deep" Content in Plug-ins

By default, the search system indexes content on the default view of any given page. With some pages (in particular, those with plug-in generated content), you may need to surf deep into the plug-in to view all of its content (eg. old articles, comment archives, past events, etc.). The plug-in can offer to add such specialized items to the search index. Use the following procedure:

  • The plug-in must be configured to run as a Service on the site.
  • The plug-in requires a service page to handle all requests to the plug-in.
  • The plug-in must respond to the ioctl("Search") request, and return a code reference. This code reference will be called to add deep content to the search index. Three objects will be passed as parameters to this call:
    1. an ExSite::Search object, which you can use to manipulate the search index
    2. an ExSite::Section object, which you can use to find section-specific content
    3. an ExSite::Page object, representing the service page that will be used to display the plug-in content

Example:

Say you have an event calendar plug-in that tracks some information about various events on different dates, and you want all future events to be searchable through the site's main search tool.

Have the plug-in reply to the "Search" ioctl request with the indexing method, eg.

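sub ioctl {
    my ($this, $request) = @_;
    # reply to the search system's request with our indexing method
    if ($request =~ /Search/) {
        return \&search_index;
    }
    # other requests are not handled in this example
    return;
}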

Then define the search_index() method to update the search index, as required for this plug-in:

sub search_index {
    # $this is our EventCalendar plug-in object; the other parameters are
    # objects that are given to us by the search system.
    my ($this,$search,$site,$page) = @_;

    # we can write some status messages to $out and return these to the
    # search system
    my $out;

    # get a dynamic URL representing the service page
    my $url = $page->get_url_dynamic();

    # loop over all of the events we want to add to the index
    foreach my $event ($this->get_upcoming_events()) {

        # add a note to the output
        $out .= "==> Indexing '".$event->{name}."'...\n";

        # set up the required URL parameters to visit this special content
        # (details vary between plug-ins, but it usually involves adding
        # some extra parameters to the basic dynamic page URL)
        my $id = $event->{id};
        my $event_url = $url."&event_id=${id}";

        # add this new URL to the index
        # Give it a special title and description.
        $search->add_url($event_url,
                         $site->id(),
                         $event->{name},
                         $event->{description},
                         "public",
                         );

        # index the event name with a high weight
        my $text = &ExSite::Misc::html_to_plaintext($event->{name});
        $search->index($event_url,5,$text);

        # index the event description with a normal/low weight
        $text = &ExSite::Misc::html_to_plaintext($event->{description});
        $search->index($event_url,1,$text);
    }

    # return our status messages to the search system
    return $out;
}

This custom plug-in indexing tool will be automatically invoked whenever the main search system indexes the web site. Remember, you must have a service page set up to show the special content in, and the plug-in must be defined as a service, or ExSite will not attempt to index its special content.

The Search Module

The Search plug-in module provides a simple interface to all of the features described in this document.

To build search indexes, launch the Search plug-in from the administrator webtop, and select the appropriate task.

To include a search tool in your web site, embed the search plug-in into a page. No special options are required. If the search tool is placed into the body of the page, no further work is required (other than actually building the search index, if you have not already done so).

Sometimes you want to have a search tool embedded in the frame/wrapper of a page (ie. in the template), so that it is available on all pages. However, you don't want the search results to appear in the same spot in the template; you want the results to show in the body of the page. To do this, set up a service page for the Search plug-in, and include the search tool in the body of this page. You should also remove the search form in the page wrapper for this page. The result is that all searches will be redirected to this page, and all search-related content will appear in the body.

Automatic Indexing

By default, search indexes are only rebuilt when you request it. To have your site automatically reindexed, set the following system configuration parameter:

search.reindex_on_publish = 1

This will update the search index whenever you publish a page or a whole section. If publishing a page, only the content on that page will be reindexed. If publishing a whole section, all pages as well as plug-in content will be reindexed.
