How to create highlighted search result excerpts

Overview

The highlight module requires that you have the text of the indexed document available. You can keep the text in a stored field, or if the original text is available in a file, database column, etc., just reload it on the fly. Note that you might need to process the text to remove e.g. HTML tags, wiki markup, etc.

The highlight module works on a pipeline:

  1. Run the text through an analyzer to turn it into a token stream [1].
  2. Break the token stream into “fragments” (there are several different styles of fragmentation available).
  3. Score each fragment based on how many matched query terms the fragment contains.
  4. Format the highest scoring fragments for display.
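
In code, the pipeline amounts to something like the sketch below. This is only a conceptual outline of the four steps using the callables described in the rest of this document, not the module's actual source:

# Conceptual sketch only; analyzer, fragmenter, scorer, and formatter
# stand for the callables described in the following sections.
tokens = analyzer(text)                      # 1. text -> token stream
fragments = fragmenter(text, terms, tokens)  # 2. token stream -> fragments
best = sorted(fragments, key=scorer, reverse=True)[:top]  # 3. score and rank
excerpt = formatter(text, best)              # 4. format the best fragments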

Footnotes

[1] Some search systems, such as Lucene, can use term vectors to highlight text without retokenizing it. In my tests I found that using a Position/Character term vector didn’t give any speed improvement in Whoosh over retokenizing the text. This probably needs further investigation.

Usage

The high-level interface is the highlight function:

excerpts = highlight(text, terms, analyzer,
                     fragmenter, formatter, top=3,
                     scorer=BasicFragmentScorer, minscore=1,
                     order=FIRST)
text
The original text of the document.
terms
An iterable containing the query words to match, e.g. (“render”, “shader”).
analyzer
The analyzer to use to break the document text into tokens for matching against the query terms. This is usually the analyzer for the field the query terms are in.
fragmenter
A fragmenter callable; see below.
formatter
A formatter callable; see below.
top
The number of fragments to include in the output.
scorer
A scorer callable. The only scorer currently included with Whoosh is BasicFragmentScorer, the default.
minscore
The minimum score a fragment must have to be considered for inclusion.
order
An ordering function that determines the order of the “top” fragments in the output text. This will usually be either SCORE (highest scoring fragments first) or FIRST (highest scoring fragments in their original order). (Whoosh also includes LONGER (longer fragments first) and SHORTER (shorter fragments first) as examples of ordering functions, but they probably aren’t as generally useful.)

Example
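
Here is a minimal usage sketch following the signature shown above. The no-argument construction of SimpleFragmenter and UppercaseFormatter is an assumption; check the class docstrings for the real parameters:

from whoosh.analysis import StandardAnalyzer
from whoosh.highlight import (FIRST, SimpleFragmenter, UppercaseFormatter,
                              highlight)

text = u"The render pipeline compiles each shader before the render pass."
terms = frozenset([u"render", u"shader"])

# No-argument construction is an assumption; see the class docstrings
# for the real parameters (fragment size, separator text, etc.).
excerpt = highlight(text, terms, StandardAnalyzer(),
                    SimpleFragmenter(), UppercaseFormatter(),
                    top=3, order=FIRST)
print(excerpt)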

How it works

Fragmenters

A fragmenter controls the policy of how to extract excerpts from the original text. It is a callable that takes the original text, the set of terms to match, and the token stream, and returns a sequence of Fragment objects.

The available fragmenters are:

NullFragmenter
Returns the entire text as one “fragment”. This can be useful if you are highlighting a short bit of text and don’t need to fragment it.
SimpleFragmenter
Or maybe “DumbFragmenter”: this just breaks the token stream into equal-sized chunks.
SentenceFragmenter
Tries to break the text into fragments based on sentence punctuation (“.”, “!”, and “?”). This object works by looking in the original text for a sentence end as the next character after each token’s ‘endchar’. It can be fooled by, e.g., source code, decimal numbers, etc.
ContextFragmenter
This is a “smart” fragmenter that finds matched terms and then pulls in the surrounding text to form fragments. This fragmenter only yields fragments that contain matched terms.

(See the docstrings for how to instantiate these)
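
Continuing the Usage example above (text, terms, and the imports are the same), swapping the fragmenter changes how excerpts are carved out of the document while everything else stays the same. The no-argument constructors here are again an assumption:

from whoosh.highlight import ContextFragmenter, SentenceFragmenter

# Constructor arguments (sentence length limits, amount of surrounding
# context, etc.) are omitted; the docstrings list the real parameters.
by_sentence = highlight(text, terms, StandardAnalyzer(),
                        SentenceFragmenter(), UppercaseFormatter())
by_context = highlight(text, terms, StandardAnalyzer(),
                       ContextFragmenter(), UppercaseFormatter())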

Formatters

A formatter controls how the highest scoring fragments are turned into a formatted bit of text for display to the user. It can return anything (e.g. plain text, HTML, a Genshi event stream, a SAX event generator, or anything else useful to the calling system).

Whoosh currently includes only a few formatters, because I wrote this module for myself and that’s all I needed at the time. Unless you happen to be using Genshi also, you’ll probably need to implement your own formatter. I’ll try to add more useful formatters in the future.

UppercaseFormatter
Converts the matched terms to UPPERCASE.
HtmlFormatter
Outputs a string containing HTML tags (with a class attribute) around the matched terms.
GenshiFormatter
Outputs a Genshi event stream, with the matched terms wrapped in a configurable element.

(See the docstrings for how to instantiate these)
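
For example, swapping UppercaseFormatter for HtmlFormatter in the Usage example changes only how the matched terms are wrapped. The no-argument construction is an assumption; the docstring describes how to configure the tag and class names:

from whoosh.highlight import HtmlFormatter

# No-argument construction is an assumption; see the docstring for how
# to set the tag name and class attribute used around matched terms.
html_excerpt = highlight(text, terms, StandardAnalyzer(),
                         SimpleFragmenter(), HtmlFormatter())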

Writing your own formatter

A formatter must be a callable (a function or an object with a __call__ method). It is called with the following arguments:

formatter(text, fragments)
text
The original text.
fragments
An iterable of Fragment objects representing the top scoring fragments.

The Fragment object is a simple object that has attributes containing basic information about the fragment:

Fragment.startchar
The index of the first character of the fragment.
Fragment.endchar
The index of the last character of the fragment.
Fragment.matches
An ordered list of analysis.Token objects representing the matched terms within the fragment.
Fragment.matched_terms
For convenience, a frozenset of the text of the matched terms within the fragment, i.e. frozenset(t.text for t in self.matches).

The basic work you need to do in the formatter is:

  • Take the text of the original document and pull out the bit between Fragment.startchar and Fragment.endchar.

  • For each Token object in Fragment.matches, highlight the bits of the excerpt between Token.startchar and Token.endchar. (Remember that the character indices refer to the original text, so you need to adjust them for the excerpt.)

The tricky part is that if you’re adding text (e.g. inserting HTML tags into the output), you have to be careful about keeping the character indices straight.
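
As an illustration, here is a minimal formatter that wraps each matched term in asterisks and joins the fragments with ellipses. It relies only on the attributes documented above and assumes that startchar/endchar follow Python slice conventions; the built-in formatters may handle joining and escaping differently:

def star_formatter(text, fragments):
    """Wrap each matched term in '*' and join the fragments with '...'."""
    output = []
    for fragment in fragments:
        parts = []
        # Position in the original text we have copied up to so far.
        pos = fragment.startchar
        for token in fragment.matches:
            # Token.startchar/endchar index into the original text, so we
            # always slice `text` rather than the excerpt built so far;
            # that keeps the character indices straight even though the
            # output is longer than the input.
            parts.append(text[pos:token.startchar])
            parts.append("*" + text[token.startchar:token.endchar] + "*")
            pos = token.endchar
        # Copy the tail of the fragment after the last match.
        parts.append(text[pos:fragment.endchar])
        output.append("".join(parts))
    return "...".join(output)

You would then pass star_formatter as the formatter argument to highlight().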
