The highlight module requires access to the text of the indexed document. You can keep the text in a stored field or, if the original text is available in a file, database column, etc., reload it on the fly. Note that you might need to process the text first to remove HTML tags, wiki markup, and so on.
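As an illustration of the kind of preprocessing mentioned above, here is a minimal sketch that strips HTML tags using only the Python standard library. The names strip_html and _TextExtractor are made up for this example and are not part of Whoosh. The sketch keeps every text node, including the contents of script and style elements, so real pages may need more filtering.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup):
    """Return the text content of `markup` with the tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

The result of strip_html(stored_markup) is what you would pass as the text argument when highlighting.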
The highlight module works on a pipeline:
Footnotes
[1] Some search systems, such as Lucene, can use term vectors to highlight text without retokenizing it. In my tests I found that using a Position/Character term vector didn’t give any speed improvement in Whoosh over retokenizing the text. This probably needs further investigation.
The high-level interface is the highlight function:
    excerpts = highlight(text, terms, analyzer,
                         fragmenter, formatter, top=3,
                         scorer=BasicFragmentScorer, minscore=1,
                         order=FIRST)
A fragmenter controls the policy of how to extract excerpts from the original text. It is a callable that takes the original text, the set of terms to match, and the token stream, and returns a sequence of Fragment objects.
The available fragmenters are:
(See the docstrings for how to instantiate these)
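To make the fragmenter contract concrete (a callable that receives the text, the terms, and the token stream, and returns Fragment objects), here is a rough self-contained sketch. The Token and Fragment classes below are simplified stand-ins written for this example, not the real Whoosh classes, and the matches attribute, the surround parameter, and the simple_tokens helper are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    # Stand-in token; the offsets index into the original text.
    text: str
    startchar: int
    endchar: int


@dataclass
class Fragment:
    # Stand-in fragment; `matches` is an assumed name for the matched tokens.
    startchar: int
    endchar: int
    matches: list = field(default_factory=list)


def simple_tokens(text):
    """Toy analyzer: lowercased word tokens with character offsets."""
    for m in re.finditer(r"\w+", text):
        yield Token(m.group().lower(), m.start(), m.end())


def context_fragmenter(text, terms, tokens, surround=20):
    """Return fragments covering `surround` characters around each match."""
    fragments = []
    for token in tokens:
        if token.text not in terms:
            continue
        start = max(0, token.startchar - surround)
        end = min(len(text), token.endchar + surround)
        if fragments and start <= fragments[-1].endchar:
            # This window overlaps the previous fragment, so extend that one.
            last = fragments[-1]
            last.endchar = max(last.endchar, end)
            last.matches.append(token)
        else:
            fragments.append(Fragment(start, end, [token]))
    return fragments
```

For example, context_fragmenter(text, {"fox"}, simple_tokens(text), surround=6) returns one fragment whose startchar and endchar delimit the match plus six characters of context on each side.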
A formatter controls how the highest scoring fragments are turned into a formatted bit of text for display to the user. It can return anything (e.g. plain text, HTML, a Genshi event stream, a SAX event generator, anything useful to the calling system).
Whoosh currently includes only two formatters, because I wrote this module for myself and that’s all I needed at the time. Unless you happen to be using Genshi also, you’ll probably need to implement your own formatter. I’ll try to add more useful formatters in the future.
(See the docstrings for how to instantiate these)
A formatter must be a callable (a function or an object with a __call__ method). It is called with the following arguments:
formatter(text, fragments)
The Fragment object is a simple object that has attributes containing basic information about the fragment:
The basic work you need to do in the formatter is:
* Extract each excerpt from the original text using Fragment.startchar and Fragment.endchar.
* For each matched token in the fragment, highlight the text in the excerpt between Token.startchar and Token.endchar. (Remember that the character indices refer to the original text, so you need to adjust them for the excerpt.)
The tricky part is that if you’re adding text (e.g. inserting HTML tags into the output), you have to be careful about keeping the character indices straight.
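One way to sidestep the index bookkeeping is to never modify a string in place: walk each fragment from left to right, slice the original text using the original character indices, and join the pieces at the end. The sketch below does this for HTML output. It is only an illustration under stated assumptions: Token and Fragment are simplified stand-ins rather than the real Whoosh classes, the matches attribute is an assumed name for the fragment's matched tokens, and the tag and between parameters are invented for this example.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Token:
    # Stand-in token; the offsets index into the original text.
    text: str
    startchar: int
    endchar: int


@dataclass
class Fragment:
    # Stand-in fragment; `matches` is an assumed name for the matched tokens.
    startchar: int
    endchar: int
    matches: list = field(default_factory=list)


def html_formatter(text, fragments, tag="b", between=" ... "):
    """Format fragments as HTML, wrapping each matched token in <tag>."""
    excerpts = []
    for fragment in fragments:
        pieces = []
        cursor = fragment.startchar  # position in the ORIGINAL text
        for token in sorted(fragment.matches, key=lambda t: t.startchar):
            if token.startchar < cursor:
                continue  # skip a token that overlaps one already emitted
            # Unhighlighted text between the previous match and this one.
            pieces.append(escape(text[cursor:token.startchar]))
            matched = escape(text[token.startchar:token.endchar])
            pieces.append("<%s>%s</%s>" % (tag, matched, tag))
            cursor = token.endchar
        # Remaining text after the last match in this fragment.
        pieces.append(escape(text[cursor:fragment.endchar]))
        excerpts.append("".join(pieces))
    return between.join(excerpts)
```

Because every slice is taken from the original text, inserting tags (or escaping characters such as < and &) never shifts the indices of later tokens.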