highlight module

The highlight module contains classes and functions for displaying short excerpts from hit documents in the search results you present to the user, with query terms highlighted.

See how to highlight terms in search results.

Fragmenters

whoosh.highlight.NullFragmenter(text, tokens)
Doesn’t fragment the token stream. This object just returns the entire stream as one “fragment”. This is useful if you want to highlight the entire text.

class whoosh.highlight.SimpleFragmenter(size=70)

Simply splits the text into roughly equal-sized chunks.

Parameter: size – the target chunk size, in characters. Chunking happens on token boundaries, so the fragments will usually be somewhat smaller than this.
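The token-based chunking described above can be illustrated with a small standalone sketch. It does not use Whoosh: the function name simple_chunks and the \w+ regular expression standing in for an analyzer are assumptions made for illustration only.

```python
import re

def simple_chunks(text, size=70):
    """Group consecutive word tokens into chunks of at most `size` characters.

    Illustrative only: a regex stands in for a real analyzer's token stream.
    """
    chunks = []
    start = end = None  # character span of the chunk being built
    for m in re.finditer(r"\w+", text):
        if start is None:
            start, end = m.start(), m.end()
        elif m.end() - start > size:
            # Adding this token would overflow the chunk: close it and start anew.
            chunks.append(text[start:end])
            start, end = m.start(), m.end()
        else:
            end = m.end()
    if start is not None:
        chunks.append(text[start:end])
    return chunks

chunks = simple_chunks("alpha beta gamma delta epsilon", size=12)
```

Because a chunk always ends on a token boundary, each chunk here is at most 12 characters long and usually shorter. A single token longer than size would still form its own oversized chunk in this sketch.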
class whoosh.highlight.SentenceFragmenter(maxchars=200, sentencechars='.!?')

Breaks the text up on sentence end punctuation characters (“.”, “!”, or “?”). This object works by looking in the original text for a sentence end as the next character after each token’s ‘endchar’.

When highlighting with this fragmenter, you should use an analyzer that does NOT remove stop words, for example:

from whoosh.analysis import StandardAnalyzer
sa = StandardAnalyzer(stoplist=None)
Parameter: maxchars – The maximum number of characters allowed in a fragment.
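To see why stop-word removal matters here, the following standalone sketch imitates the "look at the character after each token" approach. It is not Whoosh's implementation: sentence_fragments, its stopwords argument, and the regex tokenizer are hypothetical stand-ins.

```python
import re

def sentence_fragments(text, maxchars=200, sentencechars=".!?", stopwords=()):
    """Split text into sentence-like fragments by checking the character
    that follows each kept token. Illustrative sketch only."""
    fragments = []
    start = None     # start offset of the fragment being built
    prev_end = None  # end offset of the last kept token
    for m in re.finditer(r"\w+", text):
        if m.group().lower() in stopwords:
            continue  # a removed stop word is invisible to the fragmenter
        if start is not None and m.end() - start > maxchars:
            # Adding this token would exceed maxchars: close the fragment early.
            fragments.append(text[start:prev_end])
            start = None
        if start is None:
            start = m.start()
        prev_end = m.end()
        if prev_end < len(text) and text[prev_end] in sentencechars:
            # Sentence end found right after this token: include the punctuation.
            fragments.append(text[start:prev_end + 1])
            start = None
    if start is not None:
        fragments.append(text[start:prev_end])
    return fragments

text = "It works. Does it? Yes!"
all_tokens = sentence_fragments(text)
without_it = sentence_fragments(text, stopwords={"it"})
```

With every token kept, the text splits into three sentences. When "it" is dropped as a stop word, the "?" is never seen directly after a token, so the last two sentences run together.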
class whoosh.highlight.ContextFragmenter(termset, maxchars=200, surround=20)

Looks for matched terms and aggregates them with their surrounding context.

This fragmenter only yields fragments that contain matched terms.

Parameters:
  • termset – A collection (probably a set or frozenset) containing the terms you want to match to token.text attributes.
  • maxchars – The maximum number of characters allowed in a fragment.
  • surround – The number of extra characters of context to add both before the first matched term and after the last matched term.
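The interaction of termset, maxchars, and surround can be sketched without Whoosh as follows. The function name context_fragments, the regex tokenizer, and the exact rule for merging nearby matches are assumptions chosen for illustration; the real fragmenter may group matches differently.

```python
import re

def context_fragments(text, termset, maxchars=200, surround=20):
    """Return excerpts around matched terms, merging matches that sit close
    together. Illustrative sketch only."""
    groups = []  # (start, end) character spans covering runs of nearby matches
    for m in re.finditer(r"\w+", text):
        if m.group().lower() not in termset:
            continue
        if groups:
            gstart, gend = groups[-1]
            close_enough = m.start() - gend <= surround
            fits = (m.end() - gstart) + 2 * surround <= maxchars
            if close_enough and fits:
                # Extend the current group to cover this match as well.
                groups[-1] = (gstart, m.end())
                continue
        groups.append((m.start(), m.end()))
    # Pad each group with `surround` characters of context, clamped to the text.
    return [text[max(0, s - surround):min(len(text), e + surround)]
            for s, e in groups]

one = context_fragments("aaaa bbbb TERM cccc dddd", {"term"}, surround=5)
merged = context_fragments("apple pie apple", {"apple"})
none = context_fragments("nothing here", {"term"})
```

Text without any matched term produces no fragments at all, which mirrors the behavior described above.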

Scorers

whoosh.highlight.BasicFragmentScorer(f)

Formatters

class whoosh.highlight.UppercaseFormatter(between='...')

Returns a string in which the matched terms are in UPPERCASE.

Parameter: between – the text to add between fragments.
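As a rough illustration of what such a formatter produces, the sketch below uppercases matched terms inside each fragment and joins the fragments with the between string. It is independent of Whoosh: uppercase_format, the (start, end) fragment tuples, and the regex tokenizer are hypothetical simplifications.

```python
import re

def uppercase_format(text, fragments, termset, between="..."):
    """Uppercase matched terms in each (start, end) fragment and join the
    fragments with `between`. Illustrative sketch only."""
    def upper_if_match(m):
        word = m.group()
        return word.upper() if word.lower() in termset else word

    pieces = [re.sub(r"\w+", upper_if_match, text[start:end])
              for start, end in fragments]
    return between.join(pieces)

result = uppercase_format("red fox, blue sky", [(0, 7), (9, 17)], {"fox", "sky"})
```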
class whoosh.highlight.HtmlFormatter(tagname='strong', between='...', classname='match', termclass='term', maxclasses=5, attrquote='"')

Returns a string containing HTML formatting around the matched terms.

This formatter wraps matched terms in an HTML element with two class names. The first class name (set with the constructor argument classname) is the same for each match. The second class name (set with the constructor argument termclass) is different depending on which term matched. This allows you to give different formatting (for example, different background colors) to the different terms in the excerpt.

>>> from whoosh.highlight import HtmlFormatter
>>> hf = HtmlFormatter(tagname="span", classname="match", termclass="term")
>>> hf(mytext, myfragments)
'The <span class="match term0">template</span> <span class="match term1">geometry</span> is...'

This object maintains a dictionary mapping terms to HTML class names (e.g. term0 and term1 above), so that multiple excerpts will use the same class for the same term. If you want to re-use the same HtmlFormatter object with different searches, you should call HtmlFormatter.clear() between searches to clear the mapping.

Parameters:
  • tagname – the tag to wrap around matching terms.
  • between – the text to add between fragments.
  • classname – the class name to add to the elements wrapped around matching terms.
  • termclass – the class name prefix for the second class which is different for each matched term.
  • maxclasses – the maximum number of term classes to produce. This limits the number of classes you have to define in CSS by recycling term class names. For example, if you set maxclasses to 3 and have 5 terms, the 5 terms will use the CSS classes term0, term1, term2, term0, term1.
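The class-name recycling controlled by termclass and maxclasses, and the effect of clearing the mapping, can be sketched as below. This is a standalone illustration rather than the formatter's actual code: TermClassMap and wrap_term are hypothetical names.

```python
from html import escape

class TermClassMap:
    """Assign each distinct term a class name, recycling after `maxclasses`.
    Illustrative sketch only."""

    def __init__(self, termclass="term", maxclasses=5):
        self.termclass = termclass
        self.maxclasses = maxclasses
        self.seen = {}  # term -> class name, kept stable across excerpts

    def class_for(self, term):
        if term not in self.seen:
            number = len(self.seen) % self.maxclasses
            self.seen[term] = "%s%d" % (self.termclass, number)
        return self.seen[term]

    def clear(self):
        # Forget all terms, e.g. before formatting results of a new search.
        self.seen = {}

def wrap_term(text, term, mapping, tagname="strong", classname="match"):
    """Wrap one matched piece of text in a tag carrying both class names."""
    return '<%s class="%s %s">%s</%s>' % (
        tagname, classname, mapping.class_for(term), escape(text), tagname)

mapping = TermClassMap(maxclasses=3)
classes = [mapping.class_for(t) for t in ["a", "b", "c", "d", "e"]]
```

With maxclasses set to 3, five terms receive term0, term1, term2, term0, term1, so only three CSS rules are needed. Asking again for an already-seen term returns the same class until clear() is called.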
class whoosh.highlight.GenshiFormatter(qname='strong', between='...')

Returns a Genshi event stream containing HTML formatting around the matched terms.

Parameters:
  • qname – the QName for the tag to wrap around matched terms.
  • between – the text to add between fragments.

Utility classes

class whoosh.highlight.Fragment(tokens, charsbefore=0, charsafter=0, textlen=999999)

Represents a fragment (extract) from a hit document. This object is mainly used to keep track of the start and end points of the fragment; it does not contain the text of the fragment or do much else.

Parameters:
  • tokens – list of the Token objects in the fragment.
  • charsbefore – approx. how many characters before the start of the first matched term to include in the fragment.
  • charsafter – approx. how many characters after the end of the last matched term to include in the fragment.
  • textlen – length in characters of the document text.
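One plausible reading of how charsbefore, charsafter, and textlen determine a fragment's start and end points is sketched below. The function fragment_bounds and its first_start and last_end arguments are hypothetical; they stand for the start offset of the first token and the end offset of the last token in the fragment.

```python
def fragment_bounds(first_start, last_end, charsbefore=0, charsafter=0,
                    textlen=999999):
    """Compute (startchar, endchar) for a fragment, clamped to the document.

    Illustrative sketch only, assuming simple padding and clamping.
    """
    startchar = max(0, first_start - charsbefore)   # never before the text start
    endchar = min(textlen, last_end + charsafter)   # never past the text end
    return startchar, endchar

clamped = fragment_bounds(5, 10, charsbefore=20, charsafter=20, textlen=25)
padded = fragment_bounds(100, 120, charsbefore=20, charsafter=20)
```

The large default for textlen simply means "no upper clamp" when the document length is not supplied.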
