analysis module

Classes and functions for turning a piece of text into an indexable stream of “tokens” (usually equivalent to words). There are three general types of classes/functions involved in analysis:

  • Tokenizers are always at the start of the text processing pipeline. They take a string and yield Token objects (actually, the same token object over and over, for performance reasons) corresponding to the tokens (words) in the text.

    Every tokenizer is a callable that takes a string and returns a generator of tokens.

  • Filters take the tokens from the tokenizer and perform various transformations on them. For example, the LowercaseFilter converts all tokens to lowercase, which is usually necessary when indexing regular English text.

    Every filter is a callable that takes a token generator and returns a token generator.

  • Analyzers are convenience functions/classes that “package up” a tokenizer and zero or more filters into a single unit, so you don’t have to construct the tokenizer-filter-filter-etc. pipeline yourself. For example, the StandardAnalyzer combines a RegexTokenizer, LowercaseFilter, and StopFilter.

    Every analyzer is a callable that takes a string and returns a token generator. (So Tokenizers can be used as Analyzers if you don’t need any filtering).

You can implement an analyzer as a custom class or function, or compose tokenizers and filters together using the | character:

my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter()

The first item must be a tokenizer and the rest must be filters (you can’t put a filter first or a tokenizer after the first item).
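
Once composed, the pipeline is itself an analyzer: call it with a unicode string and iterate over the tokens. A minimal sketch (the exact output depends on StopFilter's default stop word list, documented under StopFilter below):

>>> my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter()
>>> [token.text for token in my_analyzer(u"This is a TEST")]
[u"test"]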

Analyzers

class whoosh.analysis.Analyzer
Abstract base class for analyzers. Since the analyzer protocol is just __call__, this is pretty simple – it mostly exists to provide common implementations of __repr__ and __eq__.
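
For example, a custom analyzer can be a small subclass that only implements __call__ (a hypothetical sketch; the class name is invented, and the pipeline simply reuses components documented below):

from whoosh.analysis import Analyzer, RegexTokenizer, LowercaseFilter

class LowercasingAnalyzer(Analyzer):
    # Hypothetical example: hides a tokenizer/filter pipeline behind a
    # single callable, which is all the analyzer protocol requires.
    def __init__(self):
        self.pipeline = RegexTokenizer() | LowercaseFilter()

    def __call__(self, value, **kwargs):
        return self.pipeline(value, **kwargs)
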
class whoosh.analysis.IDAnalyzer
Deprecated, just use an IDTokenizer directly, with a LowercaseFilter if desired.
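
The suggested replacement, written as a pipeline (a sketch; the output assumes the IDTokenizer behavior shown below):

>>> ana = IDTokenizer() | LowercaseFilter()
>>> [token.text for token in ana(u"/a/b 123 Alpha")]
[u"/a/b 123 alpha"]
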
class whoosh.analysis.KeywordAnalyzer

Parses space-separated tokens.

>>> ana = KeywordAnalyzer()
>>> [token.text for token in ana(u"Hello there, this is a TEST")]
[u"Hello", u"there,", u"this", u"is", u"a", u"TEST"]
Parameters:
  • lowercase – whether to lowercase the tokens.
  • commas – if True, items are separated by commas rather than spaces.
class whoosh.analysis.RegexAnalyzer
Deprecated, just use a RegexTokenizer directly.
class whoosh.analysis.SimpleAnalyzer

Composes a RegexTokenizer with a LowercaseFilter.

>>> ana = SimpleAnalyzer()
>>> [token.text for token in ana(u"Hello there, this is a TEST")]
[u"hello", u"there", u"this", u"is", u"a", u"test"]
Parameters:
  • expression – The regular expression pattern to use to extract tokens.
  • gaps – If True, the tokenizer splits on the expression, rather than matching on the expression.
class whoosh.analysis.StemmingAnalyzer

Composes a RegexTokenizer with a LowercaseFilter, an optional StopFilter, and a StemFilter.

>>> ana = StemmingAnalyzer()
>>> [token.text for token in ana(u"Testing is testing and testing")]
[u"test", u"test", u"test"]
Parameters:
  • expression – The regular expression pattern to use to extract tokens.
  • stoplist – A list of stop words. Set this to None to disable the stop word filter.
  • minsize – Words smaller than this are removed from the stream.
  • gaps – If True, the tokenizer splits on the expression, rather than matching on the expression.
class whoosh.analysis.StandardAnalyzer

Composes a RegexTokenizer with a LowercaseFilter and optional StopFilter.

>>> ana = StandardAnalyzer()
>>> [token.text for token in ana(u"Testing is testing and testing")]
[u"testing", u"testing", u"testing"]
Parameters:
  • expression – The regular expression pattern to use to extract tokens.
  • stoplist – A list of stop words. Set this to None to disable the stop word filter.
  • minsize – Words smaller than this are removed from the stream.
  • gaps – If True, the tokenizer splits on the expression, rather than matching on the expression.
class whoosh.analysis.FancyAnalyzer

Composes a RegexTokenizer with a CamelFilter, UnderscoreFilter, LowercaseFilter, and StopFilter.

>>> ana = FancyAnalyzer()
>>> [token.text for token in ana(u"Should I call getInt or get_real?")]
[u"should", u"call", u"getInt", u"get", u"int", u"get_real", u"get", u"real"]
Parameters:
  • expression – The regular expression pattern to use to extract tokens.
  • stoplist – A list of stop words. Set this to None to disable the stop word filter.
  • minsize – Words smaller than this are removed from the stream.
  • gaps – If True, the tokenizer splits on the expression, rather than matching on the expression.
class whoosh.analysis.NgramAnalyzer

Composes an NgramTokenizer and a LowercaseFilter.

>>> ana = NgramAnalyzer(4)
>>> [token.text for token in ana(u"hi there")]
[u"hi t", u"i th", u" the", u"ther", u"here"]

Tokenizers

class whoosh.analysis.IDTokenizer

Yields the entire input string as a single token. For use in indexed but untokenized fields, such as a document’s path.

>>> idt = IDTokenizer()
>>> [token.text for token in idt(u"/a/b 123 alpha")]
[u"/a/b 123 alpha"]
class whoosh.analysis.RegexTokenizer(expression='\w+(\.?\w+)*', gaps=False)

Uses a regular expression to extract tokens from text.

>>> rex = RegexTokenizer()
>>> [token.text for token in rex(u"hi there 3.141 big-time under_score")]
[u"hi", u"there", u"3.141", u"big", u"time", u"under_score"]
Parameters:
  • expression – A regular expression object or string. Each match of the expression equals a token. Group 0 (the entire matched text) is used as the text of the token. If you require more complicated handling of the expression match, simply write your own tokenizer.
  • gaps – If True, the tokenizer splits on the expression, rather than matching on the expression.
class whoosh.analysis.CharsetTokenizer(charmap)

Tokenizes and translates text according to a character mapping object. Characters that map to None are considered token break characters. For all other characters the map is used to translate the character. This is useful for case and accent folding.

This tokenizer loops character-by-character and so will likely be much slower than RegexTokenizer.

One way to get a character mapping object is to convert a Sphinx charset table file using whoosh.support.charset.charset_table_to_dict().

>>> from whoosh.support.charset import charset_table_to_dict, default_charset
>>> charmap = charset_table_to_dict(default_charset)
>>> chtokenizer = CharsetTokenizer(charmap)
>>> [t.text for t in chtokenizer(u'Stra\xdfe ABC')]
[u'strase', u'abc']

The Sphinx charset table format is described at http://www.sphinxsearch.com/docs/current.html#conf-charset-table.

Parameter: charmap – a mapping from integer character numbers to unicode characters, as used by the unicode.translate() method.
class whoosh.analysis.SpaceSeparatedTokenizer

Returns a RegexTokenizer that splits tokens by whitespace.

>>> sst = SpaceSeparatedTokenizer()
>>> [token.text for token in sst(u"hi there big-time, what's up")]
[u"hi", u"there", u"big-time,", u"what's", u"up"]
class whoosh.analysis.CommaSeparatedTokenizer

Splits tokens by commas.

Note that the tokenizer calls unicode.strip() on each match of the regular expression.

>>> cst = CommaSeparatedTokenizer()
>>> [token.text for token in cst(u"hi there, what's , up")]
[u"hi there", u"what's", u"up"]
class whoosh.analysis.NgramTokenizer(minsize, maxsize=None)

Splits input text into N-grams instead of words.

>>> ngt = NgramTokenizer(4)
>>> [token.text for token in ngt(u"hi there")]
[u"hi t", u"i th", u" the", u"ther", u"here"]

Note that this tokenizer does NOT use a regular expression to extract words, so the grams emitted by it will contain whitespace, punctuation, etc. You may want to massage the input or add a custom filter to this tokenizer’s output.

Alternatively, if you only want sub-word grams without whitespace, you could combine a RegexTokenizer with NgramFilter instead.

Parameters:
  • minsize – The minimum size of the N-grams.
  • maxsize – The maximum size of the N-grams. If you omit this parameter, maxsize == minsize.
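
The alternative mentioned above (sub-word grams without whitespace) can be sketched as a pipeline; the expected output assumes NgramFilter, documented below, emits no grams for words shorter than minsize:

>>> ana = RegexTokenizer() | NgramFilter(4)
>>> [token.text for token in ana(u"hi there")]
[u"ther", u"here"]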

Filters

class whoosh.analysis.PassFilter
An identity filter: passes the tokens through untouched.
class whoosh.analysis.LowercaseFilter

Uses unicode.lower() to lowercase token text.

>>> rext = RegexTokenizer()
>>> stream = rext(u"This is a TEST")
>>> [token.text for token in LowercaseFilter(stream)]
[u"this", u"is", u"a", u"test"]
class whoosh.analysis.UnderscoreFilter

Splits words with underscores into multiple words. This filter is deprecated, use IntraWordFilter instead.

>>> rext = RegexTokenizer()
>>> stream = rext(u"call get_processed_token")
>>> [token.text for token in UnderscoreFilter(stream)]
[u"call", u"get_processed_token", u"get", u"processed", u"token"]

Obviously you should not split words on underscores in the tokenizer if you want to use this filter.

class whoosh.analysis.CharsetFilter(charmap)

Translates the text of tokens by calling unicode.translate() using the supplied character mapping object. This is useful for case and accent folding.

One way to get a character mapping object is to convert a Sphinx charset table file using whoosh.support.charset.charset_table_to_dict().

>>> from whoosh.support.charset import charset_table_to_dict, default_charset
>>> retokenizer = RegexTokenizer()
>>> charmap = charset_table_to_dict(default_charset)
>>> chfilter = CharsetFilter(charmap)
>>> [t.text for t in chfilter(retokenizer(u'Stra\xdfe'))]
[u'strase']

The Sphinx charset table format is described at http://www.sphinxsearch.com/docs/current.html#conf-charset-table.

Parameter: charmap – a mapping from integer character numbers to unicode characters, as required by the unicode.translate() method.
class whoosh.analysis.StopFilter(stoplist=frozenset(['and', 'is', 'it', 'an', 'as', 'are', 'in', 'yet', 'if', 'from', 'for', 'when', 'tbd', 'to', 'you', 'be', 'we', 'that', 'may', 'not', 'with', 'by', 'a', 'on', 'this', 'of', 'us', 'will', 'can', 'the', 'or']), minsize=2, renumber=True)

Marks “stop” words (words too common to index) in the stream (and by default removes them).

>>> rext = RegexTokenizer()
>>> stream = rext(u"this is a test")
>>> stopper = StopFilter()
>>> [token.text for token in stopper(stream)]
[u"this", u"test"]
Parameters:
  • stoplist – A collection of words to remove from the stream. This is converted to a frozenset. The default is a list of common stop words.
  • minsize – The minimum length of token texts. Tokens with text smaller than this will be stopped.
  • renumber – Change the ‘pos’ attribute of unstopped tokens to reflect their position with the stopped words removed.
  • remove – Whether to remove the stopped words from the stream entirely. This is not normally necessary, since the indexing code will ignore tokens it receives with stopped=True.
class whoosh.analysis.StemFilter(stemfn=<function stem>, ignore=None)

Stems (removes suffixes from) the text of tokens using the Porter stemming algorithm. Stemming attempts to reduce multiple forms of the same root word (for example, “rendering”, “renders”, “rendered”, etc.) to a single word in the index.

>>> rext = RegexTokenizer()
>>> stream = rext(u"fundamentally willows")
>>> stemmer = StemFilter()
>>> [token.text for token in stemmer(stream)]
[u"fundament", u"willow"]
Parameters:
  • stemfn – the function to use for stemming.
  • ignore – a set/list of words that should not be stemmed. This is converted into a frozenset. If you omit this argument, all tokens are stemmed.
class whoosh.analysis.CamelFilter

Splits CamelCased words into multiple words. This filter is deprecated, use IntraWordFilter instead.

>>> rext = RegexTokenizer()
>>> stream = rext(u"call getProcessedToken")
>>> [token.text for token in CamelFilter(stream)]
[u"call", u"getProcessedToken", u"get", u"Processed", u"Token"]

Obviously this filter needs to precede LowercaseFilter if they are both in a filter chain.
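
For example, a sketch of such a chain (the expected output assumes the CamelFilter behavior shown above, followed by lowercasing):

>>> ana = RegexTokenizer() | CamelFilter() | LowercaseFilter()
>>> [token.text for token in ana(u"call getProcessedToken")]
[u"call", u"getprocessedtoken", u"get", u"processed", u"token"]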

class whoosh.analysis.NgramFilter(minsize, maxsize=None)

Splits token text into N-grams.

>>> rext = RegexTokenizer()
>>> stream = rext(u"hello there")
>>> ngf = NgramFilter(4)
>>> [token.text for token in ngf(stream)]
[u"hell", u"ello", u"ther", u"here"]
Parameters:
  • minsize – The minimum size of the N-grams.
  • maxsize – The maximum size of the N-grams. If you omit this parameter, maxsize == minsize.

Token classes and functions

class whoosh.analysis.Token(positions=False, chars=False, boosts=False, removestops=True, mode='', **kwargs)

Represents a “token” (usually a word) extracted from the source text being indexed.

See “Advanced analysis” in the user guide for more information.

Because object instantiation in Python is slow, tokenizers should create ONE SINGLE Token object and YIELD IT OVER AND OVER, changing the attributes each time.

This trick means that consumers of tokens (i.e. filters) must never try to hold onto the token object between loop iterations, or convert the token generator into a list. Instead, save the attributes between iterations, not the object:

def RemoveDuplicatesFilter(stream):
    # Removes duplicate words.
    lasttext = None
    for token in stream:
        # Only yield the token if its text doesn't
        # match the previous token.
        if lasttext != token.text:
            yield token
        lasttext = token.text

...or, call token.copy() to get a copy of the token object.
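
On the tokenizer side, the single-token pattern looks roughly like this (a hypothetical sketch, not part of the library; real Whoosh tokenizers accept additional keyword arguments such as the mode string described below):

from whoosh.analysis import Token

def space_tokenizer(value, positions=False, chars=False, **kwargs):
    # Create ONE Token up front and mutate it for every word (see above).
    t = Token(positions=positions, chars=chars, **kwargs)
    pos = 0
    current = 0
    for word in value.split():
        start = value.index(word, current)
        t.text = word
        if positions:
            t.pos = pos
            pos += 1
        if chars:
            t.startchar = start
            t.endchar = start + len(word)
        current = start + len(word)
        yield t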

Parameters:
  • positions – Whether tokens should have the token position in the ‘pos’ attribute.
  • chars – Whether tokens should have character offsets in the ‘startchar’ and ‘endchar’ attributes.
  • boosts – whether the tokens should have per-token boosts in the ‘boost’ attribute.
  • removestops – whether to remove stop words from the stream (if the tokens pass through a stop filter).
  • mode – contains a string describing the purpose for which the analyzer is being called, i.e. ‘index’ or ‘query’.
whoosh.analysis.unstopped(tokenstream)
Removes tokens from a token stream where token.stopped = True.
