N-Gram-Based Text Categorization: Categorizing Text With Python

Introduction

As we saw in the last post, it's really easy to detect the language of a text using an analysis of stopwords. Another way to detect language - one that also works when syntax rules are not being followed, and that is useful for identifying the topic of the text and not just its language - is N-Gram-Based Text Categorization, as William B. Cavnar and John M. Trenkle described in 1994. So I decided to mess around a bit and wrote ngrambased-textcategorizer in Python as a proof of concept.

How N-Gram-Based Text Categorization works

To perform N-Gram-Based Text Categorization we need to compute the N-grams (with N=1 to 5) of each word - apostrophes included - found in the text, padding with blanks. For the word "TEXT" that gives, for example:

  • bi-grams: _T, TE, EX, XT, T_
  • tri-grams: _TE, TEX, EXT, XT_, T__
  • quad-grams: _TEX, TEXT, EXT_, XT__, T___

When every N-gram has been computed we keep just the top 300 or so and save them as a "text category profile" - Cavnar and Trenkle observed that the most frequent 300 or so N-grams are highly correlated with the language, while N-grams from around rank 300 onward become more specific to the subject of the text. To categorize a text we simply follow the same steps and then calculate the "Out-Of-Place" measure of its profile against each pre-computed profile.

Then we choose the nearest category - the one with the lowest distance - among them; a small sketch of that measure follows.
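
The figure from the paper isn't reproduced here, but the idea behind the "Out-Of-Place" measure is simple: for each n-gram in the text's profile, take the difference between its rank there and its rank in the category profile, use a maximum penalty when the n-gram is missing from the category, and sum everything up. Here is a minimal sketch of that calculation (the function name and the exact choice of penalty are mine, not taken verbatim from the paper):

a sketch of the out-of-place measure
def out_of_place_distance(text_profile, category_profile):
    # both profiles are lists of n-grams sorted by frequency, most frequent first
    category_ranks = {ngram: rank for rank, ngram in enumerate(category_profile)}
    max_penalty = len(category_profile)   # penalty for n-grams missing from the category
    distance = 0
    for rank, ngram in enumerate(text_profile):
        if ngram in category_ranks:
            distance += abs(rank - category_ranks[ngram])   # how far out of place it is
        else:
            distance += max_penalty
    return distance

print(out_of_place_distance(['HE', 'TH', 'E '], ['TH', 'HE', 'E ']))   # 2: 'HE' and 'TH' are each one place off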

Implementing it in python

The procedure to create a text category profile is well explained in section "3.1 Generating N-Gram Frequency Profiles" of the paper, and it's really easy to implement in Python with the help of the powerful nltk toolkit.

First step: Split text into tokens (tokenization)

We need to tokenize the text, splitting it into strings of letters and apostrophes only, so we can use nltk's RegexpTokenizer for this:

tokenizing by regular expression
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
>>> tokenizer.tokenize("Le temps est un grand maître, dit-on, le malheur est qu'il tue ses élèves.")
['Le', 'temps', 'est', 'un', 'grand', 'ma\xc3\xaetre', 'dit', 'on', 'le', 'malheur', 'est', "qu'il", 'tue', 'ses', '\xc3\xa9l\xc3\xa8ves']

It's just a proof of concept and should be tuned, but it will be enough for now.

Second step: Generating N-grams for each token

Now it's time to generate the n-grams (with N=1 to 5) using blanks as padding; again, nltk has a function for this:

generating ngrams with nltk
>>> from nltk.util import ngrams
>>> generated_ngrams = ngrams('TEXT', 4, pad_left=True, pad_right=True, pad_symbol=' ')
>>> generated_ngrams[4]
('E', 'X', 'T', ' ')

The ngrams function returns tuples, so we need to join the characters of each n-gram ourselves:

joining the split characters into an n-gram
>>> ''.join(generated_ngrams[4])
'EXT '
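
To build a full profile we need the joined n-grams of every token for every N from 1 to 5, not just the quad-grams of a single word. Since the exact padding arguments of nltk's ngrams function have changed across versions (pad_symbol in older releases versus left_pad_symbol/right_pad_symbol in newer ones), here is a minimal pure-Python sketch of that expansion; the helper name token_ngrams is mine:

generating all n-grams of a token
def token_ngrams(token, min_n=1, max_n=5):
    # return every joined n-gram of a blank-padded token for N = min_n..max_n
    ngrams_list = []
    for n in range(min_n, max_n + 1):
        padded = ' ' * (n - 1) + token + ' ' * (n - 1)   # n-1 blanks on each side
        for i in range(len(padded) - n + 1):
            ngrams_list.append(padded[i:i + n])
    return ngrams_list

print(token_ngrams('TEXT')[4:9])   # the bi-grams: [' T', 'TE', 'EX', 'XT', 'T ']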

Third step: Hashing into a table and counting each N-gram's occurrences

The easiest way to do this is with a Python dictionary, incrementing the count when an n-gram has been seen before and creating a new key otherwise:

counting ngram occurrences
ngrams_statistics = {}

# all_ngrams is the list of every joined n-gram computed for the text
for ngram in all_ngrams:
    if ngram not in ngrams_statistics:
        ngrams_statistics[ngram] = 1       # first occurrence of this n-gram
    else:
        ngrams_statistics[ngram] += 1      # already seen: increment its count
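
As a side note, the collections.Counter class from the standard library builds the same mapping in a single call, if you prefer:

counting with collections.Counter
from collections import Counter

# Counter gives the same {ngram: occurrences} dictionary in one call
ngrams_statistics = Counter([' T', 'TE', 'EX', 'XT', 'T ', 'TE'])
print(ngrams_statistics['TE'])   # 2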

Fourth step: Sort in reverse order by number of occurrences

Finally, we need to sort the dictionary we just created in reverse order by number of occurrences, keeping just the top 300 most frequent n-grams. Python dicts can't be sorted, so we transform the dictionary into a sorted list, which we can easily do with the operator module:

converting a dictionary to a sorted list
>>> import operator
>>> ngrams_statistics_sorted = sorted(ngrams_statistics.items(),
...                                   key=operator.itemgetter(1),
...                                   reverse=True)[0:300]
>>> ngrams_statistics_sorted[80:90]
[('y', 23994), ('  p', 23941), (' p', 23941), ('   p', 23941), ('    p', 23941), ('b', 23809), ('r ', 23783), ('r    ', 23783), ('r   ', 23783), ('r  ', 23783)]

Now all that remains is to save "ngrams_statistics_sorted" to a file as a "text category profile", or to keep just the n-grams without their occurrence counts when comparing against other profiles.
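
One possible way to persist a profile - not necessarily what ngrambased-textcategorizer itself does - is to write just the n-grams, one per line and in rank order, taking care to keep the blank padding inside each n-gram when reading the file back:

saving and loading a text category profile
def save_profile(ngrams_statistics_sorted, filename):
    # write only the n-grams, most frequent first, one per line
    with open(filename, 'w') as profile_file:
        for ngram, occurrences in ngrams_statistics_sorted:
            profile_file.write(ngram + '\n')

def load_profile(filename):
    # strip only the newline so the blank padding inside n-grams survives
    with open(filename) as profile_file:
        return [line.rstrip('\n') for line in profile_file]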

Comparing profiles

To categorize a text we first need to load the pre-computed category profiles into a list/dict or something similar and, once loaded, walk through them and calculate the distance of the text against each previously computed profile:

computing ratios for each pre-computed profile
>>> # _calculate_ngram_occurrences and _compare_ngram_frequency_profiles are helper
>>> # methods of the categorizer class; _languages_statistics maps each language
>>> # to its pre-computed profile
>>> languages_ratios = {}
>>> text_ngram_statistics = self._calculate_ngram_occurrences(raw_text)
>>> for language, ngrams_statistics in self._languages_statistics.iteritems():
...    distance = self._compare_ngram_frequency_profiles(ngrams_statistics, text_ngram_statistics)
...    languages_ratios.update({language:distance})
>>>
>>> languages_ratios
{'spanish': 39960, 'french': 40398, 'english': 40418}
>>> # extracting the one with the lowest distance
>>> min(languages_ratios, key=languages_ratios.get)
'spanish'

Conclusions

N-Gram-Based text categorization is probably not the state of the art in text categorization - the technique dates back to 1994 and is pretty simple compared with newer ways of categorizing text - but it can be useful in some situations and as a basis to build upon. And, what the heck, I learned something doing it and had a great time, so it was totally worth it to me ;)

See you soon!
