Most of us are used to Internet search engines and social networks being able to show only data in a certain language, for example, showing only results written in Spanish or English. To achieve that, the indexed text must have been analyzed beforehand to “guess” the language and store it alongside the content.
There are several ways to do that; probably the easiest is a stopwords-based approach. The term “stopword” is used in natural language processing to refer to words that should be filtered out of text before doing any kind of processing, commonly because these words add little or no useful information when analyzing text.
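As a quick illustration of the idea (a minimal, hand-rolled sketch; the english_stopwords set below is a made-up sample just for this example, the real lists will come from NLTK later in the post), filtering stopwords out of a token list is just a membership test against a known set:

>>> english_stopwords = {'the', 'of', 'is', 'a', 'and'}  # tiny hand-made sample, for illustration only
>>> tokens = ['the', 'path', 'of', 'the', 'righteous', 'man']
>>> [token for token in tokens if token not in english_stopwords]
['path', 'righteous', 'man']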
How do we do that?
OK, so we have a text whose language we want to detect based on the stopwords used in it. The first step is to “tokenize” it - convert the given text into a list of “words” or “tokens” - using one approach or another depending on our requirements: should we keep contractions, or should we split them? Do we need punctuation, or do we want to split it off? And so on.
In this case we are going to split all punctuation into separate tokens:
nltk “wordpunct_tokenize” tokenizer
>>> from nltk import wordpunct_tokenize
>>> wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
['That', "'", 's', 'thirty', 'minutes', 'away', '.', 'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']
As shown, the famous quote from Mr. Wolf has been split up and now we have “clean” words to match against a stopwords list.
At this point we need stopwords for several languages, and this is where NLTK comes in handy.
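A minimal sketch of how to access its stopwords corpus (assuming the corpus has already been downloaded, e.g. with nltk.download('stopwords'); the exact contents depend on your NLTK version):

>>> from nltk.corpus import stopwords
>>> stopwords.fileids()                # languages shipped with the corpus, e.g. 'english', 'spanish', 'french', ...
>>> stopwords.words('english')[:10]    # a sample of the English stopwords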
Now we need to compute a score for each language depending on which stopwords are used:
calculate languages ratios
>>> languages_ratios = {}
>>>
>>> tokens = wordpunct_tokenize(text)
>>> words = [word.lower() for word in tokens]
>>> for language in stopwords.fileids():
...     stopwords_set = set(stopwords.words(language))
...     words_set = set(words)
...     common_elements = words_set.intersection(stopwords_set)
...
...     languages_ratios[language] = len(common_elements)  # language "score"
>>>
>>> languages_ratios
{'swedish': 1, 'danish': 1, 'hungarian': 2, 'finnish': 0, 'portuguese': 0, 'german': 1, 'dutch': 1, 'french': 1, 'spanish': 0, 'norwegian': 1, 'english': 6, 'russian': 0, 'turkish': 0, 'italian': 2}
First we tokenize using the wordpunct_tokenize function and lowercase all the split tokens, then we walk through the languages included in NLTK and count how many unique stopwords appear in the analyzed text, storing the counts in the “languages_ratios” dictionary.
Finally, we only have to get the “key” with the biggest “value”.
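With the scores computed above, that boils down to a max() over the dictionary using its values as the comparison key; in this example it picks 'english', which scored 6:

>>> most_rated_language = max(languages_ratios, key=languages_ratios.get)
>>> most_rated_language
'english'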
So yes, it seems this approach works fine with well-written texts - those that respect grammatical rules - (and not too short ones) and is really easy to implement.
Putting it all together
If we put everything explained above into a script, we have something like this:
#!/usr/bin/env python
# coding: utf-8
# Author: Alejandro Nolla - z0mbiehunt3r
# Purpose: Example for detecting language using a stopwords based approach
# Created: 15/05/13

import sys

try:
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords
except ImportError:
    print '[!] You need to install nltk (http://nltk.org/index.html)'


#----------------------------------------------------------------------
def _calculate_languages_ratios(text):
    """
    Calculate probability of given text to be written in several languages and
    return a dictionary that looks like {'french': 2, 'spanish': 4, 'english': 0}

    @param text: Text whose language want to be detected
    @type text: str

    @return: Dictionary with languages and unique stopwords seen in analyzed text
    @rtype: dict
    """

    languages_ratios = {}

    '''
    nltk.wordpunct_tokenize() splits all punctuations into separate tokens

    >>> wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
    ['That', "'", 's', 'thirty', 'minutes', 'away', '.', 'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']
    '''

    tokens = wordpunct_tokenize(text)
    words = [word.lower() for word in tokens]

    # Compute per language included in nltk number of unique stopwords appearing in analyzed text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)

        languages_ratios[language] = len(common_elements)  # language "score"

    return languages_ratios


#----------------------------------------------------------------------
def detect_language(text):
    """
    Calculate probability of given text to be written in several languages and
    return the highest scored.

    It uses a stopwords based approach, counting how many unique stopwords
    are seen in analyzed text.

    @param text: Text whose language want to be detected
    @type text: str

    @return: Most scored language guessed
    @rtype: str
    """

    ratios = _calculate_languages_ratios(text)

    most_rated_language = max(ratios, key=ratios.get)

    return most_rated_language


if __name__ == '__main__':

    text = '''
    There's a passage I got memorized. Ezekiel 25:17. "The path of the righteous man is beset on all sides\
    by the inequities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity\
    and good will, shepherds the weak through the valley of the darkness, for he is truly his brother's keeper\
    and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger\
    those who attempt to poison and destroy My brothers. And you will know I am the Lord when I lay My vengeance\
    upon you." Now... I been sayin' that shit for years. And if you ever heard it, that meant your ass. You'd\
    be dead right now. I never gave much thought to what it meant. I just thought it was a cold-blooded thing\
    to say to a motherfucker before I popped a cap in his ass. But I saw some shit this mornin' made me think\
    twice. See, now I'm thinking: maybe it means you're the evil man. And I'm the righteous man. And Mr.\
    9mm here... he's the shepherd protecting my righteous ass in the valley of darkness. Or it could mean\
    you're the righteous man and I'm the shepherd and it's the world that's evil and selfish. And I'd like\
    that. But that shit ain't the truth. The truth is you're the weak. And I'm the tyranny of evil men.\
    But I'm tryin', Ringo. I'm tryin' real hard to be the shepherd.
    '''

    language = detect_language(text)

    print language
There are other ways to “guess” the language of a given text, like N-Gram-Based text categorization, so we will probably see that in the next post.
See you soon and, as always, I hope you found it interesting and useful!