Missing :ignoreaccent in Perl 5

So I'm working with natural language processing, and one of the steps in that is tokenization. For those not familiar with the term, it means taking an "expression" and splitting it into its several "parts". In SQL, for instance, SELECT * FROM foo WHERE bar='baz' is composed of the tokens 'SELECT', '*', 'FROM', 'foo', 'WHERE', 'bar', '=' and "'baz'". That is not very surprising. A very naive look would say that I could just split on word boundaries, but that's not the case, since quoted strings can include spaces, and can include escape characters as well. Still, it's a very simple and very well documented syntax, so tokenization is fairly simple. In fact, you can look at SQL::Tokenizer, from fellow Brazilian monk izut.
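
To make that concrete, here is a minimal sketch of such a tokenizer in plain Perl. It is not how SQL::Tokenizer is implemented; tokenize_sql is a hypothetical helper made up for this illustration, and it only covers the cases mentioned above (quoted strings that may contain spaces, backslash escapes and doubled quotes).

```perl
use strict;
use warnings;

# Hypothetical helper (not the SQL::Tokenizer API): split a small SQL
# statement into tokens, keeping each quoted string together as one token.
sub tokenize_sql {
    my ($sql) = @_;
    my @tokens = $sql =~ m{
        (
            '(?:[^'\\]|\\.|'')*'    # single-quoted string, may hold spaces and escapes
          | "(?:[^"\\]|\\.|"")*"    # double-quoted string or identifier
          | \w+                     # keyword, identifier or number
          | [^\s\w]                 # any other single character, such as an operator
        )
    }gxs;
    return @tokens;
}

my @tokens = tokenize_sql(q{SELECT * FROM foo WHERE bar='baz qux'});
print join(' | ', @tokens), "\n";
```

Splitting the same statement on whitespace would break 'baz qux' in two, which is exactly the problem with the naive approach.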

But unlike SQL, natural languages are anything but simple and well documented, and there are no conformance requirements, because the text is intended for another human being to parse, not a machine. During this project I realized I was a bit behind on the literature: I was still using Chomsky as my reference, while the linguistics field has already moved away from that. It now accepts that grammars "statistically emerge" from language use, and that a "top-down" approach, which is to define a general grammar model and use it to classify the text, is not that useful in the long term. Language use evolves too fast, and its variability across media and social environments creates conflicting definitions of the grammar. Any resemblance to Perl 6's ability to have custom grammars change the way the code is parsed is no mere coincidence.

So the problem with tokenizing natural language texts is, at first, the locale problem. In German, "ß" would be normalized to "ss", while in Greek it would be "normalized" to "b", and in other languages it should probably be ignored as noise (yes, people often use foreign characters as bullet markers or other decorations). Of course I could do a two-phase tokenization, first a naive pass to discover the language, then a second pass presuming the locale, but I decided to do something allegedly more clever: ignore the accents when removing the non-important characters, so I can use a simple \W match to remove the non-word characters.

The thing I miss from Perl 6 is that there you can just use the :ignoreaccent modifier, so the match is already made against the base character. What is a simple regex modifier in Perl 6 needs to be done in Perl 5 as:


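use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use 5.010;                         # enables say
use Encode qw(encode decode);
use Text::Unaccent;

binmode STDOUT, ':encoding(UTF-8)';

my $str = 'têmó... åçèñŧos!!!';
# unac_string works on encoded bytes: encode on the way in, decode on the way out
my $unac = decode('utf8', unac_string('utf8', encode('utf8', $str)));
my $d = $unac;
# 4-argument substr blanks the character at the match offset in $d and returns
# the character it replaced, so $words keeps its punctuation while $d ends up
# with a space in place of every \W character
(my $words = $unac) =~ s/(\W)/substr($d,$-[0],1,' ')/ge;

say $str;    # original text
say $unac;   # accents removed
say $words;  # same as $unac
say $d;      # non-word characters replaced by spaces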

And with that code, I have a much easier task when tokenizing natural languages...
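
If Text::Unaccent is not available, roughly the same idea can be sketched with the core Unicode::Normalize module: decompose the text, drop the combining marks, and only then apply the \W match. This is a swapped-in technique, not what the code above does, and strip_accents and words_ignoring_accents are hypothetical names made up for this illustration. Characters such as "ß" or "ŧ" have no canonical decomposition, so this sketch leaves them as they are.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonical decomposition, then drop the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;    # nonspacing marks: accents, cedillas, rings
    return $decomposed;
}

# Hypothetical helper: unaccent, blank every non-word character, split on spaces.
sub words_ignoring_accents {
    my ($text) = @_;
    (my $clean = strip_accents($text)) =~ s/\W/ /g;
    return split ' ', $clean;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(',', words_ignoring_accents('têmó... åçèños!!!')), "\n";
```

Since each accented character here decomposes to a single base character plus marks, the unaccented string keeps one character per original character, which helps if offsets need to be mapped back to the source text.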

About this Entry

This page contains a single entry by Daniel Ruoso published on August 30, 2009 1:46 PM.
