Daniel Ruoso

30.08.2009  Missing :ignoreaccent in Perl 5

So I'm working with natural language processing, and one of the steps in that is tokenization. For those not familiar with the term, it means taking an "expression" and splitting it into its several "parts". In SQL, for instance, SELECT * FROM foo WHERE bar='baz' is composed of the tokens 'SELECT', '*', 'FROM', 'foo', 'WHERE', 'bar', '=', "'baz'". This is not very surprising. A very naive look would say that I could just split on word boundaries, but that's not the case, since quoted strings can contain spaces and can also contain escape characters. Still, it's a very simple and very well documented syntax, so tokenization is fairly simple. In fact, you can look at SQL::Tokenizer from fellow Brazilian monk izut.
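To make the point about quotes concrete, here is a minimal sketch of a tokenizer that keeps a quoted string together as one token. The tokenize_sql helper is a made-up name for illustration. It is not how SQL::Tokenizer is implemented, and it only handles a small slice of SQL.

```perl
use strict;
use warnings;

# tokenize_sql is a hypothetical helper for illustration only;
# it is not the SQL::Tokenizer API and covers just a slice of SQL.
sub tokenize_sql {
    my ($sql) = @_;
    my @tokens;
    while ($sql =~ m{
        \G \s*                            # resume where the last token ended
        (
            ' (?: [^'\\] | \\. | '' )* '  # single-quoted string, escapes allowed
          | " (?: [^"\\] | \\. | "" )* "  # double-quoted string or identifier
          | \w+                           # keyword, identifier or number
          | \S                            # any other single character
        )
    }gcxs) {
        push @tokens, $1;
    }
    return @tokens;
}

print join(' | ', tokenize_sql(q{SELECT * FROM foo WHERE bar='baz qux'})), "\n";
# prints: SELECT | * | FROM | foo | WHERE | bar | = | 'baz qux'
```

Splitting on word boundaries would have broken 'baz qux' in two, while trying the quoted alternatives first keeps the space inside the token.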

But unlike SQL, natural languages are anything but simple and well documented, and there are no conformance requirements, because the text is intended for another human being to parse, not a machine. During this project I realized I was a bit behind on the literature, since I was still using Chomsky as my reference, and the linguistics field has already moved away from that. It now accepts that grammars "statistically emerge" from language use, and that a "top-down" approach, which is to define a general grammar model and use it to classify the text, is not that useful in the long term: language use evolves too fast, and its variability across media and social environments creates conflicting definitions of the grammar. Any resemblance to Perl 6's ability to have custom grammars change the way the code is parsed is not mere coincidence.

So, the first problem with tokenizing natural language texts is the locale. In German, "ß" would be normalized to "ss", while in Greek it would be "normalized" to "b", and in other languages it should probably be ignored as noise (yes, people often use foreign characters as bullet markers or other decorators). Of course I could do a two-phase tokenization, first doing a naive tokenization to discover the language, then a second pass presuming the locale, but I decided to do something allegedly more clever: ignore the accents when trying to remove the non-important characters, so I can use a simple \W match to remove the non-word characters.

The thing I miss from Perl 6 is that there you could just use the :ignoreaccent modifier, so the match would already be made against the base characters. What is a simple regex modifier in Perl 6 needs to be done in Perl 5 as:

use strict;
use warnings;
use Text::Unaccent;
use utf8;
use Encode;
use 5.10.0;

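my $str = 'têmó... åçèñŧos!!!';
# unac_string works on encoded bytes, so encode the character string first
my $unac = unac_string('utf8',encode('utf8',$str));
my $d = $unac;
# 4-arg substr blanks each non-word character in $d and returns the original
# character, so $words stays equal to $unac while $d gets the cleaned-up text
(my $words = $unac) =~ s/(\W)/substr($d,$-[0],1,' ')/ge;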

say $str;
say $unac;
say $words;
say $d;

And with that code, I have a much easier task when tokenizing natural languages...
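For comparison, here is a rough sketch of the same idea using only the core Unicode::Normalize module: decompose each character, drop the combining marks, then blank out whatever \W matches. The strip_accents and words_only helpers are hypothetical names, and this is an assumption about an alternative, not what the code above does. Letters such as "ß" or "ŧ" have no canonical decomposition, so this approach leaves them untouched and they would need their own mapping.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: reduce accented letters to their base letters.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # "ê" becomes "e" plus a combining circumflex
    $decomposed =~ s/\p{Mn}//g;     # drop the combining (nonspacing) marks
    return $decomposed;
}

# Hypothetical helper: unaccent, replace non-word characters with spaces,
# and return the remaining words.
sub words_only {
    my ($text) = @_;
    (my $plain = strip_accents($text)) =~ s/\W/ /g;
    return split ' ', $plain;
}

binmode STDOUT, ':encoding(UTF-8)';
print join('|', words_only('têmó... åçèñ!!!')), "\n";
# prints: temo|acen
```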

Posted by ruoso in Perl.   Tags: Perl, snippet, perl6, perl5.

18.08.2009  Transactions and Authorization made simple

So I really like to follow DRY: Don't Repeat Yourself. In the development of Epitafio (a cemetery management system I mentioned earlier), I was working on my model classes - note that this is not a DBIC model, but a regular model that accesses a DBIC schema - and I realized that for every single method of the models I would need to do two things:

  • Enclose code in a transaction, much like:
    $schema->txn_do(sub { ... })
  • Authorize the user against a specific role:
    die 'Access denied!' unless $user->in_role('foo')

So I started asking on #catalyst whether there would be a pretty way of doing it. I was already using Catalyst::Component::InstancePerContext, but mst quickly guided me to avoid saving the context itself in the object, and rather to get the values I need from it. Since my app's models will all basically follow this same principle, I wrote a model superclass with:

package Epitafio::Model;
use Moose;
with 'Catalyst::Component::InstancePerContext';
has 'user' => (is => 'rw');
has 'dbic' => (is => 'rw');

sub build_per_context_instance {
  my ($self, $c) = @_;
  $self->new(user => $c->user->obj,
             dbic => $c->model('DB')->schema->restrict_with_object($c->user->obj));
}
1;

Note that I'm still using the C::M::DBIC::Schema as usual, but I'm additionally making a local dbic schema that is restricted according to the logged-in user. Check DBIx::Class::Schema::RestrictWithObject for details on how that works, and mst++ for the tip.

Ok, now my model classes can know which user is logged in (in a Cat-independent way) as well as have access to the main DBIC::Schema used in the application. Now we just need to DRO - Don't Repeat Ourselves.

Following, again, mst++'s tip, I decided against a fancier solution and went with a plain and simple:

txn_method 'foo' => authorize 'rolename' => sub {
   ...
}

For those who didn't get how that is parsed, this could be rewritten as:

txn_method('foo',authorize('rolename',sub { }))

This works as:

  • authorize receives a role name and a code ref and returns a code ref that does the user role checking before invoking the actual code.
  • txn_method receives the method name and a code ref and installs, in the package namespace, a new code ref that wraps the given one in a transaction, as if it were a regular sub definition.

That means you can have a txn_method without authorization, but you would need something like

*foo = authorize 'rolename' => sub { ... };

to get authorization without transaction. But as in my application I'll probably have both most of the time, I thought it should suffice the way it is.

But for the txn_method..authorize thing to parse, both subs need to be in the package namespace at BEGIN time, so to solve that without having to re-type it every time, I wrote a simple Epitafio::ModelUtil module that exports these helpers.

package Epitafio::ModelUtil;
use strict;
use warnings;
use base 'Exporter';

our @EXPORT = qw(txn_method authorize);

# Installs $name in the caller's package as a method that runs $code
# inside a transaction on the model's dbic schema.
sub txn_method {
  my ($name, $code) = @_;
  my $method_name = caller().'::'.$name;
  no strict 'refs';
  *{$method_name} = sub {
    $_[0]->dbic->txn_do($code, @_)
  };
}

# Wraps $code in a check that the model's user is in $role.
sub authorize {
  my ($role, $code) = @_;
  return sub {
    if ($_[0]->user->in_role($role)) {
      $code->(@_);
    } else {
      die 'Access Denied!';
    }
  }
}

1;

And now the code of the model looks just pretty and non-repetitive ;). See the sources for the full version.
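To see the two wrappers cooperate without a Catalyst app or a database, here is a self-contained sketch. Everything in the My:: namespace is a stand-in I made up for illustration: My::FakeSchema only imitates the calling convention of txn_do, My::FakeUser only implements in_role, and the helpers are inlined (with the role-checking helper called authorize, as in the usage shown earlier) so the whole thing runs as one file.

```perl
use strict;
use warnings;

{
    package My::FakeSchema;    # hypothetical stand-in for the DBIC schema
    sub new { bless { log => [] }, shift }
    sub txn_do {
        my ($self, $code, @args) = @_;
        push @{ $self->{log} }, 'begin';
        my $result = eval { $code->(@args) };
        if (my $err = $@) { push @{ $self->{log} }, 'rollback'; die $err }
        push @{ $self->{log} }, 'commit';
        return $result;
    }
}

{
    package My::FakeUser;      # hypothetical stand-in for the logged-in user
    sub new { my ($class, @roles) = @_; bless { roles => { map { ($_ => 1) } @roles } }, $class }
    sub in_role { my ($self, $role) = @_; return exists $self->{roles}{$role} }
}

{
    package My::ModelUtil;     # same shape as the helpers in the post
    sub txn_method {
        my ($name, $code) = @_;
        my $method_name = caller() . '::' . $name;
        no strict 'refs';
        *{$method_name} = sub { $_[0]->dbic->txn_do($code, @_) };
    }
    sub authorize {
        my ($role, $code) = @_;
        return sub {
            die "Access Denied!\n" unless $_[0]->user->in_role($role);
            $code->(@_);
        };
    }
}

{
    package My::Model;
    BEGIN {
        # Done by hand here; Exporter does this for a module in its own file.
        *txn_method = \&My::ModelUtil::txn_method;
        *authorize  = \&My::ModelUtil::authorize;
    }
    sub new  { my ($class, %args) = @_; bless {%args}, $class }
    sub user { $_[0]{user} }
    sub dbic { $_[0]{dbic} }

    txn_method 'create_record' => authorize 'editor' => sub {
        my ($self, $what) = @_;
        return "created $what";
    };
}

my $editor = My::Model->new(user => My::FakeUser->new('editor'), dbic => My::FakeSchema->new);
my $guest  = My::Model->new(user => My::FakeUser->new(),         dbic => My::FakeSchema->new);

print $editor->create_record('plot 42'), "\n";
print eval { $guest->create_record('plot 43'); 1 } ? "allowed\n" : "denied: $@";
```

Because the role check runs inside the transaction, a denied call dies within txn_do, so the fake schema records a rollback rather than a commit.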

Posted by ruoso in Perl.   Tags: epitafio, Perl, perl5, catalyst.

12.08.2009  SMOPP5 first steps

After a long time imagining when this day would come, today Paweł Murias has created a github fork of the perl interpreter so we can start working on the integration of SMOP and perl5.

Some of you might have heard me saying that the major reason for SMOP to exist today is the prospective integration with the perl5 interpreter so we can use Perl 6 at the same time as still being able to use all of CPAN, including the things that depend on XS, like the fantastic Gtk2-Perl suite.

In fact, I've been blocking pmurias on some things, like replacing the refcounting with a tracing GC in SMOP, exactly because that would make SMOP incompatible with perl5, and I really want them to cooperate.

This integration should happen at the deepest level of perlguts, where the perl5 interpreter should play the role of the SMOP interpreter and every SV* is also a SMOP__Object*.

Paweł has added smop/base.h to the p5 repo and I started adding the SMOP__ResponderInterface* member to some p5 values (right now _SV_HEAD, which defines the first members of every SV value, and the PerlInterpreter). This is the first step that will allow SMOP to use P5 objects without the need for a proxy value.

After talking with nothingmuch on #p5p, I decided to note here the first set of goals of the SMOPP5 integration:

  • Making every perl value a SMOP__Object*
  • Implementing Responder Interfaces for each of these values
  • Implementing the SMOP interpreter and continuation class APIs in the perl5 interpreter (using Coro::State for now)
  • Have SMOP objects visible in perl5 using proxy objects as already happens today

This set gives us the SMOP->P5 integration. After that we're going to need the P5->SMOP integration, which should involve hacking on every p5 macro in the core, which is a *lot* of hacking, so I'll not include it in our goals for now, for sanity's sake!

Posted by ruoso in Perl.   Tags: Perl, smopp5, smop, perl6, perl5.

11.08.2009  Far More Than You Ever Wanted To Know About Typeglobs, Closures and Namespaces

These are the slides of a presentation I gave at a tech meeting in Lisbon about 2 years ago. The slide text is in Portuguese, but I'm pretty sure it is understandable for non-Portuguese speakers too.

Far More Than You Ever Wanted To Know About Typeglobs, Closures and Namespaces

Posted by ruoso in Perl.   Tags: Perl, fmtyewtk, slides, perl5.