Daniel Ruoso

30.08.2009  Missing :ignoreaccent in Perl 5

So I'm working with natural language processing, and one of the steps in that is tokenization. For those not familiar with the term, it means taking an "expression" and splitting it into its several "parts". In SQL, for instance, SELECT * FROM foo WHERE bar='baz' is composed of the tokens 'SELECT', '*', 'FROM', 'foo', 'WHERE', 'bar', '=', "'baz'". This is not very surprising. A naive look would say that I could just split on word boundaries, but that's not the case, since quoted strings can contain spaces and escape characters. Still, it's a very simple and very well documented syntax, so tokenization is fairly simple. In fact, you can look at SQL::Tokenizer from fellow Brazilian monk izut.

But unlike SQL, natural languages are anything but simple and well documented, and there are no conformance requirements, because the text is intended to be parsed by another human being, not by a machine. During this project I realized I was a bit behind on the literature: I was still using Chomsky as my reference, and the linguistics field has since moved away from that. It now accepts that grammars "statistically emerge" from language use, and that a "top-down" approach, which defines a general grammar model and uses it to classify the text, is not that useful in the long term. Language use evolves too fast, and its variability across media and social environments creates conflicting definitions of the grammar. Any resemblance to Perl 6's ability to have custom grammars change the way the code is parsed is not mere coincidence.

So the first problem with tokenizing natural language texts is the locale problem. In German "ß" would be normalized to "ss", while in Greek it would be "normalized" to "b", and in other languages it should probably be ignored as noise (yes, people often use foreign characters as bullet markers or other decorations). Of course I could do a two-phase tokenization, first doing a naive pass to discover the language and then a second pass presuming the locale, but I decided to do something allegedly more clever: ignore the accents when removing the unimportant characters, so that a simple \W match removes the non-word characters.

What I miss from Perl 6 is that there you can just use the :ignoreaccent modifier, so the match is made against the base character. What is a simple regex modifier in Perl 6 needs to be done in Perl 5 as:

use strict;
use warnings;
use Text::Unaccent;
use utf8;
use Encode;
use 5.10.0;

my $str = 'têmó... åçèñŧos!!!';
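# Text::Unaccent works on encoded bytes, so encode the character string first
my $unac = unac_string('utf8',encode('utf8',$str));
my $d = $unac;
# 4-arg substr blanks the matched position in $d and returns the character it replaced,
# so $words stays equal to $unac while $d gets a space for every non-word character
(my $words = $unac) =~ s/(\W)/substr($d,$-[0],1,' ')/ge;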

say $str;
say $unac;
say $words;
say $d;

And with that code, I have a much easier task when tokenizing natural languages...
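For readers without Text::Unaccent installed, a rough approximation of the same idea can be sketched with the core Unicode::Normalize module. This is a different technique from the code above, not a drop-in replacement: it decomposes each character and drops the combining marks, so letters with no canonical decomposition (such as "ŧ" or "ß") pass through unchanged. The helper names strip_accents and tokenize are made up for this sketch.

```perl
use strict;
use warnings;
use utf8;                          # the source file itself contains UTF-8 literals
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose characters, then drop the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;    # \p{Mn} = non-spacing (combining) marks
    return $decomposed;
}

# Hypothetical helper: unaccent, turn runs of non-word characters into
# a single space, then split on whitespace.
sub tokenize {
    my ($text) = @_;
    (my $clean = strip_accents($text)) =~ s/\W+/ /g;
    return split ' ', $clean;
}

my @tokens = tokenize('têmó... åçèños!!!');

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @tokens), "\n";
```

Unlike the :ignoreaccent modifier, this rewrites the string before matching, so the original accented text has to be kept separately if it is still needed.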

Posted by ruoso in Perl.   Tags: Perl, snippet, perl6, perl5.

12.08.2009  SMOPP5 first steps

After a long time imagining when this day would come, today Paweł Murias has created a github fork of the perl interpreter so we can start working on the integration of SMOP and perl5.

Some of you might have heard me saying that the major reason for SMOP to exist today is the prospective integration with the perl5 interpreter so we can use Perl 6 at the same time as still being able to use all of CPAN, including the things that depend on XS, like the fantastic Gtk2-Perl suite.

In fact, I've been blocking pmurias on some things, like replacing refcounting with a tracing GC in SMOP, exactly because that would make SMOP incompatible with perl5, and I really want them to cooperate.

This integration should happen at the deepest level of perlguts, where the perl5 interpreter should play the role of the SMOP interpreter and every SV* is also a SMOP__Object*.

Paweł has added smop/base.h to the p5 repo and I started adding the SMOP__ResponderInterface* member to some p5 values (right now _SV_HEAD, which defines the first members of every SV value, and the PerlInterpreter). This is the first step that will allow SMOP to use P5 objects without the need for a proxy value.

After talking with nothingmuch on #p5p, I decided to note here the first set of goals of the SMOPP5 integration:

  • Making every perl value a SMOP__Object*
  • Implementing Responder Interfaces for each of these values
  • Implementing the SMOP interpreter and continuation class APIs in the perl5 interpreter (using Coro::State for now)
  • Have SMOP objects visible in perl5 using proxy objects as already happens today

This set gives us the SMOP->P5 integration. After that we're going to need the P5->SMOP integration, which should involve hacking on every p5 macro in the core. That is a *lot* of hacking, so I won't include it in our goals for now, for sanity's sake!

Posted by ruoso in Perl.   Tags: Perl, smopp5, smop, perl6, perl5.

31.07.2009  Too much Perl 6

So, yesterday I was giving a quick perl workshop using Catalyst. The idea was to write a blog in 3 hours. At some point I wrote the following code:

sort { (stat $_)[10] } glob 'posts/*';

And it didn't work, because Perl 5 doesn't DWIM when the sort routine takes only one parameter: the block is always treated as a comparator over $a and $b. Perl 6 would notice the one-parameter block, call it once per element and sort by the resulting values. Basically, Perl 6 implemented the Schwartzian transform in its core.
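For reference, here is a minimal sketch of what has to be spelled out by hand in Perl 5 to get that behaviour. The helper name sort_by_key is hypothetical; the body is the classic map/sort/map Schwartzian transform, assuming numeric keys.

```perl
use strict;
use warnings;

# Hypothetical helper: sort items by a one-argument key function,
# computing each key only once (Schwartzian transform).
sub sort_by_key {
    my ($key_of, @items) = @_;
    return map  { $_->[1] }                      # 3. unwrap the item
           sort { $a->[0] <=> $b->[0] }          # 2. compare the cached keys
           map  { [ $key_of->($_), $_ ] }        # 1. pair each item with its key
           @items;
}

# The workshop example: posts ordered by inode change time (stat field 10).
my @posts = sort_by_key(sub { (stat $_[0])[10] }, glob 'posts/*');
print "$_\n" for @posts;
```

The keys are compared with <=>, so a key function that returns strings would need cmp instead.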

Posted by ruoso in Perl.   Tags: Perl, perl6, perl5.

06.07.2009  Dice Game Perl 6

Following SF, I thought I could present an interesting solution to the dice game, as in "If you only had one programming language to choose –or– Let the FUD be with you".

SF rewrote the same algorithm in Perl 6, but I thought I could give the problem a more Perl 6 approach, leading to the following code:

sub dice($bet, $dice) {
  given $dice {
    when * <=  50 {          0 }
    when * <=  66 {       $bet }
    when * <=  75 { $bet * 1.5 }
    when * <=  99 {   $bet * 2 }
    when * == 100 {   $bet * 3 }
  }
}
sub MAIN($bet, $plays) {
  my $money = 0;
  $money += dice($bet, int(rand() * 100)+1) for ^$plays;
  say "{$bet * $plays}\$ became $money\$ after $plays plays:
     You get {$money / ($bet * $plays)}\$ for a dollar";
}

Let's go through the code step-by-step...

sub MAIN

This is a very handy thing that comes natively with Perl 6: if you declare a signature on the specially named subroutine MAIN, that signature is used as GetOpt instructions. In the code above I asked for two positional arguments, which means two command-line parameters:

perl6 dice.pl 40 100

But I could also ask for named parameters and it would require named command-line switches. Very handy.

for ^$plays

The prefix:<^> operator, when used with a number, generates a Range from 0 to that number - 1, so, it would be the same as 0..($plays - 1), but as the number of the play is not important here, it would have the same effect as 1..$plays... Very handy too.

"{$bet * $plays}"

Quotes in Perl 6 are clever: you can open a curly brace and type in an expression that will be evaluated.

when * <= 50

This is the Whatever in action. It will generate a closure that will ask for one parameter, when knows about it and sends the "given" value to it. Very handy indeed.
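For comparison, the same game can be sketched in plain Perl 5, where @ARGV takes the place of the MAIN signature and a chain of guarded returns takes the place of given/when with Whatever. The name payout and the default values for the arguments are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Payout for one play: $roll is expected to be an integer from 1 to 100.
sub payout {
    my ($bet, $roll) = @_;
    return 0          if $roll <= 50;
    return $bet       if $roll <= 66;
    return $bet * 1.5 if $roll <= 75;
    return $bet * 2   if $roll <= 99;
    return $bet * 3;  # only 100 is left
}

# Positional arguments come from @ARGV; the defaults are an assumption.
my $bet   = defined $ARGV[0] ? $ARGV[0] : 40;
my $plays = defined $ARGV[1] ? $ARGV[1] : 100;

my $money = 0;
$money += payout($bet, int(rand(100)) + 1) for 1 .. $plays;

printf "%s\$ became %s\$ after %s plays:\n", $bet * $plays, $money, $plays;
printf "You get %s\$ for a dollar\n", $money / ($bet * $plays);
```

With these payout tiers the expected return is 0.805 per dollar bet (out of 100 rolls, 16 pay 1x, 9 pay 1.5x, 24 pay 2x and one pays 3x), so the printed ratio should hover around that value for a large number of plays.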

Posted by ruoso in Perl.   Tags: Perl, snippet, perl6.

29.06.2009  Perl 6 - The quest for the holy grail

Here are the slides of the talk I presented at FISL. I really liked the audience's reaction. I was afraid of using simplistic examples or of alienating people, but the response was very positive. In general, I saw faces of surprise and satisfaction in the audience. Very good indeed.

Slides: Perl 6 - The quest for the holy grail

Posted by ruoso in Perl.   Tags: fisl10, Perl, fisl, slides, perl6.

28.06.2009  FISL10 - Almost YAPC

Getting home after FISL10, which happened in Porto Alegre, where all the free software geeks from Brazil get together not to attend any of the talks, but to drink beer and have lots of fun. This time was no different, despite all the distractions in the environment, including:

  • The Sun booth, which was frequently throwing out small balls with both "Java" and "Mysql" logos on them. The only thing we had to say about it: "it isn't a bad idea being able to kick java and mysql all day".
  • The globo.com booth showing the Brazil x South Africa match, including all the noise from the crowd around the TV.
  • The lines that constantly formed at the booths serving espressos.
  • Occasional use of speakers to announce some random prize or contest in some random booth.
  • The visit of the president, which kept us away from the user group area for an entire day, but didn't stop us from having the Perl booth open on the outside (Flavio Glock has pictures of the process of putting the banner on the light post).
  • One more example of the sexist intent of the ruby community, with three fashion models hired to walk around the entire event wearing shirts with "We hire rails" (I don't know if there was a pun intended or not, but still, a bad sexist idea).
  • The flaky network infrastructure. Lucky were those of us who had 3G modems from home, hoping no roaming fees will be charged.

All that being said, I had a lot of fun with Perl people from all around Brazil. I took part in the ceremony for the launch, as free software in the Brazilian Public Software Portal, of a system I've been working on for more than a year, and I gave a talk titled "Perl 6 - the quest for the holy grail". Each one deserves its own post. So that's all for now.

Posted by ruoso in Perl.   Tags: fisl10, Perl, fisl, community, free software, perl6, perlmonks.

16.06.2009  A plan for module loading in mildew

Module loading in Perl 6 is considerably different from Perl 5. The biggest difference is that in Perl 5 you're always in the main namespace, which is global. That means it doesn't matter what gets a module loaded: if it ends up in main, you'll be able to find it. Like:

# in some random file loaded indirectly...
*{'main::Foo::Bar::new'} = sub {
  return bless {}, 'Foo::Bar';
};
# in your source file
my $a = Foo::Bar->new();

In Perl 6 that is not the case: when you ask for Foo::Bar, it will do a lexical lookup for the package Foo, which means that it won't immediately look for it in the global namespace. All files do start parsing in the GLOBAL package, and by default a class is "our", so the class ends up in GLOBAL, but the lexical lookup doesn't find it there. Which means that:

use v6;
# in some random file loaded indirectly
class Foo::Bar {
  ...
}
# in your source file
my $a = GLOBAL::Foo::Bar.new(); # this will work
my $b = Foo::Bar.new(); # this won't

That means that, unlike Perl 5 require, module loading in Perl 6 is not just about loading a file, but it also needs to install a symbol in your lexical scope pointing to the loaded module. And to do that Perl 6 has four control statements: “need”, “use”, “require” and “defines”.

  • “need” simply loads something at compile-time without doing any import;

  • “require” is the run-time version of “need”;

  • “defines” explicitly tells to import some module's exports;

  • “use” is a shortcut to “need X defines *”

But that goes a bit beyond that. Consider the following source file, named Sense.pm:

class Sense {
  ...
}
class Nonsense {
  ...
}

When that module is loaded, the two classes are loaded into the comp_unit scope and aliased in the GLOBAL package (as they are “our”). When you

need Sense;

It will load Sense.pm, but it also needs a local symbol for the “Sense” name, so the implementation of “need” will, at some point, have access to the comp_unit scope of Sense.pm in order to find a symbol named “Sense” and alias it back into the current scope. The practical implication is that “Nonsense” is not available as a local symbol, but it also tells us a lot about how the loading process works. Basically:

The module loading requires the code causing the load to have access to the recently-loaded-module's outermost lexical scope!

So it seems that module loading will require caching the relationship between the filename and the outermost lexical scope, so that the compilation is reused when some other code “need”s this module. There are some other subtle issues about module loading that I'm going to address in other posts, but that's it for now.
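To make the caching idea concrete, here is a toy model written in Perl 5. It is not mildew's implementation: scopes are modelled as plain hashes, and compile_unit, need and %scope_for are hypothetical names invented for this sketch. It only illustrates the two points above, that the loader needs the loaded file's outermost scope and that only the requested name gets aliased into the caller.

```perl
use strict;
use warnings;

my %scope_for;          # filename => outermost lexical scope of that file
my $compilations = 0;   # counts how often a file is actually "compiled"

# Stand-in for the real compiler: pretends every file declares the two
# classes of the Sense.pm example and returns its outermost scope as a hash.
sub compile_unit {
    my ($file) = @_;
    $compilations++;
    return {
        Sense    => "class Sense ($file)",
        Nonsense => "class Nonsense ($file)",
    };
}

# Toy "need": load (or reuse) the unit, then alias only the symbol that was
# asked for into the caller's scope, modelled here as a hash reference.
sub need {
    my ($name, $caller_scope) = @_;
    my $file = "$name.pm";
    $scope_for{$file} ||= compile_unit($file);
    $caller_scope->{$name} = $scope_for{$file}{$name};
    return $scope_for{$file};
}

my (%first_caller, %second_caller);
need('Sense', \%first_caller);
need('Sense', \%second_caller);

print "compiled $compilations time(s)\n";
print "Nonsense visible: ", (exists $first_caller{Nonsense} ? "yes" : "no"), "\n";
```

Both callers end up with their own “Sense” alias pointing at the same cached unit, while “Nonsense” stays reachable only through the cached scope.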

Posted by ruoso in Perl.   Tags: loading, Perl, mildew, smop, perl6, module.

16.06.2009  RPN Calculator in Perl 6

As Wikipedia says, RPN (Reverse Polish Notation) is a notation, used by a kind of calculator, whose input is a bit different from what some people are used to. But it is a great way of dealing with long calculations without needing parentheses for grouping.

This notation is very common in financial calculators. Besides its use in math, implementing an RPN calculator is one of the “Hello World” problems implemented in a lot of programming languages. And, as we are talking about Perl 6, let's get our hands dirty.

Daniel Carrera implemented a version using grammars, but I wanted to make a different version using multi subs, which is one of the most important features of Perl 6. I'll just throw in the code, which you can save as rpn.pl:

multi infix:<rpn> (@stack, $num where /^ \d+[\.\d+]? $/) {
  [ @stack, $num ];
}
multi infix:<rpn> (@stack, $op where /^ ['+' | '-' | '*' | '/'] $/) {
  my $lhs = @stack.pop;
  my $rhs = @stack.pop;
  my $res = do given $op {
    when '+' { $rhs + $lhs }
    when '-' { $rhs - $lhs }
    when '*' { $rhs * $lhs }
    when '/' { $rhs / $lhs }
  }
  [ @stack, $res ]
}
multi infix:<rpn> ($any, $unknown) {
  die "Unknown expression near $any $unknown";
}
say [rpn] [], (~@*ARGS).words;

That being done, and assuming you already have rakudo installed, you can just run

perl6 rpn.pl "5 4 + 3 / 5 3 - *"

And with that you should get the result “6”. Don't worry if you didn't understand all the details of that code: it uses some new features of Perl 6, some of which don't exist in any other language. If you have experience with other languages, you might find it weird that this code doesn't have any loop, since you would need to iterate over all the items of the expression. In fact, this code shows the full extent of the functional heritage of Perl 6: everything happens through the “reduce” operator and the signatures of the candidates of the multi sub.

infix what?

One of the most unexpected things in this code is the “infix” in the name of the routines. The use of that term indicates that this is not a regular routine, but it is an operator, and that it can be used as any other operator.

What that means is that in Perl 6 all operators are defined at a high level. They are not language primitives, which means you can do:

multi infix:<+> ($a where 2, $b where 2) {
  return 5;
}
say 2 + 2;

And the result will be “5”, because the signature of this candidate is narrower than any other candidate for infix:<+> (Rakudo currently has a bug that prevents this code from working).

But after all, what is “infix”? Well, it defines the kind of operator. The following table explains the types of operators you can create:

Type           How it appears   Examples                 How to declare
Infix          $a OP $b         1 + 1, 4 * 4, 'a' x 4    multi infix:<+> {...}
Postfix        $aOP             $a++, 4i                 multi postfix:<i> {...}
Prefix         OP$a             ++$a, ^$a                multi prefix:<^> {...}
Postcircumfix  $aOP $b OP       $a{$b}, $a[$b]           multi postcircumfix:<{ }> {...}
Circumfix      OP $b OP         ( $a )                   multi circumfix:<( )> {...}

What that means is that those multi subs are in fact declaring a new operator, the “rpn” operator. And after those declarations you can simply write:

2 rpn 3

And it would pick one of the candidates to execute (in this case the “die” one). But we could successfully do the following using the operator:

(([] rpn 2) rpn 3) rpn '+'

Then it would return 5. But fortunately, for our sanity, there is a meta-operator in Perl 6 that does that in a much nicer way, which is “reduce”.

[rpn] [], 2, 3, '+'

The reduce meta-operator will call the infix rpn operator using the same kind of grouping of the previous example, making it all simpler.
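The same left-to-right folding can be sketched in Perl 5 with reduce from the core List::Util module. This is only an approximation of the Perl 6 code: there is no multi dispatch, so a single hypothetical rpn_step function inspects the token itself, and rpn_eval and %apply are likewise names invented for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Operator table: each entry receives the left and right operands.
my %apply = (
    '+' => sub { $_[0] + $_[1] },
    '-' => sub { $_[0] - $_[1] },
    '*' => sub { $_[0] * $_[1] },
    '/' => sub { $_[0] / $_[1] },
);

# One step of the fold: takes the current stack and a token and returns
# a new stack, like each candidate of infix:<rpn> does.
sub rpn_step {
    my ($stack, $token) = @_;
    return [ @$stack, $token ] if $token =~ /^\d+(?:\.\d+)?$/;
    if (my $op = $apply{$token}) {
        my @rest = @$stack;
        die "Not enough operands for $token\n" if @rest < 2;
        my $rhs = pop @rest;
        my $lhs = pop @rest;
        return [ @rest, $op->($lhs, $rhs) ];
    }
    die "Unknown expression near $token\n";
}

# Folds the tokens over an initially empty stack and returns its top value.
sub rpn_eval {
    my ($expression) = @_;
    my @items = ([], split ' ', $expression);
    my $stack = reduce { rpn_step($a, $b) } @items;
    return $stack->[-1];
}

print rpn_eval("5 4 + 3 / 5 3 - *"), "\n";   # prints 6
```

As in the Perl 6 version, the empty array is the seed of the reduction, so the first call receives the empty stack and the first token.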

The last line

This is a very concise line that says a lot, let's go step by step:

~@*ARGS

The @*ARGS variable contains the parameters sent to the program, and the prefix ~ operator is used to “stringify” a given value, which in the case of an array means that it will join all elements with a space. What that means is that it doesn't matter whether you send the entire expression as a single parameter or each part of the expression as a separate parameter: it will work either way.

(~@*ARGS).words

The “words” method on strings splits the string into a list of words. It's more or less the same thing as using split on whitespace, but as this is so common, there's a specific method for it.

[], (~@*ARGS).words

That line is composing a new list which will contain an empty array as the first element.

[rpn] [], (~@*ARGS).words

That line will call the reduce meta-operator with the rpn operator, using the list as the argument, and that will reduce the input list to the rpn result, which in the end is sent to the user with “say”.

Posted by ruoso in Perl.   Tags: snippet, perl6, Perl, programming, functional, multi, infix, code, reduce.