Lexical Analysis

Lexical analysis decomposes the input stream into a sequence of lexical units called tokens. Each token has an associated attribute that carries the corresponding information. Each time the parser requires a new token, the lexer returns the pair (token, attribute) that matched. When the end of input is reached, the lexer returns the pair ('', undef). You don't have to write a lexical analyzer: Parse::Eyapp automatically generates one for you from your %token definitions (file Infix.eyp):

%token NUM   = /([0-9]+(?:\.[0-9]+)?)/
%token PRINT = /print\b/
%token VAR   = /([A-Za-z_][A-Za-z0-9_]*)/

Here the order is important: the regular expression for PRINT is tried before the regular expression for VAR, so the input print is recognized as PRINT rather than as a variable name. The parentheses are also important. The lexical analyzer built from the regular expression for VAR returns ('VAR', $1). Be sure your first capturing group (memory parenthesis) holds the desired attribute.
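The effect of the ordering can be illustrated outside Parse::Eyapp with a small stand-alone sketch. The table @TOKENS and the helper first_token are hypothetical names introduced only for this illustration; this is not the code that Parse::Eyapp generates. The sketch also assumes that a definition without a capturing group (PRINT here) yields the token name as its attribute.

```perl
use strict;
use warnings;

# Hypothetical table mirroring the %token lines above; the order matters.
my @TOKENS = (
    [ NUM   => qr/([0-9]+(?:\.[0-9]+)?)/    ],
    [ PRINT => qr/print\b/                  ],
    [ VAR   => qr/([A-Za-z_][A-Za-z0-9_]*)/ ],
);

# first_token (hypothetical helper) returns the (token, attribute) pair of
# the first definition that matches at the start of $text.
sub first_token {
    my ($text) = @_;
    for my $def (@TOKENS) {
        my ($name, $re) = @$def;
        if ($text =~ /\A$re/) {
            # Assumption of this sketch: with no capturing group, $1 is
            # undefined, so the token name is used as the attribute.
            return ($name, defined $1 ? $1 : $name);
        }
    }
    return ('', undef);    # nothing matched
}

for my $text ('print x', 'printer', '3.14') {
    my ($token, $attr) = first_token($text);
    print "$text => ($token, $attr)\n";
}
# print x => (PRINT, PRINT)
# printer => (VAR, printer)
# 3.14 => (NUM, 3.14)
```

If the VAR entry were placed before the PRINT entry, the input print x would be classified as ('VAR', 'print'), which is why PRINT has to come first.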

The lexical analyzer can also be specified through the %lexer directive (see the head section in file InfixWithLexerDirective.eyp of the Parse::Eyapp distribution). The directive %lexer is followed by the code of the lexical analyzer. Inside such code the variable $_ contains the input string. The special variable $self refers to the parser object. The pair ('', undef) is returned by the generated lexer when the end of input is detected.

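%lexer  {
 # $_ holds the input; \G anchors each match at the current position
 m{\G[ \t]*}gc;                  # skip blanks and tabs
 # pass the number of newlines consumed to tokenline
 m{\G(\n+)}gc                    and $self->tokenline($1 =~ tr/\n//);
 m{\G([0-9]+(?:\.[0-9]+)?)}gc    and return ('NUM',   $1);
 m{\Gprint\b}gc                  and return ('PRINT', 'PRINT');
 m{\G([A-Za-z_][A-Za-z0-9_]*)}gc and return ('VAR',   $1);
 # any other character is returned as its own token
 m{\G(.)}gc                      and return ($1,      $1);
}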

In the code example above, the attribute associated with token NUM is its numerical value and the attribute associated with token VAR is the actual string. Some tokens, like PRINT, do not carry any special information. In such cases, just to keep the protocol simple, the lexer returns the pair (token, token).

When fed the input b = 1, the lexer produces the sequence

          ('VAR', 'b') ('=', '=') ('NUM', '1') ('', undef)
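This sequence can be reproduced with a stand-alone sketch of the same matching logic. The subroutines next_token and tokenize are hypothetical names used only here. Because there is no parser object outside Parse::Eyapp, the sketch drops the call to $self->tokenline and simply skips all whitespace, and it returns the end-of-input pair ('', undef) explicitly.

```perl
use strict;
use warnings;

# next_token (hypothetical) scans $_ from its current pos() with \G and
# the /gc flags, in the same way as the body of a %lexer directive.
sub next_token {
    m{\G\s*}gc;    # simplification: skip all whitespace, no line counting
    m{\G([0-9]+(?:\.[0-9]+)?)}gc    and return ('NUM',   $1);
    m{\Gprint\b}gc                  and return ('PRINT', 'PRINT');
    m{\G([A-Za-z_][A-Za-z0-9_]*)}gc and return ('VAR',   $1);
    m{\G(.)}gc                      and return ($1,      $1);
    return ('', undef);             # end of input
}

# tokenize (hypothetical) collects every pair up to and including the
# end-of-input pair.
sub tokenize {
    my ($input) = @_;
    local $_ = $input;              # the lexer code reads its input from $_
    my @pairs;
    while (1) {
        my ($token, $attr) = next_token();
        push @pairs, [ $token, $attr ];
        last if $token eq '';
    }
    return @pairs;
}

my @shown = map {
    my ($token, $attr) = @$_;
    defined $attr ? "('$token', '$attr')" : "('$token', undef)";
} tokenize('b = 1');
print "@shown\n";
# ('VAR', 'b') ('=', '=') ('NUM', '1') ('', undef)
```

The /c flag keeps the current position when a match fails, which is what allows the alternatives to be tried one after another at the same point of the input.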

Lexical analyzers can have a non-negligible impact on overall performance. Ways to speed up this stage can be found in the works of Simoes [5] and Tambouras [6].

Procesadores de Lenguajes 2010-01-31