('', undef).
You don't have to write a lexical analyzer:
Parse::Eyapp
automatically generates one
for you using your %token definitions
(file Infix.eyp):
%token NUM = /([0-9]+(?:\.[0-9]+)?)/
%token PRINT = /print\b/
%token VAR = /([A-Za-z_][A-Za-z0-9_]*)/
Here the order is important: the regular expression
for PRINT is tried before the regular expression
for VAR. The parentheses are also important: the
lexical analyzer built from the regular expression
for VAR returns ('VAR', $1). Be sure your first capturing
parenthesis holds the desired attribute.
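The ordering and capture rules above can be sketched in plain Perl. This is not the code Parse::Eyapp generates; it is a minimal illustration, and the names @TOKENS and tokenize are hypothetical. Falling back to the token name when a regular expression has no capturing parenthesis is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical token table mirroring the %token lines; order matters.
my @TOKENS = (
    [ NUM   => qr/([0-9]+(?:\.[0-9]+)?)/    ],
    [ PRINT => qr/print\b/                  ],
    [ VAR   => qr/([A-Za-z_][A-Za-z0-9_]*)/ ],
);

# Return a list of [name, attribute] pairs, ending with ['', undef].
sub tokenize {
    my ($input) = @_;
    my @out;
    pos($input) = 0;
    while (1) {
        $input =~ m{\G\s+}gc;                      # skip white space
        if (pos($input) >= length($input)) {       # end of input
            push @out, [ '', undef ];
            last;
        }
        my $matched = 0;
        for my $t (@TOKENS) {                      # try the rules in order
            my ($name, $re) = @$t;
            if ($input =~ m{\G$re}gc) {
                # The first capture group is the attribute
                # (assumption: use the token name when there is none).
                push @out, [ $name, defined $1 ? $1 : $name ];
                $matched = 1;
                last;
            }
        }
        next if $matched;
        # Any other character is returned as a token of its own.
        $input =~ m{\G(.)}gcs and push @out, [ $1, $1 ];
    }
    return @out;
}
```

With this table, the input print x1 yields a PRINT token followed by VAR with attribute 'x1'. If the VAR entry were listed before PRINT, the word print would be reported as a VAR instead.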
The lexical analyzer can also be specified
through the %lexer directive (see
the head section in file InfixWithLexerDirective.eyp
of the
Parse::Eyapp distribution).
The directive %lexer is followed by the code of the lexical analyzer.
Inside such code the variable $_ contains the input string. The special
variable $self refers to the parser object. The pair ('', undef)
is returned by the generated lexer when the end of input is detected.
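%lexer {
    m{\G[ \t]*}gc;                                   # skip blanks and tabs
    # capture the whole run of newlines so that tr counts all of them
    m{\G(\n+)}gc                    and $self->tokenline($1 =~ tr/\n//);
    m{\G([0-9]+(?:\.[0-9]+)?)}gc    and return ('NUM',   $1);
    # \b keeps identifiers such as "printer" from matching PRINT
    m{\Gprint\b}gc                  and return ('PRINT', 'PRINT');
    m{\G([A-Za-z_][A-Za-z0-9_]*)}gc and return ('VAR',   $1);
    m{\G(.)}gc                      and return ($1,      $1);
}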
In the code example above
the attribute associated with token NUM
is its numerical value and the attribute associated with
token VAR is the actual string.
Some tokens, like PRINT, do not carry any special
information. In such cases, just to keep the protocol
simple, the lexer returns the pair (token, token).
When fed the input b = 1, the lexer
produces the sequence
('VAR', 'b') ('=', '=') ('NUM', '1') ('', undef)
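To check this sequence without building a parser, the lexer rules can be wrapped in a closure and run on their own. The following is only a sketch: StubParser is a hypothetical stand-in for the parser object (its tokenline merely adds to a line counter, which is an assumption about what the real method does), and make_lexer is an illustrative name, not part of Parse::Eyapp. The end-of-input pair is returned explicitly here, since there is no generated wrapper to supply it.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for the parser object: it only counts lines.
    package StubParser;
    sub new { return bless { line => 1 }, shift }
    sub tokenline {
        my ($self, $n) = @_;
        $self->{line} += $n if $n;
        return $self->{line};
    }
}

# Build a lexer closure over $input; each call returns one (token, attribute) pair.
sub make_lexer {
    my ($self, $input) = @_;
    return sub {
        for ($input) {    # alias $_ to the input, as in %lexer code
            m{\G[ \t]*}gc;
            m{\G(\n+)}gc                    and $self->tokenline($1 =~ tr/\n//);
            m{\G([0-9]+(?:\.[0-9]+)?)}gc    and return ('NUM',   $1);
            m{\Gprint\b}gc                  and return ('PRINT', 'PRINT');
            m{\G([A-Za-z_][A-Za-z0-9_]*)}gc and return ('VAR',   $1);
            m{\G(.)}gc                      and return ($1,      $1);
            return ('', undef);             # nothing left: end of input
        }
    };
}

# Print every pair produced for the input "b = 1".
my $lexer = make_lexer(StubParser->new, "b = 1");
while (1) {
    my ($token, $attr) = $lexer->();
    printf "(%s, %s)\n", "'$token'", defined $attr ? "'$attr'" : 'undef';
    last if $token eq '';
}
```

The match position is kept in the input scalar itself (through \G and the /gc flags), so successive calls to the closure resume where the previous one stopped.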
Lexical analyzers can have a non-negligible impact on overall performance. Ways to speed up this stage can be found in the works of Simoes [5] and Tambouras [6].
Procesadores de Lenguajes 2010-01-31