If you don't understand this first read about EBNF.
Non terminals
PROGRAM ::= DECLARATIONS STATEMENTS
It's not that difficult to allow declarations everywhere in the code, it's just that I prefer to have it at the top. This way it's easy to find the declaration when you are looking at the code.
DECLARATIONS ::= ((DECLARATION|FUNCTION) SEMICOLON)*
Well ok, this allows you to mix variable declarations and function declarations, but that's up to you. You might prefer to declare vars first and then function. This gives you the freedom.
Note that this rule may be empty. I use the semicolon as a divider of parts, and it's checked here and not in the declaration/function because it's part of the block, seperating the elements.
DECLARATION ::= VAR TYPE IDENTIFIER
C style variable declaration, but prefixed with a var keyword. This is to make it a bit easier for the parser.
TYPE ::= INTEGER | STRING | FLOAT | BOOLEAN
We've got four types, I've not thought about custom type declarations.
IDENTIFIER ::= __implicit__
Identifiers are returned by the tokenizer so I haven't added it to the language specification.
FUNCTION ::= FUNC TYPE IDENTIFIER LBRACK ARGUMENTS RBRACK DECLARATIONS BEGIN STATEMENTS END
A function declaration including body. You always have to return a type, this is used to declare the always present local variable result (just like in Pascal functions). Before the function body you can define local variables (and even functions). The function body is encapsued with a begin and end just like in Pascal. I like this way, ok it's more to write but it makes to code more readable, ok I just like it since I can easily replace it with { and } or whatever I want.
ARGUMENTS ::= (TYPE LVALUE (SEMICOLON TYPE LVALUE)*)?
Function arguments are optional, and to prevent you from adding a semicolon after the last argument I used this construction. It's doesn't matter for the resulting AST, just the parser.
STATEMENTS ::= (STATEMENT SEMICOLON)*
Zero or more statements, yes it's possible to define a program without any code to be executed, this might be usefull for function declarations to be reused or something.
STATEMENT ::= ASSIGNMENT | WHILEDO | IFTHENELSE
A statement can be everything, an assignment or a control block like while ... do or if ... then ... else. It's easy to add new kind of statements here.
ASSIGNMENT ::= IDENTIFIER BECOMES EXPR | FUNCTIONCALL
Well this is the reason why BruteForce is no longer LL(1), because I have to look a head to see if it's an assignment or an function call. Both start with an identifier. I could work around it but this would require me to add a keyword to one of them. For example call function_name()
WHILEDO ::= WHILE EXPR DO CODEBLOCK CODEBLOCK ::= STATEMENT | (BEGIN STATEMENTS END)
The while and if control blocks accept code blocks this is either one statement or more, because more statements qould require to use the semicolon to divide them a code block will start with begin and end with end .
IFTHENELSE ::= IF EXPR THEN CODEBLOCK (ELSE CODEBLOCK)?
I've made the else part optional, empty else parts are just ugly and it can easily be avoided since it's a diffirence between a semicolon and else
EXPR ::= BOOLEX
Here's an nice example of a pretty useless rule, but maybe when I want to extend expression with something that has a lower precendence than a boolean expression I would only need to update one rule instead of a lot.
The following rules define the operator precedance, the sooner the lower the priority. The definition below allows constructs like:
1+2+3+4+5 1*2+3/5<=4+1
BOOLEX ::= ACCUM ((LT|LE|GT|GE|EQ|NE) ACCUM)* ACCUM ::= MULT ((PLUS|MINUS) MULT)* MULT ::= PREOP ((MULTIPLY|DIVIDE) PREOP)* PREOP ::= (MINUS|NOT)? OPERAND OPERAND ::= IDENTIFIER | FUNCTIONCALL | INTVALUE | FLOATVALUE | STRINGVALUE | BOOLVALUE | LBRACK EXPR RBRACK
An operant is either a constant value, a variable identifier, a function call or again an expression between braches. These expressions between braces are very important, it allows you to overrule the operator precedance. You won't see these braces in the AST anymore because they are no longer needed in there, the operators are already set in the correct order.
FUNCTIONCALL ::= IDENTIFIER LBRACK (EXPR (COMMA EXPR)*)? RBRACK
The arguments of a function call are expressions, this allows you to use a constant or expression besides the normal identifiers.
INTVALUE ::= __implicit__ FLOATVALUE ::= __implicit__ STRINGVALUE ::= __implicit__ BOOLVALUE ::= TRUE|FALSE
Boolean values are returned as an identifier by the tokenizer, so we need to filter those out manualy.
Terminals
BEGIN ::= 'begin' END ::= 'end' SEMICOLON ::= ';' VAR ::= 'var' INTEGER ::= 'int' STRING ::= 'string' FLOAT ::= 'float' BOOLEAN ::= 'bool' FUNC ::= 'function' BECOMES ::= '=' WHILE ::= 'while' DO ::= 'do' IF ::= 'if' THEN ::= 'then' ELSE ::= 'else' LT ::= '<' LE ::= '<=' GT ::= '>' GE ::= '>=' EQ ::= '==' NE ::= '!=' PLUS ::= '+' MINUS ::= '-' MULTIPLY ::= '*' DIVIDE ::= '/' MOD ::= '%' NOT ::= '!' LBRACK ::= '(' RBRACK ::= ')' TRUE ::= 'true' FALSE ::= 'false' COMMA ::= ','
Sample code
Here's a piece of example code that will print all easters dates from 2003 to 2012, unless you provide diffirent values on the input
var int y; var int ey; function string easter(int year) var int m; var int d; var int g; var int c; var int x; var int z; var int b; var int e; begin g = year % 19 + 1; c = (year / 100) + 1; x = (3*c / 4) - 12; z = ((8*c + 5) / 25) - 5; b = (5*year / 4) - x - 10; e = (11*g + 20 + z - x); e = e % 30; if e < 0 then e = e+30; if (((e == 25)+(g>11) == 2) + (e==24)) > 0 then e = e+1; d = 44-e; if d < 21 then d = d+30; d = d+7-((b+d) - ((b+d)/7)*7); if d>31 then m = 4 else m = 3; if d>31 then d = d-31; result = ""+year+"-"+m+"-"+d; end; y = 2003; ey = 2012; if (argc() == 1) then ey = Int(argv(0)); if (argc() == 2) then begin y = Int(argv(0)); ey = Int(argv(1)); end; print("Day of Easter for "+y+"-"+ey+"."); while (y <= ey) do begin print(easter(y)); y = y+1; end;