|
Home FAQ Documentation Download Contact |
These methods produce the same parse tree. The difference is in the order of construction. In top-down parsing the parse tree is constructed from the top down. In bottom-up parsing the parse tree is constructed from the bottom up. The bottom-up method has more powerful recognition capability because it delays parsing decisions as long as possible. However, since the semantic actions must also be delayed in the bottom-up method, translation/compiling is generally easier with top-down parsing.
Parsing is recursive by nature. Nested constructs like ((x(y))) usually need to be evaluated from the inside out. Recursion handles this process easily by saving partially evaluated constructs on a stack until the innermost parts have been evaluated. If the stack is explicit, the parsing method is referred to as table-driven. If programming language recursion is used, the parsing method is called recursive-descent. In recursive-descent parsing, a subroutine is written for every nonterminal in the grammar. These subroutines call each other in the order specified by the grammar to simulate traversing a parse tree. Most grammars are hierarchical in nature, so descending the grammar from the top down is the parsing procedure used in top-down parsing. Most grammars have repeating constructs, so recursion is necessary. Hence the term recursive-descent parsing.
No. Highly ambiguous languages are difficult to parse using techniques that are primarily deterministic. The GLR parsing method is better in these cases because it was designed specifically to handle ambiguity.
A parser generator is a tool that automates the process of creating parse tables from a grammar. It may also generate parsing, or even compiling code from the tables. In top-down parsing, this can be in the form of parse tables and a driver, or in the form of a recursive-descent program. Bottom-up parsers must be table-driven since there is no bottom-up correlary to programming recursion. An advantage to using a parser generator is that it verifies that the grammar is suitable to the parsing method being used. Programmers rarely take the time to fully analyze the grammar when creating a recursive-descent parser. This can lead to subtle errors that only appear on certain inputs to the program.
This grammar is ambiguous, so it is not LL(k) or LR(k). The problem is that the grammar does not specify operator precedence and associativity, so more than one parse tree can be constructed for some inputs. It is easy to see that symmetrical constructs like E=E can have more than one parse derivation. A bit more subtle is the ambiguity caused by the interaction of E[int] and E=E. The grammar does not specify whether E=E[int] should be parsed as E=(E[int]) or as (E=E)[int]. Since either interpretation is possible, the grammar is ambiguous. Many grammars can be made LL(1) by first removing ambiguity, then removing left-recursion, and finally left-factoring the grammar. One reasonable and unambiguous form of this grammar would be
E -> F = E E -> F F -> F [ int ] F -> G G -> id G -> & id G -> ( E )This gives low precedence to = and higher to []. The grammar makes = right associative, as is usually the case. The first production would be the following if left associativity were required.
E -> E = FRemoving left-recursion results in the following grammar.
E -> F = E E -> F F -> G more_F more_F -> [ int ] more_F more_F -> G -> id G -> & id G -> ( E )Left-factoring results in the following LL(1) grammar.
E -> F E_tail E_tail -> = E E_tail -> F -> G more_F more_F -> [ int ] more_F more_F -> G -> id G -> & id G -> ( E )
Note that once the ambiguity is removed from the grammar, the other transformations to make the grammar LL(1) could be done using a tool like SLK. In the general case, removing ambiguity from a grammar cannot be done deterministically because no algorithm is capable of discerning the intent of the grammar.
LL(k) parsing uses the next k input tokens to make parsing decisions. LL(1) is LL(k) for the case where k=1. LL(k) parsing is considerably more powerful than LL(1) parsing because of the proper superset containment relationship for different k values. For all k>0, the LL(k) languages are a proper superset of the LL(k-1) languages. Note that this is a stronger statement than a similar grammar containment would be. For example, the C programming language is LL(2) but not LL(1), so an LL(1) grammar does not exist for the language.
Extended BNF adds a zero-or-one grouping operator, [], and a zero-or-more, {}, grouping operator to the BNF productions. This shorthand notation makes a grammar more compact. For example,
S -> a A a A -> b A ->can be expressed in EBNF notation as
S -> a [b] a
The use of EBNF notation also tends to avoid the occurrence of some types of grammar conflicts. The only top-down conflicts that can be addressed by EBNF are those that result from the conflict of a null and a non-null production. These conflicts are avoided by substitution in the grammar. The EBNF construct is in effect a new and unique nonterminal that occurs in only one place in the grammar. This removes the possibility of a conflict, since only one continuation is present in the single production. If the same EBNF construct occurs elsewhere, it is a different nonterminal, so no conflict is possible. However, this does not apply to left-recursion and left-factoring conflicts. These cannot be avoided by using EBNF. (EBNF direct left-factoring can be done if grouping and or are supported, but not indirect left-factoring).
Clearly, the same conflict avoidance can be achieved with a fully substituted BNF grammar, so EBNF is not more powerful than BNF notation, at least for LL(k) grammars. Not all conflicts are the result of a null/non-null construct. In the C language, of the nine deterministic LL(1) conflicts, only two are the result of null/non-null productions, and even these two cannot be resolved by using EBNF. All of the conflicts can be resolved using an LL(2) grammar, suggesting that the LL(k) method is preferable to EBNF in the area of conflict resolution.
An interesting consequence of the substitution property of EBNF is that it can make an LL(k) grammar strong LL(k) under the covers. Consider the classic example of a grammar that is LL(2), but not strong LL(2).S -> a A a S -> b A b a A -> b A ->The problem is that "ba" predicts both of the S productions. The left context before the A is needed to resolve the conflict. This grammar is made strong LL(2) by adding a new, duplicate nonterminal A2 in the following way.
S -> a A a S -> b A2 b a A -> b A -> A2 -> b A2 ->Now the left context is not needed because A and A2 are different nonterminals. The equivalent EBNF grammar is also strong LL(2) in its natural expression, without the need to convert a grammar. The LL(2)/strong LL(2) issue never arises at all.
S -> a [b] a S -> b [b] b a
The advantage of EBNF in the case of this grammar is clear.
LL(k) parsing uses all of the left context of the sentence being parsed. This makes it the most powerful top-down deterministic parsing method. The strong LL(k) parsing method does not use any of the left context of the parse. This is somewhat less powerful, but uses much smaller grammars and parsing tables. As noted above, EBNF SLK grammars are likely to be fully LL(k).
It depends on the language to be parsed. If the language is highly ambiguous, a GLR parser is a good choice because the GLR method was designed specifically to handle ambiguity. If the language is very small, and you have not already used a parser generator, hand-coded recursive-descent is probably a good choice. It is possible that you could be finished with the parser in less time than it would take to master the intricacies of a complicated tool like YACC. For larger projects, the time invested in learning to use a parser generator will probably be offset by the time saved in coding and debugging a recursive-descent parser. This is because recursive-descent parsers require considerably more hand-written code than when a parser generator is used.
If a suitable grammar for the language can be found or constructed, a top-down parser generator is probably a better choice than a bottom-up parser generator like YACC. Top-down parser generators are usually easier to use, and also easier to learn to use. They follow the familiar paradigm of recursion. Bottom-up parsing is a little like recursion that is turned inside out, or maybe upside down. It takes some getting used to before it becomes comfortable.
Lexical analysis is easier and lower-level than parsing. The need for a tool is questionable, especially if a template is available. An efficient scanner is necessary to prevent a bottleneck in the overall program. The scanner processes characters one at a time and assembles them into larger tokens for the parser. If this process is not efficient, the effect can be noticeable even though the compiler often does more actual processing. Most tools do produce fairly efficient scanners, so this is usually not a problem. FLEX, FLEX++, JFLEX, and RE2C are some popular scanner generators.
Another reason to hand code the scanner is for flexibility. Often things can be done in the scanner to make an otherwise difficult parsing problem much easier. Some of these techniques cannot be implemented using a generator. One approach to creating a scanner is to use a lex tool for the prototype, and then to hand code a scanner later if necessary.
item|item2:
item
item2
The syntactic conversion can be done automatically by SLK:
slk -ys -f=new_grammar yacc_grammar
The SLK grammar from yacc_grammar will now be in new_grammar.
The new grammar will probably need grammar conversions to put
it in top-down form. SLK can also help with these conversions.
Use the SLK options to remove left recursion and to do left factoring
where needed.
Permutations are usually handled semantically. Use a broad grammar that accepts a superset of what you want, and check for invalid input in the action code. Were both D and E present? Are any phrases repeated? ...
s : A { phrase }+
phrase :
B = "1"
C = "2"
D = "3"
E = "4"
F = "5"
No. Much is made of the "Turing power" of a programming language by those who do not really know parsing theory. The best that a recursive descent parser can be is LL(k). Since only a single lookahead is used by most, they are actually LL(1), which is very weak. The perceived power comes from the fact that the grammar is loosely defined, and not checked for correctness. Like all programs, the parser is only known to be correct for the test suite that is used.
Yes and no. The parser generator itself only deals with integer tokens, not with any kind of text. The scanner converts the input text into a stream of tokens, so it is responsible for handling Unicode. The sample C/C++ scanners are not written to handle Unicode, though they could be modified to do so. The sample Java/C# parsers do handle Unicode using the languages' built-in support.
Any compiler text book is a good source of information.
It is a form of top-down short-circuit backtracking parsing. SLK implements a limited, controlled version of this and calls it predictive backtracking. Backtracking is always inefficient, so it should be used with care. The short-circuit algorithm picks the first syntactically valid parse that it encounters, not trying any others at all. In the case of the C++ expression/declaration ambiguity, this is clearly correct because the language defines it to be so. In other cases, the programmer should be cautious, and understand what he is doing.
It is a limited version of LL(inf) that only backtracks over repeating constructs to resolve conflicts. It helps reduce the need for left factoring in these cases.
Yes, quite a few from very small to very large. They are not listed here mainly because of nondisclosure concerns.