

A Prolog Lexer Generator, Part 1: Love (of DCGs) Is Not Enough

Edit (8-26-2010): After receiving feedback on this post, I have added a description of DCGs for those with a passing familiarity with Prolog but who are not familiar with DCGs themselves.

As a side project, I’ve been tinkering with implementing a lexer generator for Prolog. To that end, I figure I’ll blog about my work as it progresses (and as I have time), essentially creating a development log of sorts for the project. So to start things off, I’ll describe some of the rationale behind my decision to undertake this project.

It may seem somewhat unnecessary to have an actual lexer generator for Prolog. After all, DCGs are built into most Prolog implementations, and we can encode a lexer as a DCG; in fact, we don’t really need a lexer, since we can handle terminals in the DCG parser, right? Why have a separate system at all? So as a first post on the topic, let’s examine whether DCGs are sufficient to replace a dedicated lexer generator.

What are DCGs?

Before we answer this question, let’s walk through a brief overview of DCGs, for the benefit of those who have a passing familiarity with Prolog but who have not seen this extension of the language.

DCG is short for Definite Clause Grammar. In pure Prolog, all clauses (e.g., rules of the form p(X) :- q(X), r(X).) are called “Definite Clauses”; thus, a Definite Clause Grammar is a grammar whose production rules are expressed as definite clauses. Rather than using the normal :- “neck” operator, DCGs add a new neck operator, -->. This signals the Prolog compiler to treat the clause as a DCG clause, rather than a traditional Prolog clause. Goals (i.e., calls to predicates) in the body of a DCG clause can take three forms:

  1. A regular Prolog predicate call, such as p(X) or rule2.
  2. A list of atoms, such as ['a'], ['one', 'two', '3'], etc.
  3. A sequence of Prolog predicate calls wrapped in curly brackets, such as {p(X), q(X), r(Y)}.

For a DCG clause, the Prolog compiler takes the head (the predicate name and arguments on the left-hand side of the -->) and each of the goals in the body and modifies them as follows:

  1. The head and the regular Prolog predicate calls have two arguments added to the end of the call. The first argument is a list of incoming atoms, while the second is the same as the first list except all atoms at the beginning of the list that are matched by the rule are removed. So the outgoing argument of one call is the incoming argument of the next call. The head’s arguments are linked to the first and last calls in the body so the incoming argument of the head is the incoming argument of the first call, and the outgoing argument is the outgoing argument of the last call.
  2. Lists of atoms are transformed to a call to a special predicate that succeeds when the list of terminals is the prefix of the incoming list; the outgoing list is the same as the incoming list with the prefixed terminals removed.
  3. A sequence of calls inside curly brackets are copied verbatim to the body with no modification.

There are a few other minor things that the compiler will do in special cases, but this covers the majority of cases. Here is an example for the traditional a^n b^n grammar:

s --> ['a'], s, ['b'].
s --> ['a'], ['b'].
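
To make the transformation concrete, here is roughly what those two rules compile to. (This is only a sketch: the exact translation varies between implementations, and many use a dedicated terminal-matching predicate rather than the direct unifications shown here.)

s(S0, S) :- S0 = ['a'|S1], s(S1, S2), S2 = ['b'|S].
s(S0, S) :- S0 = ['a'|S1], S1 = ['b'|S].

The grammar can then be queried through the two extra arguments, or via the standard phrase/2 wrapper:

?- s(['a', 'a', 'b', 'b'], []).
true.

?- phrase(s, ['a', 'a', 'b']).
false.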

The Usual Sentences

Though I’ve heard the “all you need is DCGs” attitude explicitly a few times, I’ve seen it implicitly much more often, starting with the canonical example used in the literature and lectures on the subject of DCGs — “sentence = noun phrase + verb phrase.” I’ve seen this example (almost) everywhere DCGs are brought up: my Prolog-related lectures here at UTD (a similar lecture example is available here), the Prolog Wikibook chapter on DCGs, the Wikipedia DCG page, the Amzi! Prolog manual, the “Logic Grammars” chapter of The Art of Prolog, and so on. In all of these examples, the tokens are hard-coded into the DCG grammar, with little to no discussion of writing a lexer for the grammar in sight. The Art of Prolog provides perhaps the worst gloss-over of the impact of lexical analysis in its final chapter[1], which is tasked with writing a compiler for a simplified Pascal dialect. When introducing the compiler, the authors write

Both the first stage of lexical analysis and the final output stage are relatively uninteresting and are not considered here. The top level of the code handles syntax analysis, code generation, and assembly. [The Art of Prolog, p. 460, emphasis mine]

(Note: This is not to say that The Art of Prolog is a bad book; on the contrary, it’s one of the best treatments of Prolog and logic programming available and should be on any logic programmer’s bookshelf.)
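
For reference, the canonical grammar usually looks something like the following sketch (predicate names vary from one presentation to the next), with the tokens hard-coded directly into the production rules:

sentence --> noun_phrase, verb_phrase.
noun_phrase --> determiner, noun.
verb_phrase --> verb, noun_phrase.
determiner --> [the].
noun --> [cat].
noun --> [mouse].
verb --> [chases].

?- phrase(sentence, [the, cat, chases, the, mouse]).
true.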

All of these discussions broadcast the same message: writing a lexer is unimportant and even trivial compared to the OOH SHINY DCGs! THEY MAKE PARSING SO EASY! DCGs are presented as the end-all-be-all, the sum of everything you would need for syntactic analysis of a language.

Despite its heavy use, even the sentence grammar itself is actually a poor example of how to write a lexer and parser for a non-trivial application, due to its lack of separation between the lexing and parsing phases. Any compiler book worth its salt (e.g., the Dragon Book) will explain the benefit of having a separate lexer stage, rather than cramming the lexer into the parser. It simplifies the implementation and makes it possible to separate out the different errors that arise during syntactic analysis; the latter is another vital aspect of lexing and parsing that usually gets glossed over during discussions and instruction on DCGs. The bad form of the sentence grammar only serves to further trivialize the lexing stage, essentially presenting it as “so unimportant we actually don’t have to do it properly.”
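
Concretely, separating the stages means the parser’s DCG consumes a list of tokens produced by a distinct lexing pass rather than raw characters. A hypothetical two-stage pipeline might look like this (tokens//1 and program//1 are stand-ins for a real lexer and parser):

% Hypothetical driver: lex characters to tokens, then parse tokens to a tree.
analyze(Source, Ast) :-
    atom_codes(Source, Codes),
    phrase(tokens(Tokens), Codes),   % lexing stage: character codes -> tokens
    phrase(program(Ast), Tokens).    % parsing stage: tokens -> syntax tree

A failure in the first phrase/2 call is a lexical error and a failure in the second is a parse error, so each can be caught and reported on its own terms.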

DCGs Are Not Lexer Generators

The overall lack of information on writing a lexer using DCGs does not necessarily mean that they cannot be used, or that other techniques are not available to perform lexing in Prolog. For now, I will restrict my attention to DCGs and demonstrate that DCGs are not lexer generators and are, in their current form, inadequate to replace them. Alternatives to DCGs will be discussed in a near-future post.

Let us consider what DCGs truly are — purely syntactic sugar and nothing more. No restriction of computational power is made (the full Turing-complete power of Prolog is available, due to the ability to call arbitrary predicates from inside rules and to embed arbitrary code in curly braces inside a rule), and no semantic analysis of the DCG predicate is performed, which rules out automatically applying the bread-and-butter regex -> NFA -> DFA construction at the core of a lexer generator. Modifying a DCG implementation so it can serve this role is pretty much out of the question, too. Compare the analysis required for a DCG-based lexer to the analysis required for one generated by any lexer generator in existence:

  1. Because there is no clear indication of which DCG rules are part of the given lexer, analysis of the call graph is necessary to determine the start state of the lexer and which rules should be considered part of our full lexer. In the worst case, we could have unresolved symbols for calls to predicates in other files, making static analysis nigh-impossible. Lexers for other systems must either be self-contained or use well-known inheritance and inclusion constructs to bring in lexer rules from other files (see, for example, ANTLR).
  2. Once we have obtained our full set of lexer rules, we must ensure that the DCG rules correspond to the regular-language class of computation (see the sketch after this list). Other lexer generator systems do not have this issue, as the lexer files are syntactically and semantically restricted so they can only express regular computations, and any non-regular computations are pushed to actions performed after the token has been recognized.
  3. Only after these two steps are performed can we actually perform the regex -> NFA -> DFA transformations along with any other optimizations deemed appropriate.
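
To make the second point concrete: nothing in DCG syntax marks a rule as regular. The three hypothetical rules below look equally like lexer rules, but only the first could be compiled to a DFA; the second recognizes the non-regular a^n b^n language, and the third hides arbitrary computation behind a curly-brace goal (some_check/1 stands in for any user predicate):

% Regular: equivalent to the character class [0-9].
digit(D) --> [D], { D >= 0'0, D =< 0'9 }.

% Context-free, not regular: matched nesting requires counting.
nested --> [a], nested, [b].
nested --> [].

% Unanalyzable in general: a {}/1 goal can run arbitrary Prolog.
tricky(X) --> [X], { some_check(X) }.

Note that even the regular rule uses {}/1 as a character guard, so an analyzer would have to prove which curly-brace goals are harmless guards and which are not, a problem that is undecidable in general.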

So to achieve parity with other lexer generators, a Prolog implementation supporting DCGs would have to perform two extremely non-trivial steps before the actual generation of lexer code begins. To my knowledge, no Prolog implementation performs these steps. Thus, a programmer has four options:

  1. Use DCGs “declaratively,” considering the shortcomings of the implementation as unimportant, then recoil in horror when the performance craters and error handling becomes a nightmare. This is the attitude promoted by the prominent examples of implementing syntactic analysis using DCGs and is in practice the least usable.
  2. Use DCGs to encode a prototype lexer for testing, then throw away that implementation and rewrite it by hand in Prolog or another language. This is pretty pointless when you think about it, especially since Prolog advocates point to DCGs as being such a declarative means of writing parsers and the like.
  3. Add these reasoning steps to a Prolog implementation, a very non-trivial task. In particular, someone merely experimenting with Prolog after hearing about the power of DCGs is going to switch to a different platform or implementation strategy long before attempting this route.
  4. Use DCGs to implement a lexer, but perform the regex -> NFA -> DFA transformation and all other optimizations by hand (a sketch of this style follows the list). I’d expect this is the most common case in practice, but it again flies in the face of the purported declarative nature of the language.
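
For a taste of option (4), here is a minimal hand-written DCG lexer sketch for a hypothetical token set of integers and identifiers. The cuts and the clause ordering encode, by hand, exactly the determinization and maximal-munch decisions a lexer generator would derive automatically:

% Skip whitespace between tokens; stop when no further token can be read.
tokens([T|Ts]) --> blanks, token(T), !, tokens(Ts).
tokens([])     --> blanks.

blanks --> [C], { C =< 32 }, !, blanks.   % space, tab, newline, ...
blanks --> [].

token(int(N))   --> digit(D), digits(Ds), { number_codes(N, [D|Ds]) }.
token(ident(A)) --> letter(C), alnums(Cs), { atom_codes(A, [C|Cs]) }.

digit(D) --> [D], { D >= 0'0, D =< 0'9 }.
digits([D|Ds]) --> digit(D), !, digits(Ds).
digits([]) --> [].

letter(C) --> [C], { ( C >= 0'a, C =< 0'z ; C >= 0'A, C =< 0'Z ) }.
alnums([C|Cs]) --> ( letter(C) ; digit(C) ), !, alnums(Cs).
alnums([]) --> [].

?- atom_codes('foo 42 bar', Cs), phrase(tokens(Ts), Cs).
Ts = [ident(foo), int(42), ident(bar)].

Get one cut or one clause order wrong and the lexer silently mis-tokenizes; that entire class of error is what a generator eliminates.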

A brief additional discussion on option (2) is in order. There is literature describing methods for writing predicates for syntactic analysis, and these methods will be considered in the next few posts. However, they are no different from describing how to write a procedure for lexical analysis in C or any other language. It is certainly doable, but restricting the domain of programs that can be written makes it much more feasible to analyze a program for errors and optimizations than when the program has the full power of a Turing machine at its disposal. A full Turing-complete programming language is not necessary to solve this problem, so giving the programmer that level of power means the programmer must also take responsibility for performing these analyses themselves. It adds layers of error-prone scaffolding between the problem domain and the solution / implementation that are not necessary and only serve to make things more difficult for the programmer. As Dr. Cooke said back at Texas Tech: “Everyone should write a lexer by hand once and only once. After that, you should always use a lexer generator!”

If we want Prolog to be taken seriously as a real platform for performing syntactic analysis, the only option to truly address these issues is to add syntax and semantics specifically for writing lexers in Prolog — which is exactly the task of writing a lexer generator!

[1] An almost verbatim echo of this sentence can be found in Logic Programming and Compiler Writing by David H. D. Warren.

Posted in Programming, Prolog.



And we’re back.

Welp, apparently my old blog got hacked and caused Google to flag my site for malware. That’s just fantastic.

Since I hadn’t updated the old blog in about 2 years, and since none of the posts were particularly interesting to anyone other than myself, I decided to just go ahead and remove the old blog and put this one up in its place. I’ll try to start updating it more regularly, especially to serve as a development log for the different projects I have going on. Hopefully I’ll be updating later tonight, after I get some more work done on one in particular.

Anyways, all 0 of you that have links to the old blog, update them to point here. (I’m leaving the old URL dead to help prevent the automated hacker bots from continuing to use my website.)

I chose “chaotic process” after I saw the term used in a model checking paper to indicate a process that can send any possible input to another process. This means that the state space for the entire transition system becomes infinite, as there are an infinite number of possible messages that can be sent from the chaotic process. Seemed like a cool name for a blog, so I went with it. And here it is!

As a quick update, I’m still going to grad school at the University of Texas at Dallas, still pursuing my Ph.D. in CS, and still working with Prolog. Work seems to be making progress (finally), though I still will have to finish my master’s work in my dissertation.

We shall see what lies ahead. :)

Posted in General Updates.
