Domain-specific languages – Initial impressions of ANTLR4.

I’ve always been into writing my own domain-specific languages — I firmly believe that used right, they are a massive win for developers. Your system is different from everyone else’s, so the only person able to write the perfect language for your problem is you. General-purpose programming languages look awfully crufty compared to an elegant, dedicated language.

The problem, of course, is that a dedicated language is something of a beast to write, at least if you start from scratch. I’ve been programming DSLs for a few years now, and I’ve even written the Canto toolkits, some open-source parsing toolkits for JavaScript and C#. But this kind of small-scale craftsman approach sits at the opposite end of the scale from the industrial-strength behemoth that is Antlr. Antlr is a toolkit some 25 years in the making, and I’ve been tracking its progress, on and off, since about version 2. In the past, I’ve had trouble with it; it imposed a strong idiom on the way you wrote, and it wasn’t one that gelled very well with my own way of approaching the problem. However, it’s just moved to v4, so I thought I’d give it another look.

The truth is, Antlr4 is good. Really good. Depressingly so, for someone who’s writing parsing toolkits. In fact, I love it. I love it more than my own projects.

The idea with Antlr is that you write a `grammar` for your language — describing the syntax that your language has. Here’s a snippet of the grammar to recognise JSON, from https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4

grammar JSON;

json: object | array ;

object : '{' pair (',' pair)* '}'
       | '{' '}' ;

pair: STRING ':' value ;

So if you know JSON, you should be able to read this without difficulty. The first rule says “a JSON document is either an object or an array”. The second implements this line from JSON.org: “An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).”

You write the grammar, and Antlr builds you code to recognise that language. It handles two main jobs for you. The first is lexing — splitting a file into discrete words, or tokens — and the second is parsing — recognising syntactic structures like statements, lists, loop declarations, etc. Previous versions did that, but not in a way I ever really grokked. Until Antlr4…
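To make those two jobs concrete, here is a minimal hand-written sketch in Python of what lexing and parsing mean for the `object` and `pair` rules above. It is not the code Antlr generates, and the names `TOKEN_SPEC`, `lex` and `parse_object` are made up for illustration. To keep it short, it assumes a value can only be a string and it does not decode escape sequences.

```python
import re

# Hypothetical token patterns for a tiny JSON subset (illustration only).
TOKEN_SPEC = [
    ("STRING", r'"(?:[^"\\]|\\.)*"'),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("COLON", r":"),
    ("COMMA", r","),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(text):
    """Lexing: split the input into (kind, text) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_object(tokens):
    """Parsing: recognise  object : '{' pair (',' pair)* '}' | '{' '}' ;
    In this sketch only, a value is restricted to a STRING."""
    pos = 0

    def expect(kind):
        # Consume one token of the given kind or fail with a syntax error.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        text = tokens[pos][1]
        pos += 1
        return text

    def pair():
        # pair : STRING ':' value ;
        name = expect("STRING")
        expect("COLON")
        value = expect("STRING")
        return (name[1:-1], value[1:-1])  # strip the quotes; escapes are left as-is

    pairs = []
    expect("LBRACE")
    if pos < len(tokens) and tokens[pos][0] != "RBRACE":
        pairs.append(pair())
        while pos < len(tokens) and tokens[pos][0] == "COMMA":
            expect("COMMA")
            pairs.append(pair())
    expect("RBRACE")
    if pos != len(tokens):
        raise SyntaxError("unexpected input after object")
    return pairs
```

Calling `parse_object(lex('{"a": "b"}'))` gives back a list of name/value pairs, and malformed input such as a missing colon raises a `SyntaxError`. Even for two rules this is a fair amount of hand-written plumbing, which is exactly the work a grammar-driven tool takes off your hands.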

So what makes Antlr4 so good?

First, the integration for Visual Studio is superb. There’s a VS extension which basically makes ANTLR part of C#; just do “Add | New Item… | ANTLR Combined Grammar” and you’ve started developing your language. Every time you build, it builds you a parser from the grammar. This is a big deal; no messing with the guts of .csproj files, or running hokey batch scripts. It’s better integrated than tools like T4, for example, which are actually part of Visual Studio.

Second, the structure of grammars has changed to make describing your language a distinct act from writing an application like an interpreter. Previously, you had to embed scraps of C# code throughout the grammar to make it do anything, which left something of a bad taste in my mouth. The embedded code was powerful, but got really ugly. By that I mean, you couldn’t look at an ANTLR grammar and say ‘Ah, I see clearly what the language looks like!’ There were bits and bats all over the grammar itself, obscuring the definition of the language. Now, Antlr takes the approach of providing post-parsing tools like listeners and visitors, which walk the parse tree once parsing is done and so help you separate the act of parsing from the act of interpreting. This is the way a good parser should be structured, so I’m really glad to see that Antlr allows that kind of structure.
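As a rough illustration of that separation, here is a small Python sketch of the Visitor pattern over a hand-built tree for a JSON-like object. It does not use Antlr’s generated classes; the node types (`StringNode`, `PairNode`, `ObjectNode`) and the two visitors (`ToDict`, `KeyCollector`) are hypothetical names invented for this example. The point is that the tree knows nothing about what will be done with it, and each interpretation lives in its own class.

```python
from dataclasses import dataclass
from typing import List


# --- The tree: pure structure, no interpretation logic (hypothetical node types) ---

@dataclass
class StringNode:
    text: str

    def accept(self, visitor):
        return visitor.visit_string(self)


@dataclass
class PairNode:
    name: str
    value: object  # a StringNode or an ObjectNode

    def accept(self, visitor):
        return visitor.visit_pair(self)


@dataclass
class ObjectNode:
    pairs: List[PairNode]

    def accept(self, visitor):
        return visitor.visit_object(self)


# --- Two independent interpretations of the same tree ---

class ToDict:
    """Turn the tree into plain Python data."""

    def visit_string(self, node):
        return node.text

    def visit_pair(self, node):
        return (node.name, node.value.accept(self))

    def visit_object(self, node):
        return dict(pair.accept(self) for pair in node.pairs)


class KeyCollector:
    """List every key, with nested keys written as dotted paths."""

    def visit_string(self, node):
        return []

    def visit_pair(self, node):
        nested = node.value.accept(self)
        return [node.name] + [f"{node.name}.{key}" for key in nested]

    def visit_object(self, node):
        keys = []
        for pair in node.pairs:
            keys.extend(pair.accept(self))
        return keys


# A tree built by hand here; with a real parser it would come from the parse step.
tree = ObjectNode([
    PairNode("tool", StringNode("antlr")),
    PairNode("target", ObjectNode([PairNode("language", StringNode("csharp"))])),
])

as_data = tree.accept(ToDict())
all_keys = tree.accept(KeyCollector())
```

Adding a third behaviour, say a pretty-printer, means writing one more visitor class and leaving both the tree and the grammar alone. That is the property that makes this structure easy for someone else to pick up and extend.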

Just those two things meant that I felt confident letting other devs in on the secret. It doesn’t take a language wonk to be able to extend a language or interpreter. I know this because my pair-programming partner at work, when I was called away for a meeting, extended the language and the interpreter I’d been working on with literally no reading on the subject; just looked at the code, saw what it was doing, and extended it, in the time it took me to attend a meeting. Part of that is that she’s just a very smart dev; the other part is that it’s an intuitive language to play with, and if you understand the Visitor pattern, and if regular expressions don’t make you cry, you’ll pick it up.

Anyway, I can’t recommend it enough. Go try it for yourself.

2 thoughts on “Domain-specific languages – Initial impressions of ANTLR4.”

Emilio Santos says:

February 16, 2015 at 8:59 pm

Hey, thanks for all the posts! I was hoping you could help me out. I’m trying to use ANTLR dynamically (i.e. without the VS extension) from a website written in C#. My problem is with the Java dependencies; do you know of an easy way to use ANTLR this way?

I develop Excess (http://xslang.azurewebsites.net/), where users can write extensions to C# from their browsers. It would be an awesome feature if those extensions could carry their own grammars. I have very little trouble making all that happen once I get the parser files, but not being a web dev by trade I’m a little wary of getting Java to work on Azure, etc…

Could you help me out with that?

- Steve Cooper says:
  
  February 17, 2015 at 12:18 am
  
  Excess looks really interesting! Wish I had more time to look at Roslyn…
  
  As to your question, the way I understand it is that the Antlr tool itself only has a Java implementation. So to translate .g4 grammar files to .cs, you need Java.
  
  However, it can output a lexer and parser in C#, and those have no dependency on Java.
  
  So if you want the Excess tool itself to include a custom language (a DSL for defining DSLs), you can develop that: your dev machine would have Java and you would ship a pure C# solution to Azure. You just need the Antlr4 and Antlr4.Runtime NuGet packages, if memory serves; those can be happily shipped to an Azure website or worker role or whatever. This won’t work if you don’t control the build machine, say, if you have Visual Studio Online doing your builds.
  
  If you want users to post Antlr grammars to your site, and you process each grammar as it hits the server, then you may be out of luck. At the least, you’ll probably need to complicate your Azure deployment to include a full-blown cloud service with at least one VM which can have Java installed.
  
  There is every chance I have misunderstood the problem, though; it seems you want to introduce a compiler (Antlr) so that your compiler (Excess) can compile other people’s compilers (DSLs), as a way to modify the C# compiler (Roslyn) and thereby make a compiler for a new dialect of C#.

Steve-Driven Development
