Introducing Features for Superlanguages

Modern programming languages increasingly contain features that allow the host language system to be extended by the programmer. The requirements for extensibility stem from a desire to allow problem and solution domain features to be represented directly or declaratively in the program and for the execution engine of the host language to directly execute new features, thereby having knowledge of the language extensions. Features such as meta-programming in Smalltalk, the CLOS MOP, reflection in LISP and Java, Java Annotations, C++ Templates etc., in addition to features of more specialized languages, all offer various ways of extensibility. However these languages are complex, and extensibility features are often incomplete. We will use the term superlanguage to describe a programming system that can be extended. This article is the first of a series that describes a simple core language that captures the essential features of being a superlanguage and shows examples of how it can be used to build languages from modular units.

A superlanguage offers the following features:

o Concrete syntax processing. Such a feature must allow new language constructs to be defined.

o First-class language modules. Language definitions should be data values, subject to normal scoping rules, parameteric etc.

o Meta-Programming. Arbitrary computations can be performed when processing syntax in order to process declarative language constructs.

o Access to the abstract syntax ADT of host programming language that can be used to construct and transform arbitrary programs.

o Control over the scope of languages so that, once defined, a new language feature can be placed in scope for a given extent of program text.

o The ability to construct recursive definitions including those that involve grammars and language constructs.

The language being introduced in this article is called XPL. It is a small, simple and universal superlanguage. In XPL, grammars are first class data values. An XPL program consists of top-level definitions supplied as name-value pairs. The following shows a top-level definition of a grammar names arith:

arith = {

start -> a=atom tail^(a);

atom -> int | '(' a=arith ')' { a };

int -> n=(['0','9'])+ { asInt(n) };

tail(left) -> o=op right=start {

case o {

'*' -> left * right;

'/' -> left / right

};

tail(left) -> { left };

op -> ('/' | '*')

}

The grammar consists of a sequence of rules. Each rule has a name, an arrow (->), and a body. The body defines how the concrete syntax is to be processed and transformed into a value. The definition is repeated below with comments:

arith = {

// The start rule calls the rule named ‘atom’

// calls the resulting value ‘a’ and then calls

// the rule named ‘tail’ passing the value of a…

start -> a=atom tail^(a);

// An atom is either an int or an arith in parentheses.

// Notice that ‘arith’ is a recursive reference to

// the grammar. When a grammar is used, the starting

// rule is called (the first rule defined in the grammar).

// Rule actions in ‘{‘ and ‘}’ synthesize values and

// can reference any of the vriables bound in the rule…

atom -> int | '(' a=arith ')' { a };

// An integer is a non-empty sequence of integer chars.

// The use of ‘+’ produces a sequence of values named

// ‘n’ that is passed to the function ‘asInt’ that

// transforms a sequence of integers in the range ‘0’ – ‘9’

// into the corresponding integer…

int -> n=(['0','9'])+ { asInt(n) };

// The rule named ‘tail’ takes an argument named ‘left’

// it processes an operator and a right hand expression

// The action performs case analysis on the operator

// in order to perform the correct arithmetic expression…

tail(left) -> o=op right=arith {

case o {

'*' -> left * right;

'/' -> left / right

};

// The tail rule has two alternatives, here is the second

// that is used if no operator is found…

tail(left) -> { left };

// An operator can be a divide or multiply symbol…

op -> ('/' | '*')

}

The grammar can then be used to process a string written in the language:

arith.parse(’10 * 20’)

=> 200

This type of language embedding is supported by a few language systems (for example SQL in Java). However, the host and the embedded languages are not really integrated properly since there is a complete phase shift when the string is interpreted by the grammar. Furthermore, the syntax and semantic processing of the language is merged together into the grammar which makes the language definition difficult to reuse and extend. The next post will look at separating out the syntax and semantics of the language into a language module.