Modern programming languages increasingly contain features that allow the host language system to be extended by the programmer. The requirements for extensibility stem from a desire to allow problem and solution domain features to be represented directly or declaratively in the program and for the execution engine of the host language to directly execute new features, thereby having knowledge of the language extensions. Features such as meta-programming in Smalltalk, the CLOS MOP, reflection in LISP and Java, Java Annotations, C++ Templates etc., in addition to features of more specialized languages, all offer various ways of extensibility. However these languages are complex, and extensibility features are often incomplete. We will use the term superlanguage to describe a programming system that can be extended. This article is the first of a series that describes a simple core language that captures the essential features of being a superlanguage and shows examples of how it can be used to build languages from modular units.
A superlanguage offers the following features:
o Concrete syntax processing. Such a feature must allow new language constructs to be defined.
o First-class language modules. Language definitions should be data values, subject to normal scoping rules, parameteric etc.
o Meta-Programming. Arbitrary computations can be performed when processing syntax in order to process declarative language constructs.
o Access to the abstract syntax ADT of host programming language that can be used to construct and transform arbitrary programs.
o Control over the scope of languages so that, once defined, a new language feature can be placed in scope for a given extent of program text.
o The ability to construct recursive definitions including those that involve grammars and language constructs.
The language being introduced in this article is called XPL. It is a small, simple and universal superlanguage. In XPL, grammars are first class data values. An XPL program consists of top-level definitions supplied as name-value pairs. The following shows a top-level definition of a grammar names arith:
arith = {
start
-> a=atom
tail^(a);
atom
-> int
| '(' a=arith ')' { a };
int
-> n=(['0','9'])+ { asInt(n) };
tail(left)
-> o=op right=start {
case
o {
'*' -> left * right;
'/' -> left / right
};
tail(left)
-> { left };
op
-> ('/' | '*')
}
The grammar consists of a sequence of rules. Each rule has a name, an arrow (->), and a body. The body defines how the concrete syntax is to be processed and transformed into a value. The definition is repeated below with comments:
arith = {
// The start rule calls the rule named
ÔatomÕ
// calls the
resulting value ÔaÕ and then calls
// the rule
named ÔtailÕ passing the value of aÉ
start
-> a=atom
tail^(a);
// An atom is either an int or an arith
in parentheses.
// Notice that ÔarithÕ is a recursive
reference to
// the
grammar. When a grammar is used, the starting
// rule is
called (the first rule defined in the grammar).
// Rule actions in Ô{Ô and Ô}Õ
synthesize values and
// can
reference any of the vriables bound in the ruleÉ
atom
-> int
| '(' a=arith ')' { a };
// An integer is a non-empty sequence
of integer chars.
// The use of Ô+Õ produces a sequence
of values named
// ÔnÕ that is
passed to the function ÔasIntÕ that
// transforms
a sequence of integers in the range Ô0Õ Ð Ô9Õ
// into the
corresponding integerÉ
int
->
n=(['0','9'])+ { asInt(n)
};
// The rule named ÔtailÕ takes an
argument named ÔleftÕ
// it
processes an operator and a right hand expression
// The action performs case analysis on
the operator
// in order to
perform the correct arithmetic expressionÉ
tail(left)
-> o=op right=arith {
case
o {
'*' -> left * right;
'/' -> left / right
};
// The tail rule has two alternatives,
here is the second
// that is
used if no operator is foundÉ
tail(left)
-> { left };
// An operator can be a divide or
multiply symbolÉ
op
-> ('/' | '*')
}
The
grammar can then be used to process a string written in the
language:
arith.parse(Õ10
* 20Õ)
=> 200
This type of language embedding is supported by a few language systems (for example SQL in Java). However, the host and the embedded languages are not really integrated properly since there is a complete phase shift when the string is interpreted by the grammar. Furthermore, the syntax and semantic processing of the language is merged together into the grammar which makes the language definition difficult to reuse and extend. The next post will look at separating out the syntax and semantics of the language into a language module.