Creating a scripting language: establishing structure

In the previous article, a lexing ruleset was established, but lexing by itself does nothing to describe the actual structure of a language. It’s more like outlining its phonology, closed lexical categories, and maybe its basic morphology, but it does nothing to express how these things relate to each other; that’s the role of the language’s syntax, which is what will be described here. (In most compsci literature, this is all simply called the grammar, and I’ll be using the terms interchangeably.)

Before getting started with the implementation stuff, here’s the complete grammar specification, in BNF:
[code]
empty :

nodelist : nodelist node
| nodelist function
| empty

node : IDENTIFIER_LOCAL LCURLY expressionlist RCURLY

function : IDENTIFIER_LOCAL LPAREN parameterset RPAREN LCURLY expressionlist RCURLY

parameterset : IDENTIFIER_LOCAL COMMA parameterset
| IDENTIFIER_LOCAL
| empty

argumentset : IDENTIFIER_LOCAL ASSIGN expression COMMA argumentset
| IDENTIFIER_LOCAL ASSIGN expression
| empty

expressionlist : expressionlist conditional
| expressionlist while
| expressionlist assignment SEMICOLON
| expressionlist statement SEMICOLON
| expressionlist expression SEMICOLON
| empty

conditional : IF LPAREN expression RPAREN LCURLY expressionlist RCURLY conditionalalternate conditionalterminal

conditionalalternate : conditionalalternate ELIF LPAREN expression RPAREN LCURLY expressionlist RCURLY
| empty

conditionalterminal : ELSE LCURLY expressionlist RCURLY
| empty

while : WHILE LPAREN expression RPAREN LCURLY expressionlist RCURLY

assignment : IDENTIFIER_LOCAL ASSIGN expression
| IDENTIFIER_LOCAL ASSIGN_ADD expression
| IDENTIFIER_LOCAL ASSIGN_SUBTRACT expression
| IDENTIFIER_LOCAL ASSIGN_MULTIPLY expression
| IDENTIFIER_LOCAL ASSIGN_DIVIDE expression
| IDENTIFIER_LOCAL ASSIGN_DIVIDE_INTEGER expression
| IDENTIFIER_LOCAL ASSIGN_MOD expression
| sequence ASSIGN expression

statement : STMT_GOTO IDENTIFIER_LOCAL
| STMT_RETURN expression
| STMT_RETURN
| STMT_EXIT expression
| STMT_EXIT
| STMT_BREAK
| STMT_CONTINUE

expression : LPAREN expression RPAREN
| sequence
| functioncall
| term
| expression MULTIPLY expression
| expression DIVIDE_INTEGER expression
| expression DIVIDE expression
| expression SUBTRACT expression
| expression ADD expression
| expression MOD expression
| expression AND expression
| expression OR expression
| expression NAND expression
| expression NOR expression
| expression XOR expression
| expression EQUALITY expression
| expression INEQUALITY expression
| expression GREATER_EQUAL expression
| expression GREATER expression
| expression LESSER_EQUAL expression
| expression LESSER expression

sequence : LSQUARE expressionsequence RSQUARE

expressionsequence : expression COMMA expressionsequence
| expression
| empty

functioncall : IDENTIFIER_SCOPED LPAREN argumentset RPAREN
| IDENTIFIER_LOCAL LPAREN argumentset RPAREN
| UNDEFINED LPAREN IDENTIFIER_LOCAL RPAREN

term : STRING
| INTEGER
| FLOAT
| NONE
| TRUE
| FALSE
| IDENTIFIER_SCOPED
| IDENTIFIER_LOCAL
[/code]
As you can see, it’s pretty simple, with only a few top-level elements and rather straightforward recursive properties. The purpose of the language is highly specialised, after all, and limiting its scope will help keep it from being abused by any novice developers who might be exposed to the practical implementation without mentorship.

All code will be presented in sections, as was done before, with the full file being available on Launchpad.

I don’t like using magic numbers or hardcoded tokens any more than necessary, so the first thing I did was set up named constants for everything that has even a slight chance of changing or needing to be accessed externally:
[python]
NODE = 0
FUNCTION = 1
STMT_GOTO = 10
STMT_RETURN = 11
STMT_EXIT = 12
STMT_BREAK = 13
STMT_CONTINUE = 14
COND_IF = 20
COND_ELIF = 21
COND_ELSE = 22
COND_WHILE = 23
TERM_IDENTIFIER_LOCAL = 30
TERM_IDENTIFIER_SCOPED = 31
TERM_NONE = 32
TERM_BOOL = 33
TERM_STRING = 34
TERM_INTEGER = 35
TERM_FLOAT = 36
SEQUENCE = 40
TEST_EQUALITY = 50
TEST_INEQUALITY = 51
TEST_GREATER_EQUAL = 52
TEST_GREATER = 53
TEST_LESSER_EQUAL = 54
TEST_LESSER = 55
MATH_MULTIPLY = 70
MATH_DIVIDE = 71
MATH_DIVIDE_INTEGER = 72
MATH_ADD = 73
MATH_SUBTRACT = 74
MATH_AND = 75
MATH_OR = 76
MATH_NAND = 77
MATH_NOR = 78
MATH_XOR = 79
MATH_MOD = 1070
FUNCTIONCALL_LOCAL = 80
FUNCTIONCALL_SCOPED = 81
FUNCTIONCALL_UNDEFINED = 82
ASSIGN = 90
ASSIGN_ADD = 91
ASSIGN_SUBTRACT = 92
ASSIGN_MULTIPLY = 93
ASSIGN_DIVIDE = 94
ASSIGN_DIVIDE_INTEGER = 95
ASSIGN_SEQUENCE = 96
ASSIGN_MOD = 97
[/python]
You’ll notice that MATH_MOD has a value of 1070, since I ran out of space in the math range (and I didn’t want to re-generate my unit-tests by expanding the range into 8x and shifting everything else upwards). Everything else is just an arbitrary token-type-to-integer mapping.

If you’re following along closely, you’ll have noticed that there’s some ambiguity in my grammar. That’s handled with another PLY-specific structure, the precedence tuple:
[python]
precedence = (
    ('right',
        'EQUALITY', 'INEQUALITY', 'GREATER_EQUAL', 'GREATER', 'LESSER_EQUAL', 'LESSER',
        'AND', 'OR', 'NAND', 'NOR', 'XOR',
    ),
    ('left', 'ADD', 'SUBTRACT',),
    ('left', 'MULTIPLY', 'DIVIDE', 'DIVIDE_INTEGER', 'MOD'),
)
[/python]
All of the ambiguity is math-related, for the same reason that the term “BEDMAS” probably sounds familiar: brackets, then exponents, then division and multiplication, then addition and subtraction, in resolution order. Note that PLY assigns the lowest precedence to the entries listed first, which is why the comparison and logic operators sit at the top of the tuple.
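To make the effect of the precedence rules concrete, here's a sketch (plain Python, not PLY) of the AST tuples the parser would build for `2 + 3 * 4`. The constants mirror the ones defined earlier in the article; the toy `evaluate` function is my own illustration, not part of the implementation. Because MULTIPLY binds more tightly than ADD, the multiplication reduces first and ends up nested inside the addition node:

```python
# Constants mirrored from the article's definitions.
MATH_ADD = 73
MATH_MULTIPLY = 70
TERM_INTEGER = 35

# "2 + 3 * 4": the multiplication reduces first, so it becomes
# the right-hand child of the ADD node.
ast = (MATH_ADD,
       (TERM_INTEGER, 2),
       (MATH_MULTIPLY, (TERM_INTEGER, 3), (TERM_INTEGER, 4)))

def evaluate(node):
    """Toy evaluator for the three node types used in this example."""
    kind = node[0]
    if kind == TERM_INTEGER:
        return node[1]
    if kind == MATH_ADD:
        return evaluate(node[1]) + evaluate(node[2])
    if kind == MATH_MULTIPLY:
        return evaluate(node[1]) * evaluate(node[2])
    raise ValueError("unknown node type: %r" % kind)
```

Walking that tree yields 14, not 20, exactly as the precedence tuple dictates.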

The next step is to tell PLY which rule to use as the entry-point into the grammar:
[python]
start = 'nodelist'
[/python]
This means that the nodelist rule has to be matched for the input to be properly parsed.

From this point, the order in which the rules are defined is arbitrary. I chose to present them in the same form as the grammar itself, but that’s just preference.

The first rule is the empty production:
[python]
def p_empty(p):
    r"""
    empty :
    """
[/python]
This function implicitly returns None, meaning that PLY does nothing when it's matched. It simply makes it possible to leave a void in any other production by referencing the empty rule, which is prettier than just having a hole in the grammar.

Next is the nodelist rule:
[python]
def p_nodeset(p):
    r"""
    nodelist : nodelist node
             | nodelist function
             | empty
    """
    if p[1] is None:
        p[0] = []
    else:
        p[0] = p[1] + [p[2]]
[/python]
Every script is, by logical definition, a collection of nodes, where a node is either a node-stanza or a function-stanza. There may be any number of either type, in any order. This rule recursively builds a flat list of every node in the input script, consuming one node per reduction until it encounters empty, then unwinding. Left recursion is also the natural choice for an LALR parser like PLY's: the parser can reduce as it goes, keeping its stack shallow, whereas a right-recursive rule would have to shift every node before reducing any of them; in my tests, though, the difference was so small that it really just came down to which looked better.
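The unwinding can be simulated outside PLY: each reduction of `nodelist : nodelist node` is just a list append, seeded by the empty production's None. A minimal sketch (the function name is mine, mirroring the body of p_nodeset):

```python
def reduce_nodelist(p1, p2=None):
    # Mirrors p_nodeset: the empty production yields None, which seeds
    # the list; every later reduction appends one node to the tail.
    if p1 is None:
        return []
    return p1 + [p2]

# Simulate parsing three nodes: the empty seed, then three
# left-recursive reduction steps.
result = reduce_nodelist(None)
for node in ('a', 'b', 'c'):
    result = reduce_nodelist(result, node)
```

After the final reduction, `result` is the flat list `['a', 'b', 'c']`, in source order.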

Nodes and functions are what make up a nodelist, so they logically come next; they're also the first interesting productions:
[python]
def p_node(p):
    r"""
    node : IDENTIFIER_LOCAL LCURLY expressionlist RCURLY
    """
    p[0] = (NODE, p[1], p[3])

def p_function(p):
    r"""
    function : IDENTIFIER_LOCAL LPAREN parameterset RPAREN LCURLY expressionlist RCURLY
    """
    p[0] = (FUNCTION, p[1], p[3], p[6])
[/python]
Each one returns a tuple, identifying its type in the first value, using one of the constants defined before. The syntax is quite C-like, since it's both easier to implement, and it's more likely to be familiar to those who will be working with it. hello{} is a valid node, and goodbye(){} is a valid function. Note that the function takes a parameterset. This is important and it will be described next.
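Given those reductions, the two minimal examples above would produce tuples like the following (a sketch using the constants defined earlier; the empty bodies reduce to empty lists, and the empty parameterset to an empty set):

```python
# Constants mirrored from the article's definitions.
NODE = 0
FUNCTION = 1

# hello{} -- a node with an empty expressionlist
hello = (NODE, 'hello', [])

# goodbye(){} -- a function with an empty parameterset and an empty body
goodbye = (FUNCTION, 'goodbye', set(), [])
```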

[python]
def p_parameterset(p):
    r"""
    parameterset : IDENTIFIER_LOCAL COMMA parameterset
                 | IDENTIFIER_LOCAL
                 | empty
    """
    if p[1] is None:
        p[0] = set()
    else:
        p[0] = set((p[1],))
        if len(p) == 4:
            p[0].update(p[3])

def p_argumentset(p):
    r"""
    argumentset : IDENTIFIER_LOCAL ASSIGN expression COMMA argumentset
                | IDENTIFIER_LOCAL ASSIGN expression
                | empty
    """
    if p[1] is None:
        p[0] = {}
    else:
        p[0] = {p[1]: p[3]}
        if len(p) == 6:
            p[0].update(p[5])
[/python]
What's important here is that neither of these is an ordered structure: parameterset reduces to a set and argumentset to a dict, not lists. In this language, the order in which parameters are specified is inconsequential, which is good: it means that anyone developing an API is free to change the order at any time, and it makes it far easier to set or change default values and to deprecate or add references, without impacting existing code.

Every parameter is specified as an un-dotted identifier, and every argument takes the form of x=y, borrowing Python's kwargs notation, since it makes it a lot easier for someone to infer meaning from code, without needing to consult a reference manual. A trailing comma may be optionally left at the end of each set (personal preference).
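Because argumentset reduces to a dict, the same call written with its arguments in any order produces an identical structure. A sketch (the helper name is mine, mirroring the body of p_argumentset):

```python
def reduce_argumentset(name=None, value=None, rest=None):
    # Mirrors p_argumentset: empty yields {}, otherwise one name/value
    # pair, optionally merged with the already-reduced tail.
    if name is None:
        return {}
    result = {name: value}
    if rest is not None:
        result.update(rest)
    return result

# x=1, y=2  versus  y=2, x=1 -- both reduce to the same dict.
forward = reduce_argumentset('x', 1, reduce_argumentset('y', 2))
backward = reduce_argumentset('y', 2, reduce_argumentset('x', 1))
```

Both orderings hand the interpreter exactly the same mapping, which is what makes argument reordering a non-breaking API change.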

expressionlists are next, since they're the main logic structure in the language. Put simply, they're a list of directives that are evaluated one at a time, in sequential order, just like statements in an imperative language (Java/C#) or expressions in an expression-oriented language (Python). (This language can handle recursion, but its facilities are weak, due to a design decision that minimises the relevance of stacks, so don't worry about contrasting it with functional languages.)
[python]
def p_expressionlist(p):
    r"""
    expressionlist : expressionlist conditional
                   | expressionlist while
                   | expressionlist assignment SEMICOLON
                   | expressionlist statement SEMICOLON
                   | expressionlist expression SEMICOLON
                   | empty
    """
    if p[1] is None:
        p[0] = []
    else:
        p[0] = p[1] + [p[2]]
[/python]
An expressionlist may contain any number of conditionals, whiles, assignments, statements, or expressions, each of which, with the exception of conditionals and whiles, must be delimited with a semicolon. The decision to require a semicolon was arbitrary (it could just as easily have used linebreaks, like Python, or end-of-declarative inference, like T-SQL); requiring a semicolon just seemed like the right thing to do, considering how C-like the rest of the syntax is.

Conditionals are the language's only predictable branching control structure; as with most other languages, it uses an if/elif/else system. (Note: elif, nothing else; there's no real consensus on which keyword is most standard, so elif was chosen because it's prettier and easier to type.)
[python]
def p_conditional(p):
    r"""
    conditional : IF LPAREN expression RPAREN LCURLY expressionlist RCURLY conditionalalternate conditionalterminal
    """
    p[0] = [COND_IF, (p[3], p[6])] + p[8] + p[9]

def p_conditionalalternate(p):
    r"""
    conditionalalternate : conditionalalternate ELIF LPAREN expression RPAREN LCURLY expressionlist RCURLY
                         | empty
    """
    if p[1] is None:
        p[0] = []
    else:
        p[0] = p[1] + [(COND_ELIF, p[4], p[7])]

def p_conditionalterminal(p):
    r"""
    conditionalterminal : ELSE LCURLY expressionlist RCURLY
                        | empty
    """
    if p[1] is None:
        p[0] = []
    else:
        p[0] = [(COND_ELSE, p[3])]
[/python]
Looking at the rules you can see that things like the following are valid, as you would expect from a C-like language:

[code]
if(x == 5){
}elif(a == b){
}
elif ( 1 == 2 )
{
}else{
}
[/code]

My preference is for 1TBS, though Allman and other variants are supported. Something like

[code]
if(x == x)
    something();
[/code]

is expressly disallowed, however, since it just leads to accidents.
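At the AST level, the three reductions above combine into one flat list: the if pair first, then zero or more elif tuples, then at most one else tuple. For `if(a){}elif(b){}else{}`, the result would look like this (constants from earlier; the condition expressions are abbreviated to plain strings for the sketch):

```python
# Constants mirrored from the article's definitions.
COND_IF = 20
COND_ELIF = 21
COND_ELSE = 22

# if(a){ } elif(b){ } else{ }
conditional = ([COND_IF, ('a', [])]      # p[0] = [COND_IF, (p[3], p[6])]
               + [(COND_ELIF, 'b', [])]  # ... + p[8] (the alternates)
               + [(COND_ELSE, [])])      # ... + p[9] (the terminal)
```

The interpreter can then walk this list front-to-back, executing the body of the first branch whose condition holds.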

Whiles are the language's looping structure; nothing's wildly different from any other language:
[python]
def p_while(p):
    r"""
    while : WHILE LPAREN expression RPAREN LCURLY expressionlist RCURLY
    """
    p[0] = (COND_WHILE, p[3], p[6])
[/python]
It's important to note that this language will emphasise Java's archaic .next()/.prev() pattern for stepping through a collection of values on an object, rather than the more modern iterator pattern embraced by Python (and, more recently, by Java and C#). This may seem like a backwards decision, but there's a rationale: by being intentionally cumbersome, it discourages writing systems that rely heavily on looping over data at the script level (which should be executing, not processing) and on fetch-on-demand access, versus having all needed data readily available. It'll still be possible to do these things, but it'll require the implementor to actually stop and think about whether there's an easier (better) way.

At this point, there's no logical order for introductions, so assignments will just come next:
[python]
def p_assignment(p):
    r"""
    assignment : IDENTIFIER_LOCAL ASSIGN expression
    """
    p[0] = (ASSIGN, p[1], p[3])

def p_assignment_augmented(p):
    r"""
    assignment : IDENTIFIER_LOCAL ASSIGN_ADD expression
               | IDENTIFIER_LOCAL ASSIGN_SUBTRACT expression
               | IDENTIFIER_LOCAL ASSIGN_MULTIPLY expression
               | IDENTIFIER_LOCAL ASSIGN_DIVIDE expression
               | IDENTIFIER_LOCAL ASSIGN_DIVIDE_INTEGER expression
               | IDENTIFIER_LOCAL ASSIGN_MOD expression
    """
    if p[2] == '+=':
        p[0] = (ASSIGN_ADD, p[1], p[3])
    elif p[2] == '-=':
        p[0] = (ASSIGN_SUBTRACT, p[1], p[3])
    elif p[2] == '*=':
        p[0] = (ASSIGN_MULTIPLY, p[1], p[3])
    elif p[2] == '/=':
        p[0] = (ASSIGN_DIVIDE, p[1], p[3])
    elif p[2] == '\\=':
        p[0] = (ASSIGN_DIVIDE_INTEGER, p[1], p[3])
    elif p[2] == '%=':
        p[0] = (ASSIGN_MOD, p[1], p[3])

def p_assignment_sequence(p):
    r"""
    assignment : sequence ASSIGN expression
    """
    p[0] = (ASSIGN_SEQUENCE, p[1], p[3])
[/python]
This language supports classic assignments of the form x = 5, but it also supports augmented assignments, like x += 5. (The addition syntax also works for strings, as a simple form of concatenation, but no other augmented operators are supported for strings, though that can only be enforced at the semantic level.) It also supports something derived from Python: sequence assignments. These allow things like [a, b] = [b, a] to swap the values assigned to two variables, and [a, b, c] = data.something_that_returns_a_triple() to store three values from a single function-call (the return). The semantics surrounding this behaviour derive from PHP's list-packing functions: if the receiving sequence is too small for all the data, the extra values are simply discarded; if it's too large, the extra slots are assigned None.
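The PHP-inspired packing semantics can be sketched as a plain Python function (a hypothetical helper; the real behaviour belongs to the interpreter, described in a later article): extra values are discarded, and missing values become None.

```python
def assign_sequence(targets, values, namespace):
    """Bind values to target names in order: truncate extras, pad with None."""
    for i, name in enumerate(targets):
        namespace[name] = values[i] if i < len(values) else None

scope = {}
assign_sequence(['a', 'b'], [1, 2, 3], scope)   # the extra 3 is discarded
assign_sequence(['x', 'y', 'z'], [9], scope)    # y and z are padded with None
```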

All assignments and variable-access will occur within a flat global namespace, with the exception of variables declared as parameters to a function; those will always refer to the given values, which are passed by value, and which are discarded when the function ends, allowing for local assignment for recursion purposes, in the rare cases where it might be needed. The variable namespace is separate from the node and function namespaces, so it's legal to assign to a variable named "hello" if there is a node named "hello", but this should probably be avoided, to minimize the possibility of confusion.
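That lookup model, with function parameters shadowing a flat global namespace for the duration of a call, can be sketched with Python's collections.ChainMap. This is purely my illustration of the scoping rules, not the article's actual interpreter:

```python
from collections import ChainMap

globals_ns = {'hello': 'world', 'count': 1}
params = {'count': 99}  # passed by value, alive for one function call

scope = ChainMap(params, globals_ns)
# Inside the function, the parameter shadows the global of the same name...
inside = scope['count']
# ...and once the call ends, the parameter layer is discarded, exposing
# the untouched global again.
after = scope.parents['count']
```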

It's also worth noting that the names of these functions don't matter, so long as they're prefixed with p_. The order in which the rules appear is significant, however, since PLY assembles the grammar from the docstrings in the order they're encountered.

Statements are next:
[python]
def p_statement_goto(p):
    r"""
    statement : STMT_GOTO IDENTIFIER_LOCAL
    """
    p[0] = (STMT_GOTO, p[2])

def p_statement_return(p):
    r"""
    statement : STMT_RETURN expression
              | STMT_RETURN
    """
    if len(p) == 2:
        p[0] = (STMT_RETURN, (TERM_NONE, None))
    else:
        p[0] = (STMT_RETURN, p[2])

def p_statement_exit(p):
    r"""
    statement : STMT_EXIT expression
              | STMT_EXIT
    """
    if len(p) == 2:
        p[0] = (STMT_EXIT, (TERM_STRING, ''))
    elif len(p) == 3:
        p[0] = (STMT_EXIT, p[2])

def p_statement_break(p):
    r"""
    statement : STMT_BREAK
    """
    p[0] = (STMT_BREAK,)

def p_statement_continue(p):
    r"""
    statement : STMT_CONTINUE
    """
    p[0] = (STMT_CONTINUE,)
[/python]

  • goto requires a local identifier, which must be the name of a node. It immediately terminates execution in the current node and begins execution in the target.
  • return may take an expression, returning it to the caller; if no expression is given, None is inserted by default. Used in the context of a node, rather than a function, return is functionally equivalent to exit, but it's bad practice.
  • exit may take an expression, which will be converted into a string representation (None maps to the empty string, which is inserted if no expression is provided) and returned to the interpreter's caller, providing a means of allowing an external observer to acquire a record of the events processed. Sequences are converted recursively, with \0 inserted between each element as a delimiter; \1 is used to mark the start and end of each sequence nested inside the top level, making both of these values reserved as return code members.
  • break exits a loop immediately; used outside of a loop, it has no effect.
  • continue ignores all expressions remaining in the current loop iteration; used outside of a loop, it has no effect.
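The exit conversion described above can be sketched recursively (a hypothetical helper; the real one belongs to the interpreter): \0 delimits sibling elements, and \1 brackets any sequence nested inside the top level.

```python
def serialise_exit(value):
    """Render an exit-value as a string: None becomes the empty string,
    sequences are joined with \0, and nested sequences are wrapped in
    \1 markers at both ends."""
    if value is None:
        return ''
    if isinstance(value, list):
        return '\0'.join(_serialise_element(v) for v in value)
    return str(value)

def _serialise_element(value):
    # Nested sequences are bracketed with \1 so the receiver can
    # reconstruct the structure; scalars pass straight through.
    if isinstance(value, list):
        return '\1' + serialise_exit(value) + '\1'
    return serialise_exit(value)

rendered = serialise_exit(['a', ['b', 'c'], 'd'])
```

For `['a', ['b', 'c'], 'd']`, this yields `a\0\1b\0c\1\0d`, which is why \0 and \1 are reserved values in return codes.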

And now expressions, which are at once the most numerous, most significant, and most obvious part of the language:
[python]
def p_expression(p):
    r"""
    expression : LPAREN expression RPAREN
               | sequence
               | functioncall
               | term
    """
    if len(p) == 4:
        p[0] = p[2]
    else:
        p[0] = p[1]
[/python]
An expression may be wrapped in arbitrarily many nested parentheses, so long as they're balanced, both to make the code more readable and to enforce ordering when dealing with arithmetic.

Once unwrapped, each expression may be an instance of a more complicated expression-type, a sequence (which is itself an ordered list of expressions), a functioncall, as found in most other languages (no disrespect to the purity of LISP and its descendants intended), or a term, which is pretty much any token that hasn't yet been addressed; all of these are described below.

More complex expressions are themselves defined in terms of relationships between expressions:
[python]
def p_expression_math(p):
    r"""
    expression : expression MULTIPLY expression
               | expression DIVIDE_INTEGER expression
               | expression DIVIDE expression
               | expression SUBTRACT expression
               | expression ADD expression
               | expression MOD expression
               | expression AND expression
               | expression OR expression
               | expression NAND expression
               | expression NOR expression
               | expression XOR expression
    """
    if p[2] == '+':
        p[0] = (MATH_ADD, p[1], p[3])
    elif p[2] == '-':
        p[0] = (MATH_SUBTRACT, p[1], p[3])
    elif p[2] == '*':
        p[0] = (MATH_MULTIPLY, p[1], p[3])
    elif p[2] == '/':
        p[0] = (MATH_DIVIDE, p[1], p[3])
    elif p[2] == '\\':
        p[0] = (MATH_DIVIDE_INTEGER, p[1], p[3])
    elif p[2] == '%':
        p[0] = (MATH_MOD, p[1], p[3])
    elif p[2] == 'and':
        p[0] = (MATH_AND, p[1], p[3])
    elif p[2] == 'or':
        p[0] = (MATH_OR, p[1], p[3])
    elif p[2] == 'nand':
        p[0] = (MATH_NAND, p[1], p[3])
    elif p[2] == 'nor':
        p[0] = (MATH_NOR, p[1], p[3])
    elif p[2] == 'xor':
        p[0] = (MATH_XOR, p[1], p[3])

def p_expression_test(p):
    r"""
    expression : expression EQUALITY expression
               | expression INEQUALITY expression
               | expression GREATER_EQUAL expression
               | expression GREATER expression
               | expression LESSER_EQUAL expression
               | expression LESSER expression
    """
    if p[2] == '==':
        p[0] = (TEST_EQUALITY, p[1], p[3])
    elif p[2] == '!=':
        p[0] = (TEST_INEQUALITY, p[1], p[3])
    elif p[2] == '>=':
        p[0] = (TEST_GREATER_EQUAL, p[1], p[3])
    elif p[2] == '>':
        p[0] = (TEST_GREATER, p[1], p[3])
    elif p[2] == '<=':
        p[0] = (TEST_LESSER_EQUAL, p[1], p[3])
    elif p[2] == '<':
        p[0] = (TEST_LESSER, p[1], p[3])
[/python]
expressions cover the full gamut of math and equivalence tests, which shouldn't require any further explanation.

A sequence, as mentioned, is just an ordered list of expressions:
[python]
def p_sequence(p):
    r"""
    sequence : LSQUARE expressionsequence RSQUARE
    """
    p[0] = (SEQUENCE, p[2])

def p_expressionsequence(p):
    r"""
    expressionsequence : expression COMMA expressionsequence
                       | expression
                       | empty
    """
    if p[1] is None:
        p[0] = []
    else:
        p[0] = [p[1]]
        if len(p) == 4:
            p[0] += p[3]
[/python]
They start with a square bracket, end with a square bracket, and may have an optional trailing comma; they may also be empty, if the situation warrants it (like a scoped functioncall that takes a list of values to process). There's a key semantic imperative to their use which can't be adequately captured by the grammar (the term-reduction ambiguity involved would take far too many rules to resolve), but it's of significant, if wholly intuitive, importance: when assigning to a sequence, all of its expressions must be local identifiers. I can't imagine anyone trying to do otherwise, but it can only be enforced at the semantic level, so it needs to be stated.
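A semantic check of that sort could be sketched as follows (a hypothetical validator; the real enforcement belongs to a later interpretation stage), walking the sequence's expressions and rejecting anything that isn't a local identifier:

```python
# Constants mirrored from the article's definitions.
SEQUENCE = 40
TERM_IDENTIFIER_LOCAL = 30

def validate_assignment_target(sequence_node):
    """Ensure every expression in an assigned-to sequence is a local
    identifier; returns the list of names, or raises ValueError."""
    (tag, expressions) = sequence_node
    assert tag == SEQUENCE
    names = []
    for expression in expressions:
        if expression[0] != TERM_IDENTIFIER_LOCAL:
            raise ValueError("sequence-assignment target must be a "
                             "local identifier: %r" % (expression,))
        names.append(expression[1])
    return names

# [a, b] = ... is fine; [a, 5] = ... would raise.
ok = validate_assignment_target(
    (SEQUENCE, [(TERM_IDENTIFIER_LOCAL, 'a'), (TERM_IDENTIFIER_LOCAL, 'b')]))
```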

Next are function-calls, which are pretty straightforward, as long as the argumentset rule made sense:
[python]
def p_functioncall_scoped(p):
    r"""
    functioncall : IDENTIFIER_SCOPED LPAREN argumentset RPAREN
    """
    p[0] = (FUNCTIONCALL_SCOPED, p[1], p[3])

def p_functioncall_local(p):
    r"""
    functioncall : IDENTIFIER_LOCAL LPAREN argumentset RPAREN
    """
    p[0] = (FUNCTIONCALL_LOCAL, p[1], p[3])

def p_functioncall_test_undefined(p):
    r"""
    functioncall : UNDEFINED LPAREN IDENTIFIER_LOCAL RPAREN
    """
    p[0] = (FUNCTIONCALL_UNDEFINED, p[3])
[/python]
Local and scoped functioncalls have the same syntactic structure; the difference exists simply for optimization reasons at the interpreter level (though scoped calls will still need to check local variables, in case the function being called exists on an object referenced locally). An example of a function-call follows:

[code]
interface.get_greeting(
    time_of_day=times.EVENING,
    mood=emotions.CHEERFUL,
)
[/code]

would probably return the string "Good evening!". Some functioncalls, however, are exempt from the argumentset semantics; these are limited to protected functions identified by two leading underscores, like __undefined(x), as described in the previous article.

Last are the terms, which are the parts of the language that actually have concrete meaning:
[python]
def p_term(p):
    r"""
    term : STRING
         | INTEGER
         | FLOAT
    """
    if isinstance(p[1], basestring):
        p[0] = (TERM_STRING, p[1])
    elif type(p[1]) in (int, long):
        p[0] = (TERM_INTEGER, p[1])
    elif type(p[1]) == float:
        p[0] = (TERM_FLOAT, p[1])

def p_term_special(p):
    r"""
    term : NONE
         | TRUE
         | FALSE
    """
    if p[1] == 'None':
        p[0] = (TERM_NONE, None)
    elif p[1] == 'True':
        p[0] = (TERM_BOOL, True)
    elif p[1] == 'False':
        p[0] = (TERM_BOOL, False)

def p_term_identifier_scoped(p):
    r"""
    term : IDENTIFIER_SCOPED
    """
    p[0] = (TERM_IDENTIFIER_SCOPED, p[1])

def p_term_identifier_local(p):
    r"""
    term : IDENTIFIER_LOCAL
    """
    p[0] = (TERM_IDENTIFIER_LOCAL, p[1])
[/python]
Nothing should need to be explained here; these values are largely just passed through from the lexer, with processing applied only to the reserved values, as they're converted into Python values on discovery.

The only thing left is PLY's error-handling function. As decided for the lexer, the nature of this language precludes the availability of a semantic validation step (except when the interpreter is run in test-mode, for people to assert the structural and logical correctness of their code). For that reason, it will raise ValueError at the first sign of trouble:
[python]
def p_error(t):
    if t:
        raise ValueError("Syntax error; offending token on line %(line)i: %(token)s" % {
            'line': t.lineno,
            'token': repr(t.value),
        })
    else:
        raise ValueError("Unexpectedly reached end of file")
[/python]
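To see what that handler actually emits, here's a self-contained demo that copies it and feeds it a stub token. The StubToken class is my own; it just mimics the two attributes of a PLY LexToken that p_error reads:

```python
class StubToken:
    """Mimics the lineno/value attributes of a PLY LexToken."""
    def __init__(self, lineno, value):
        self.lineno = lineno
        self.value = value

# Copy of the handler above, reproduced so this demo runs standalone.
def p_error(t):
    if t:
        raise ValueError("Syntax error; offending token on line %(line)i: %(token)s" % {
            'line': t.lineno,
            'token': repr(t.value),
        })
    else:
        raise ValueError("Unexpectedly reached end of file")

try:
    p_error(StubToken(3, '}'))
except ValueError as e:
    message = str(e)
```

The resulting message pinpoints the line and shows the offending token's repr, which is all the context-free feedback a parse failure can reasonably give.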

At this point, the language’s structure has been established, so we can digest any valid script and identify problems with invalid ones. However, nothing can yet be processed; that's the role of the interpreter. I'll be describing the parser next, which serves as the gateway between the language's grammar and the interpreting infrastructure; it represents the point where many of you will probably diverge from this series, since the language will start to lose its flexibility as it's cast into an engine that defines its usage for my specific needs.
