Creating a scripting language: establishing a foundation

Anyone who’s trawled through my projects knows that I love Python: it’s simple, it’s flexible, it does almost everything well, and it just works. It’s also a very good fit for this project, since part of the system driving it is being implemented using Boost, which means that Boost.Python is just a small step away from setting up a cross-language control structure. (Before anyone complains, I’ll freely concede that something like Haskell is an ideal candidate for building a new language from scratch, but I’m cheating and using an existing compiler-compiler to skip the parts functional languages are best at)

Within the realm of Python, a number of good compiler-compilers exist, each with its own pros and generally very few cons (some are implemented in C, some offer great insight into the compilation process). Of what’s available, I decided to use David Beazley’s excellent PLY, for its extensive documentation, maturity, clear expression methodology, and portability (I need to be able to support Windows, too, unfortunately).

Getting started was easy enough: I cobbled together knowledge from the PLY documentation and a practical example found at http://blog.deliciousrobots.com/2010/3/24/a-simple-compiler-with-python-and-ply-step-1-parsing/, making sure I could do some really basic tasks, to prove I wasn’t wasting my time. Then I spent a week drawing automata and reducing grammar rules on paper to make sure I could both support my conceptual model of a complex script to be handled by the language and keep things simple enough for others to maintain. (This happened concurrently with Rorona-related gaming, so it may have taken less time, had I been entirely focused)

Once I knew what I wanted, I set about writing the lexing rules, which are what PLY uses to figure out what characters are valid in a script and where they may appear, breaking everything into easy-to-digest tokens, like integers, strings, identifiers, and syntax elements like braces. What follows is its current form, since I added some more features as I was implementing the grammar (like user-definable functions, sequences, and some control expressions). I’m breaking the code up into pieces here to make it easier to explain things; the full file is available on Launchpad.

To start with, I needed to identify tokens that could conflict with user-defined identifiers (variable names and the like) and give them a PLY-known identifier value. This is done with the reserved dictionary, which maps a literal token to a special identifying tag, itself also a string:
[python]
reserved = {
    #conditionals
    'if': 'IF',
    'elif': 'ELIF',
    'else': 'ELSE',
    'while': 'WHILE',
    #states
    'None': 'NONE',
    #literals
    'True': 'TRUE',
    'False': 'FALSE',
    #operators
    'and': 'AND',
    'or': 'OR',
    'nand': 'NAND',
    'nor': 'NOR',
    'xor': 'XOR',
    #functions
    '__undefined': 'UNDEFINED',
    #statements
    'goto': 'STMT_GOTO',
    'return': 'STMT_RETURN',
    'exit': 'STMT_EXIT',
    'break': 'STMT_BREAK',
    'continue': 'STMT_CONTINUE',
}
[/python]
As you can see, I’m drawing a fair bit of inspiration from Python for language syntax elements, with elif instead of C’s wordier else if or the ugly elseif, capitalised singleton-identifiers (and None instead of NULL), and boolean operators written as words, rather than ugly ||-like things. There’s also allowance for a protected function namespace (the two leading underscores) on __undefined, which is a special handler that accepts a local identifier, like __undefined(x), and returns a boolean stating whether x is a known variable; this is because the working idea is for an undefined variable to pretend to be None, rather than raising an error, much like JavaScript behaves — this will be controllable with a script-settable “strict” flag, of course.
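Since that lookup behaviour is easy to misread, here’s a minimal Python sketch of the idea; the class and method names are purely illustrative, not part of the real interpreter:
[python]
class Namespace:
    #hypothetical sketch of the planned lookup semantics
    def __init__(self, strict=False):
        self.strict = strict
        self.variables = {}

    def resolve(self, identifier):
        if identifier in self.variables:
            return self.variables[identifier]
        if self.strict: #the script-settable "strict" flag
            raise NameError("undefined variable: %s" % identifier)
        return None #undefined variables pretend to be None

    def undefined(self, identifier): #what __undefined(x) would consult
        return identifier not in self.variables
[/python]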

You’ll also notice that there are two unusual statements: goto and exit; the latter should make immediate sense, as a means of exiting a script immediately, with an optional value, but the former is often associated with bad practice in most languages, so why is it here? The answer is that I want to limit how big this language can get, to avoid recreating a scenario similar to what’s driving its development now: too much logic being shoved into a place where it doesn’t belong. The language will use a simple, flat local namespace, executing instructions grouped together in nodes (small stanzas), with control being passed from node to node as the flow progresses, to allow construction of a graph; goto is the statement used to initiate a jump, as sketched below.
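To make the node-and-goto model concrete, here’s a rough Python sketch of the execution loop; the structure is hypothetical, but it captures the intent: each node runs against the shared flat namespace and names the next node (or ends the flow):
[python]
def run_script(nodes, entry, namespace):
    #run nodes until one returns None instead of a goto target
    current = entry
    while current is not None:
        current = nodes[current](namespace)

#example: a trivial two-node graph
def node_start(namespace):
    namespace['x'] = 0
    return 'loop' #goto loop

def node_loop(namespace):
    namespace['x'] += 1
    return 'loop' if namespace['x'] < 3 else None #loop back or end

run_script({'start': node_start, 'loop': node_loop}, 'start', {})
[/python]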

Conversion functions, for the curious, will be part of a lang namespace, like lang.round(x, 2), shipped as an internal interpreter module. This is why there’s no reservation for them.

The next step was to tell PLY what other sorts of tokens exist in this language. This is done with the tokens list:
[python]
tokens = [
    'IDENTIFIER_SCOPED',
    'IDENTIFIER_LOCAL',
    #literals
    'STRING',
    'FLOAT',
    'INTEGER',
    #tests
    'EQUALITY',
    'INEQUALITY',
    'GREATER_EQUAL',
    'GREATER',
    'LESSER_EQUAL',
    'LESSER',
    #assignments
    'ASSIGN',
    'ASSIGN_ADD',
    'ASSIGN_SUBTRACT',
    'ASSIGN_MULTIPLY',
    'ASSIGN_DIVIDE',
    'ASSIGN_DIVIDE_INTEGER',
    'ASSIGN_MOD',
    #operators
    'MULTIPLY',
    'DIVIDE',
    'DIVIDE_INTEGER',
    'ADD',
    'SUBTRACT',
    'MOD',
    #delimiters
    'LPAREN',
    'RPAREN',
    'LCURLY',
    'RCURLY',
    'LSQUARE',
    'RSQUARE',
    'COMMA',
    'SEMICOLON',
] + list(reserved.values())
[/python]
Ordering matters here, though not in the tokens list itself: PLY tries function-defined rules in the order they appear in the file, and sorts string-defined rules by decreasing regular-expression length, which is how conflicts like == versus = get resolved; if = could match first, two = tokens would be output instead of one ==. It’s also important to note that the reserved values are added to this list at the end: the author is responsible for mapping them when they’re found as part of another token’s spec, but PLY uses the tokens list to perform all of its lookup operations; this will make more sense after the next section.
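To see the ordering behaviour in isolation, here’s a self-contained toy lexer (illustrative only, not part of this language) that uses function rules, where defining EQUALITY before ASSIGN is what keeps == from being split in two:
[python]
import ply.lex as lex

tokens = ('EQUALITY', 'ASSIGN')

#function rules match in definition order, so EQUALITY must come first
def t_EQUALITY(t):
    r'=='
    return t

def t_ASSIGN(t):
    r'='
    return t

t_ignore = ' '

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('== =')
print([token.type for token in lexer]) #['EQUALITY', 'ASSIGN']
[/python]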

The first thing that needs clarifying is the difference between IDENTIFIER_SCOPED and IDENTIFIER_LOCAL: a scoped identifier is anything that has a dot in it, meaning that it needs to be looked up outside of the currently executing flow’s namespace; a local identifier, on the other hand, is anything without a dot, like a conventional variable name.
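As a quick illustration, here’s how the two identifier patterns (taken from the lexing rules shown a little further down; the variable names are just examples) split those cases, using nothing but Python’s re module:
[python]
import re

SCOPED = r'[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)+'
LOCAL = r'[a-zA-Z_][a-zA-Z0-9_]*'

print(bool(re.fullmatch(SCOPED, 'config.timeout'))) #True: has a dot, so scoped
print(bool(re.fullmatch(SCOPED, 'timeout'))) #False: no dot
print(bool(re.fullmatch(LOCAL, 'timeout'))) #True: a conventional local name
[/python]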

The other tokens should be pretty self-evident (though they’re expanded below), with the exception of DIVIDE_INTEGER. This is just floor-division, made immediately accessible using the \ notation popularised by Visual Basic, rather than Python’s more obscure //.
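In Python terms, \ maps straight onto the // operator (assuming the language inherits Python’s flooring behaviour for negative operands):
[python]
print(7 // 2) #3: floor division, what the language's \ operator maps onto
print(-7 // 2) #-4: Python floors toward negative infinity
[/python]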

The final important step is providing definitions for all of these token-types, so that PLY knows what it’s actually looking for. This is done through automatic reflection of the module’s namespace, looking for variables and functions prefixed with t_, and extracting their value or docstring, to build a matching regular expression. If a function is used, it’s also called to transform the value, to make further processing easier.
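The mechanism is simple enough to sketch in a few lines; this is just an illustration of the idea (collect_rules is a made-up name, and PLY’s real implementation does more, like sorting the string-defined rules by length):
[python]
import sys

def collect_rules(module_name):
    #scan a module for t_-prefixed names, pulling the pattern from the
    #variable's value or, for functions, from the docstring
    module = sys.modules[module_name]
    rules = {}
    for name in dir(module):
        if name.startswith('t_'):
            rule = getattr(module, name)
            rules[name[2:]] = rule.__doc__ if callable(rule) else rule
    return rules
[/python]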

Here’s what I have, in the same order as the token-types (though this is not a requirement):
[python]
def t_IDENTIFIER_SCOPED(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)+'
    t.type = reserved.get(t.value, 'IDENTIFIER_SCOPED')
    return t

def t_IDENTIFIER_LOCAL(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER_LOCAL')
    return t

def t_STRING(t):
    r'(?:"(?:[^"]|(?<=\\)")*"|\'(?:[^\']|(?<=\\)\')*\')'
    t.value = t.value[1:-1]
    for (escape, replacement) in (
        ('\\\\', '\\'),
        ('\\b', '\b'),
        ('\\t', '\t'),
        ('\\n', '\n'),
        ('\\a', '\a'),
        ('\\r', '\r'),
        ('\\"', '"'),
        ("\\'", "'"),
    ):
        t.value = t.value.replace(escape, replacement)
    return t

def t_FLOAT(t):
    r'-?\d+\.\d+'
    t.value = float(t.value)
    return t

def t_INTEGER(t):
    r'-?\d+'
    t.value = int(t.value)
    return t

t_EQUALITY = r'=='
t_INEQUALITY = r'!='
t_GREATER_EQUAL = r'>='
t_GREATER = r'>'
t_LESSER_EQUAL = r'<='
t_LESSER = r'<'

t_ASSIGN = r'='
t_ASSIGN_ADD = r'\+='
t_ASSIGN_SUBTRACT = r'\-='
t_ASSIGN_MULTIPLY = r'\*='
t_ASSIGN_DIVIDE = r'/='
t_ASSIGN_DIVIDE_INTEGER = r'\\='
t_ASSIGN_MOD = r'%='

t_MULTIPLY = r'\*'
t_DIVIDE = r'/'
t_DIVIDE_INTEGER = r'\\'
t_ADD = r'\+'
t_SUBTRACT = r'\-'
t_MOD = r'%'

t_LPAREN = r'\('
t_RPAREN = r'\)'
t_LCURLY = r'{'
t_RCURLY = r'}'
t_LSQUARE = r'\['
t_RSQUARE = r'\]'
t_COMMA = r','
t_SEMICOLON = r'\;'

def t_newline(t):
    r'(?:\r\n|\n|\r)+'
    t.lexer.lineno += len(t.value) - t.value.count('\r\n')

t_ignore_whitespace = r'[ \t]'
t_ignore_comment = r'(?:\#|//).*'
[/python]
You’ll notice that both of the identifier-types defer to the reserved keyword set to avoid conflicts. A bit less obvious is that the string type allows for escaped quotations and expression using either double- or single-quotes (personal preference), with the actual encapsulating quotation marks stripped off at this stage. Integers and floats are converted into native Python data-types at this stage, too.

Everything else is just a clear token-format-mapping, with escaping where necessary, in accordance with the syntax of Python’s re module.

Newlines are ignored (though they increment a counter for syntax-error reporting), and whitespace and line-comments (both shell-style # and C++-style //) are stripped, too.

Lastly, an error-handling function is needed to report syntax errors. The interpreted nature of this language makes it feasible and necessary to stop processing on a lexing error, rather than waiting for the semantic phase to stop execution, so an exception is raised at the first sign of trouble:
[python]
def t_error(t):
    raise ValueError("Illegal character on line %(line)i: '%(character)s'" % {
        'line': t.lexer.lineno,
        'character': t.value[0],
    })
[/python]
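With everything defined, building the lexer and running a quick smoke-test is simple; a minimal sketch, assuming all of the rules above live in the calling module, which is where PLY looks by default:
[python]
import ply.lex as lex

lexer = lex.lex() #reflects over this module's t_-prefixed rules

#a well-formed fragment lexes into a stream of tokens
lexer.input(r'if (x >= 10) { y = x \ 3; }')
for token in lexer:
    print(token) #e.g. LexToken(IF,'if',1,0), LexToken(LPAREN,'(',1,3), ...

#an illegal character trips t_error and stops processing immediately
lexer.input('x = $')
try:
    for token in lexer:
        print(token)
except ValueError as error:
    print(error) #Illegal character on line 1: '$'
[/python]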

At this point, the language’s tokens have been defined, which basically means that an arbitrary string could be tossed in and we’d get a sequence of nicely separated tokens (or an exception), but it does very little for describing how the language looks or what sorts of structures it will accept. That stuff comes in the form of the grammar specification, and that’s described in the next article.
