Making a programming language

Today, I write about an attempt to make my dream programming language.

So I decide to make a programming language. Why? Well, I recently began to dislike Java for being very bloated and I didn’t like how Java 9 began implementing their module system¹. I also recently (in the past year or so) have discovered and enjoyed the Nix programming language which is used to describe Nix packages, as well as build instructions and operations to handle them properly. However, by no means is Nix the “perfect” language that I’ve always wanted. I believe the syntax is a little unsettling (who separates elements of a list with a space?) and the lack of basic pattern matching was not desirable.

Features of my dream programming language

To begin with, I have to think about what features I want in my programming language. I start by creating a basic “syntax” file, which outlines what I plan the syntax of the programming language to look like. I decide to base my programming language on the Nix programming language in the sense that:

Sets are the primary data structure in the programming language. They consist of key-pair values and that’s where functions are declared. You have a set of functions. This provides a modular system where you can declare a set of functions in a file, and import it in another file. The use of sets also gives ride to basically any sort of “data structure” you could imagine. For example, if you want to represent a complex number, it would be represented as a set with say, attributes real and imaginary, to represent both the real and imaginary parts of the number.
The programming language is pure. A purely functional programming language makes the code easier to understand and has no side-effects which are difficult to reason with.
The programming language is dynamically typed. Functions can take multiple types (for example, a function can take a string OR a number), without having to specifically stating what they return or receive as input. Despite the fact that having a statically typed language is something I strongly approve of (such as Java), I enjoy having functions that can take any specific type if needed.
The programming language doesn’t depend on indentation. I detest programming languages that use indentation to handle scope.

A programming language based on Nix alone isn’t really what I want though. There are features from other programming languages, specifically Haskell that I enjoy and would prefer to have in my programming language as well. In particular:

The programming language has pattern matching. Having the ability to match expressions against other expressions makes the programming language easier to comprehend and reduces bloatiness with multiple if-else statements.
Guards. Haskell’s guards are basically Boolean based switch cases. Being able to switch based on Booleans is great and is a neater way of doing things as opposed to using multiple if ... else if ... else ... statements.

And that’s basically what I want in a programming language. So let’s make it!

Making the language

Attempt 1: A transcompiler

Instead of making a whole interpreter for a programming language that is basically Nix, I decide to create a transcompiler that converts my program into a Nix expression. Surprisingly, this isn’t actually the easiest thing to do.

I try writing a Java program (Because I’m familiar with Java) that takes in a program with my given syntax and produces a Nix expression that represents my program. However, after about an hour into it, I realise that this is indeed the totally incorrect approach. Trying to write a program that is just a really glorious text stream editor to produce a programming language without any form of grammar or reasoning is bound to fail.

I contact a friend about my endeavors and they recommend that I use JavaCC, the Java Compiler Compiler which is used to create compilers in Java. Knowing full well the difficulties of trying to use JavaCC from a piece of university coursework, I decide against the idea of using JavaCC and instead decide to focus on creating some basic syntax.

Attempt 2: Using JavaCC

About three weeks later, I have the genius idea of using JavaCC to create a parser and interpreter for my programming language. Despite not being exactly what I want, it does have the additional feature of being cross-platform compatible and it’s Java - I’m confident with Java and it’s easy to handle interpreting things broken into an object oriented manner.

So, with any JavaCC project, I begin creating my grammar, declaring my tokens (keywords and symbols, such as comments etc.) and constantly test against my syntax file that I created to ensure that the grammar works properly. If the grammar works, then implementing the interpreter would be a breeze.

Implementing the grammar is pretty simple, you basically state what tokens come after what, and if more of them exist or not. For example, a function call could look like:

myFunction 2 3
(x -> x * x) 5
someConstantFunction

So you implement some basic grammar along the lines of:

functionCall() {} {
    (<VARIABLE> | lambda()) (expression())*
}

So a function call is defined as a variable token (which is defined as say, the regular expression [a-zA-Z]+, representing mixed-case letters), or a lambda, all of which is followed by one or more expressions.

Creating the interpreter

The problem with JavaCC that initially put me off from using it is trying to create an interpreter with it. You use JavaCC to create your grammar which basically tells you “yeah, this is a valid program” or “nah, this program has syntax errors in it”, but that alone isn’t enough to actually do anything. You have to interpret this grammar.

To do so, I decide to use an object-oriented approach (since I’m using Java, why not?) to basically wrap each major expression in some form of object describing it. For example, a set expression would consist of a mapping, (some form of hashtable) consisting of the keys (function names) and values (function bodies), or an operation expression (an expression consisting of an operator such as + or -) would consist of the left and right expressions, as well as the operator which is being used on the two expressions.

By putting everything into objects, it would mean that the entire program would produce a single object which can be passed into some evaluator and produce a result. However, this makes your lovely grammar look awful. Take the function call example from above:

functionCall() { 
    Token variableToken = null; 
    Object lambda = null; 
    Object expression = null; 
    FunctionCall myFunction = new FunctionCall();
} {
    (variableToken = <VARIABLE> | lambda = lambda())
    {
        if(variableToken != null) {
            myFunction.init.setInit(parseToken(variableToken));
        } else {
            myFunction.setInit(lambda);
        }
    }
    
    (
        expression = expression()
        { myFunction.addExpression(expression); }
    )*
    { return myFunction; }
}

Look at that mess! Such a simple thing has been turned into a complete monster of an expression! Nonetheless, I do this for all of my construct and manage to produce some form of final expression which can be evaluated.

Evaluating an expression

Evaluating an expression should be the easy bit - you basically do:

if expression is a variable:
    look up the variable
if expression is a string:
    return the string
if expression is a function:
    call the function
...

So with this idea in mind, I go ahead implementing a method that checks the expression and evaluates it. Everything seems to go well until we run into the issue of function calls. Surprisingly, function calls are not easy. To explain why, we first have to look at function calls in more depth.

Function calls in functional languages

So, with imperative languages (such as C, C++ and Java), function calls are pretty simple. You have a function name and it takes in a certain number of parameters. Really simple. For example, we have a function called myFunction and it takes in the two parameters param1 and param2, then performs some sort of operation on those parameters and returns a value (or not, if the function is void, but we’ll ignore that case for now).

myFunction(param1, param2) {
    //Do stuff with param1 and param2
}

In functional languages, this is “sort of” the case. Say we have our function myFunction again:

myFunction = param1 -> param2 -> someResult

Functions in functional languages only take zero or one parameter. So how does this work with the example above? Well, the example above is more like:

myFunction = param1 -> (param2 -> someResult)

The function myFunction has a single parameter param1 and then calls a new anonymous function (or lambda) which has a single parameter param2 which then evaluates to a result.

So how do we evaluate these weird nested functions? Well, at this point, I ask the computer science department’s expert on functional programming languages and they inform me of a machine called the SECD machine.

Introducing the SECD machine

In a nutshell, the SECD machine (Stack, Environment, Control, Dump machine) is a thing that evaluates lambda calculus expressions. It is designed to basically convert functions into Reverse Polish Notation², whilst simultaneously handling the environment (what parameters have been assigned to what values) properly.

So, like anything I don’t understand or have never heard of before, I head to the internet and find an article on Wikipedia that explains what the SECD machine is. I stare at a wall of text, which includes an “informal description” of how a SECD machine operates. I expect a picture or so, showing a step-by-step tutorial on how it works or something, similar to how Brilliant.org explains the Shunting Yard Algorithm, which is so clear and easy to read and understand.

Unfortunately, I don’t have that luxury and neither did a Google image search yield any promising results. From here, I decide to whip out my whiteboard and write out a dry run of the SECD machine using a simple program:

add = x -> y -> x + y;
add 2 3

With that, I begin to implement the SECD machine, and eventually manage to get function calls working in my programming language.

Syntax highlighting

With my programming language underway, it came to the point where I would be writing a lot more code in my programming language than Java, since the majority of it is working as intended. I decide to create some form of syntax highlighting to help identify my code (since my interpreter has little to no error messages other than Java exceptions).

As I am using Visual Studio Code, I research how to create syntax highlighting for a custom programming language. I read about how VS Code uses TextMate grammars to identify programming languages and essentially, if I want syntax highlighting, I have to create a TextMate grammar for my programming language. I read about how to set up a project to do so, luckily there’s some simple program you can run that sets up project directories and files to create a VS Code extension.

After setting up my extension directories and files, I soon learn that the rest of the VS Code extension syntax highlighting guide doesn’t actually tell you how to make a TextMate grammar - all it does is link you to various other webpages that explain how to do so, which are rather content heavy, but explain how to create a grammar rather easily.

I create my grammar for my programming language and (after updating VS Code), see it work in action and bask in the glory of having colour coded code.

Feature creep

Wikipedia’s definition for feature creep is pretty good:

Feature creep is the excessive ongoing expansion or addition of new features in a product, especially in computer software, videogames and consumer and business electronics.

During the process of creating my programming language, new ideas constantly came to mind of things I could add to improve the functionality of the language. Since nobody was going to stop me, I decided to implement a few extra features for … aesthetic purposes.

Nested comments

Initially, comments in my programming language had two different definitions, as common with most programming languages:

# Single line comments use a single hash

###
Multi lined comments
Use three hashes, and
end with three hashes
###

It was truly a game changer when I discovered that Haskell lets you have block comments nested within block comments. For example, in Java, the following comment below would cause a compilation error as the first opening block comment token /* matches with the first closing block comment token */, which leaves an extra */ hanging.

/*
    /* Some comment */
*/

With this idea in mind, I decide to implement nested comments into my programming language. It seems easy at first - if you’ve ever done any study of computer science data structures, you will probably have learnt of a “stack” structure, which can be used to keep track of say, opening and closing brackets of a program, to figure out which opening bracket matches with which closing bracket.

However, with a programming language defined using a grammar, stacks aren’t really “there” to enable nested block comments. Basically, the problem boils down to trying to create a regular expression that allows for nested comments. It’s very difficult, hence why most programming languages do not have support for nested block comments.

Nevertheless, JavaCC does in fact have support for such complex grammars, by utilizing lexical states. This basically means you can have a state where your parser reads your program, then when it reads a token (say, an opening block comment), it can switch to (for example) a “comment mode”, where everything read is now read as a comment, instead of a regular program. I can then implement some simple code that adds an opening comment to some structure, and remove it when a closing comment is found. If the structure is empty, then it exists “comment mode” and begins to read the rest of the program.

Since I searched far and wide across the internet to solve this specific problem, I think it’s only fair to share how I solve it:

  
SKIP : { "#[" : BlockComment }
<BlockComment> SKIP : { < "#[" > { Compiler.innerComments++; } }
<BlockComment> MORE : { < ~[] > }
<BlockComment> SKIP : { < "]#" >
    {
        if(Compiler.innerComments == 0) {
            SwitchTo(DEFAULT);
        } else {
            Compiler.innerComments--;
        }
    }
}

Step by step, using a counter as our “stack”:

If it encounters a #[, skip it (do not try to interpret it). Then, enter the lexical state BlockComment
If it is in the lexical state BlockComment and it reads a #[, increase a counter (Declared as an integer within the Compiler class).
If it is in the lexical state BlockComment and it reads anything else (Denoted by ~[]), continue
If it is in the lexical state BlockComment and it reads a ]#, then if the counter is 0, then it is now at the last closing block comment, and should interpret the program as normal by switching to the default lexical state. Otherwise, it has found nested closing block comment and should decrease the counter.

Type annotations

Having a dynamically typed language does make things look elegant, for example an add function that sums the values of two parameters:

add = x -> y -> x + y;

However, I found that with Nix, I struggled to grasp what functions required as input in order to run them. The only way to properly find out was to use ~~the (non-existent) documentation~~ trial and error in the Nix repl. So, inspired by Python, I decide to add type annotations that perform extra checks at runtime to ensure that parameters are of the right type (unlike with Python, where it doesn’t even check it!).

It only serves as a simple error checking method, which fails to run if one of the parameters supplied is not of the right type, however I personally think it is a nice touch and helps programs be “safer” and makes it easier to understand for other developers that read your code.

Personally, I think this:

add = x:num -> y:num -> x + y;

is significantly more elegant than this:

add = x -> y -> 
    if (isNumber x && isNumber y) then x + y
    else error

No numerical restrictions

So, my programming language (unlike say, Java) doesn’t have any differentiation between integers and decimal numbers. All integers are converted into decimal numbers (e.g. 1 becomes 1.0) for the sake of evaluation. However, with any floating point arithmetic, floating point errors (inconsistency with decimal numbers, due to computers not being able to handle them properly) occur.

In addition, most programming languages (Java, C etc.) have a maximum range on the size of the largest number it can store. However, Python does not have this restriction, so my programming language officially has some inspiration from Python.

Instead of using a double or a long to store integers and floating point numbers for my programming language, I decide to use BigDecimal - a class provided by the standard Java library that lets you use arbitrary precision numbers (numbers of any level of precision). Of course, it also means that it has no upper bound or lower bound. However, an issue with the BigDecimal class is that dividing numbers can produce errors. For example, if we have the following code:

new BigDecimal(1).divide(new BigDecimal(7))

If you try to run it, it would produce an ArithmeticException , since the number cannot be represented (it has an infinite number of decimal places). Therefore, you have to include a “scale”, which is basically how many decimal places your number can have. Obviously, 128 sounds like a fair number, so my numbers in my language officially have up to 128 decimal places of precision. Because why not.

Pitfalls

With a project as ambitious as this, you can’t possibly expect everything to go exactly as you plan it to, and indeed, there are some major flaws in the design and creation of this programming language.

Lazy evaluation

Since the start, I want my programming language to be lazily evaluated, similar to Haskell and Nix, which is where my language got its inspiration from. Lazily evaluated programming languages have a few sneaky perks, in particular infinitely sized data structures, which would be useful for evaluating infinitely large trees of sets. The major downsides of using lazy languages is that it’s difficult to find out where certain parts of the program are evaluated, and can lead to minor memory inefficiencies - but I’m writing this in Java, so memory is not the slightest concern.

During creating this programming language, I find that my language has some parts which are evaluated lazily, and some parts which are evaluated strictly (as soon as possible, as opposed to when needed). After noticing this, I begin to change things to lazy evaluation before reaching a road block where I found the implementation of lazy evaluation too complicated and a lot of my programming language failing to evaluate anything properly. By a “road block”, I mean that I have to basically rewrite the entire evaluation system from scratch.

As of that, my programming language is officially strictly evaluated.

Brackets

Towards what I believe is the end of the creation of this language, I realise that my grammar is not as well defined as I originally thought. Function calls are not being interpreted properly, the order of operations are out of whack and function calls inside function calls are not being evaluated properly. Take the following code for example:

add (add 2 3) 4

In order to evaluate this, it needs to interpret each argument at a time. Or say we have the following code:

3 * (1 + 2)

This does in fact produce the answer 5 when run by my language, whereas it should produce the answer 6. What is the problem here? My programming language completely and utterly ignores brackets.

If statements

One of the fundamental features of any programming language is the if statement, which is used to manage control through a program. Surprisingly, I realise this only after I implemented standard set operations and multi-lined strings. My solution to this? Completely ignore it. That’s right, my dream programming language just doesn’t have if statements, because I forgot about it.

Luckily, using guards allows me to essentially perform if statements, by checking predicates (Boolean expressions) and evaluating certain results based on the outcome. After realizing that guards are basically glorified if statements, I decide to keep guards as the only way of managing conditionals over Boolean expressions, as I dislike having a language where there are multiple ways to do a certain action (For example, trying to open a file in say, Java compared to say, Python).

Purity checks

If you’ve made it this far into the blog post, don’t worry, this is the last section before the conclusion. With this programming language, I want to have a system that detects whether your code is pure or not. With the Nix programming language, despite it claiming to be purely functional, various built in functions, such as fetchTarball or path perform impure operations. In order to correct this, I decide to have all functions that perform something that is impure have to be declared with an @ symbol before them (For example, @print), and any set that contains an impure function must be declared with the keyword impure.

Ideally, this would let you know beforehand whether the code you’re writing or using performs any impure operations and you can make judgements on whether you will use it or not. However, the functional programming lecturer informed me that this strategy will not work, because of higher order functions.

Say we have a function map, which takes in two inputs: a function and a list. It applies the function to each element in the list and returns the new list, for example: map (x -> x + 2) [1, 2, 3] would result in the list [3, 4, 5].

But what if the function we supply to the map function is impure? Would the map function be deemed impure? After much discussion with my functional programming lecturer, they informed me that there isn’t really a single solution for this, however there are solutions that do “work”. In particular, monads which basically shove things in containers, and effect systems, which are similar to the throw Exception and try {} catch(Exception) {} constructs in Java.

After much debate on whether I should or should not implement an effect system into my programming language, I decide to leave the purity checks up to the developer. Not quite what I want, since knowing about whether code is pure or not is something I really wanted, but the additional complexity of adding an effect system is not something I plan to add anytime soon.

Conclusion

So I made a programming language. It’s not perfect - there are still defects here and there which need to be ironed out. But the things that I’ve learnt by making it is truly something else. I’ve learnt about programming language design, various ways to evaluate expressions, various ways to write complex algorithms when I don’t get it the first time and what it’s like to have feature creep in a project! I hope by sharing this experience with you, you too will learn a bit about something about making a programming language and I hope you don’t make the same mistakes that I did.

I’m sure the module system is great, but since Java 9 is still very “new” and not widely used, and the fact that Java package managers such as Gradle or Maven exist, it seems like a pointless necessity that overcomplicates how Java projects are already used. ↩
Reverse Polish Notation is basically a sequence of operators and operands that makes expressions easier to interpret. Instead of say 1 + 2, you would use 1 2 +. Or say, instead of 5 * (2 + 3), you would have 5 2 3 + *. You can read more about Reverse Polish Notation and how it works here. ↩