Making a better programming language

Today, I write about making my dream programming language. Again.

Writer’s Note: I recommend reading my first post Making a programming language before reading this one! This post contains complex jargon which is explained in that previous one.

Towards the end of summer last year, I created a programming language which I called “my dream programming language”. It was a purely functional programming language with dynamic types that included the core features of set data structures, pattern matching and guards. At the time, I had some basic knowledge of JavaCC and created an interpreter for the language using a SECD machine. However, it had a number of pitfalls:

Brackets are ignored, so the order of operations are not handled
There are no if statements, which makes the language difficult to use for new users
Despite being called a ‘pure’ programming language, it was not actually pure

So, this time I decided to create a better language, its successor so to speak, which resolves these issues and takes advantage of the new knowledge that I’ve gained by taking my university’s compiler design module.

A new language design

So, what does my new dream language consist of? What features does it definitely require for me to call it my dream language? After being exposed to the functional programming language Elm and after writing a purely functional programming language called the Untitled Lambda Calculus for my third year project (a smart package manager), I now have a much better idea of what my dream language consists of:

Pure. A pure programming language is one where its functions have no side effects (no mutation, or I/O) and its return value is the same for the same input arguments. My previous language was initially pure, but the introduction of trying to handle I/O caused it to be impure. Pure languages also have the perk of being easy to read and reason about, which can help developers write functions (and prove they work) easily.
Java interoperability. In the current world of programming languages, having a programming language that doesn’t have some form of interoperability with another language is pretty much useless. For example, take Kotlin. Kotlin can be used on the JVM, allowing it to call Java classes (and vice versa), but it can also compile down to JavaScript or native code via LLVM. Since I primarily program in Java and often want to code in a functional manner, having a method to call functional code is ideal without having to use additional libraries or frameworks.
Lazy evaluation. I love lazy evaluation. The idea of being able to only call what you actually need when you need it seems very smart and since my original programming language failed to have lazy evaluation, implementing lazy evaluation will instantly make this language better than my previous one.
A compiled language. My first programming language was interpreted using an SECD machine, however I found that I didn’t like that very much. My third year project’s Untitled Lambda Calculus was interpreted by reducing a given expression until it could not be reduced further. The downside of both of these methods is that they are currently interpreted. In order to have some form of Java-based interoperability, I would either have to write an interpreter for the language, or have it compile down to Java bytecode, in which Java can directly call functions from the language. Since Java bytecode is something that I’ve wanted to learn for some time, I want to try my hand at generating bytecode.
Less bloat. The biggest problem that I have with other programming languages that run on the JVM, such as Scala, Kotlin or Groovy is that they’re incredibly bloaty. If you have to ship an additional entire standard library (such as the Kotlin runtime) that consists of a tonnes of everything that’s pretty much already existent in Java (I’m looking at you kotlin.collections), then I don’t believe it’s a proper solution. It’s on the same level of converting a web application to a desktop application by shipping it with chromium (Who thought this was a good idea?!). My dream language must have as minimal bloat as possible.
Static types. My first language had dynamic types, but I found that with static types, it’s easier to create much more complex data structures. In addition, my third year project pretty much depends on static types and I thought that having a (decent) language that could interact with my third year project would be cool sub-project to undertake.

So that’s pretty much what I want in this programming language, so let’s make it!

Making the language

After completing my third year project, I decide to take the advice of my supervisor so I don’t make the same mistakes that I did when I created my first programming language and start small, then expand from there. I start with my trusty JavaCC project template to get a simple project structure with JavaCC. From there, I begin to plan out what the syntax would look like.

A new syntax

Since this programming language is pure, I decide to make it such that you can only call functions from some other programming language that runs on the JVM. This basically means that the language can’t “start” by itself - it has no main method. In other words, it’s a language used for creating libraries.

I decide to go with the Elm-like style for “exposing” functions, where functions and types to expose are declared at the top of the file. Since I want my language to be Java-like, I decide to use the keyword export (since it’s just the opposite of import).

For the syntax of function declarations, I want to keep the essence of Java’s syntax, but modify it to be some sort of functional/Java-like hybrid. I decide to use the < and > symbols which are used for generics in Java to refer to types in general, where nested types just consist of nested < and > symbols. In general, I don’t want to use : or :: to state types (used by Elm and Haskell respectively) since that’s a very ‘non-Java-like’ syntax.

  
Library MyLibrary {
    export identity;
} {
    identity input<Int> -> <Int> = input;
}

Now, I know it’s not the most beautiful syntax at this point, but hey, it’s subject to change if and when I feel necessary. While I am at it, I decide to to draft up syntax for guards, pattern matching, type declarations and enum definitions. With the syntax somewhat thought up, I decide to work on the parser. I’m much more confident with changing the grammar of the language as I was last summer, so I’m okay with having a non-concrete syntax for now.

Creating the language from the syntax

Initially, I try writing a parser to parse all of the constructs of the language - numbers, strings, lists, sets, function calls, type declarations, guards, pattern matching… you name it. After taking a break from development, I decide to start from scratch and implement each feature one at a time through to the end. By that I mean that each feature will be added in the following cycle:

Write the JavaCC code to parse it
Create the AST node to store the code structure
Write the code generation function for the code structure
Test it by calling it in a Java program

Since I’ve parsed many languages in the past, creating the parser for the basic constructs is simple. Similarly, creating the AST nodes is trivial. The difficulty lies in writing code to generate bytecode - I’ve never done it before!

Introducing ASM

For well over 3 years, I’ve known of ASM, a Java bytecode engineering library. It basically lets you create Java bytecode within Java. I personally believe that choosing to use ASM rather than convert the code to Java, then compile the generated Java code is much more interesting and could lead to better code optimization that would be lost if I tried to emit Java code.

To start off with, I begin with the ASM user guide, which is a very detailed and surprising well written PDF document that outlines how to use ASM to its full potential. Honestly, I’ve never read documentation so good in a very long time! It includes well explained diagrams and examples to help you understand what you need to know to fully use it. By combining this resource with the Wikipedia article on the Java bytecode instruction set, I feel ready to write bytecode! (In hindsight, I should have used the much more detailed official JVM instruction set page)

I also install Eclipse’s bytecode outlining plugin which helps analyse bytecode and also includes a feature to display bytecode, as well as the JBC Java bytecode plugin which lets you edit bytecode for compiled classes in Eclipse. Whether it will be useful, who knows, but it’s good to have it there in case I need it.

I discover ASM’s main workflow of having these “Visitor” methods which basically let you write information about classes. You start with a ClassVisitor which lets you define a class file, and then “visit” individual parts to add extra information to it. For example, to add a new method, you call classVisitor.visitMethod and begin writing code into the MethodVisitor. It sounds complicated, but it’s actually pretty intuitive - you’re basically writing lines of bytecode one at a time.

While writing this ASM, I encounter two major obstacles: line numbers and frames.

Line numbers

In Java, when an error occurs (due to a thrown exception for example), normally a stack trace is printed.

  
Exception in thread "main" java.lang.NumberFormatException: For input string: "3.1415"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at ExampleCli.parseNumericArgument(ExampleCli.java:47)
    at ExampleCli.parseCliOptions(ExampleCli.java:27)
    at ExampleCli.main(ExampleCli.java:11)

Stack traces show the name of the class, the method name and most importantly the line number where the code went wrong. Having this line number is crucial for debugging where your error is in your code. Whether my language will have errors or not is still something I have to find out myself (with it being pure and all that, it shouldn’t have errors, but maybe I want undefined like in Haskell?)

In order to add the line number to a statement in a method, you basically need to use the following block in ASM, where LINE_NUMBER represents the line number:

  
Label lineNumber = new Label();
methodVisitor.visitLabel(lineNumber);
methodVisitor.visitLineNumber(LINE_NUMBER, lineNumber);

In order to do this, I need to actually record the line numbers in the first place! Luckily, JavaCC records the beginning and end line numbers for tokens (beginning is the line of the first character of a token and end is the line of the last character of a token). By adding line numbers to my AST nodes, I am able to record line numbers so in the event that everything goes pear-shaped, people can find out where their code went wrong.

Frames

Honestly, this section is so confusing I pretty much don’t understand it. If you don’t want to get confused by my confusion, you can just read the last paragraph of this section. If you want to have a bad time, read section 3.1.5. Frames from the ASM guide.

In Java, you have threads. Each thread consists of an execution stack which is where instructions are executed. The things that actually go onto the stack are called frames. Every time a method is called in Java, a new frame is pushed onto the stack and when the method returns, that frame is popped off of the stack. Without going into too much detail, there’s also this set of stack map frames which basically says “this is what the type of the frame is after instruction X”. So for example (this is a big of a simplification, but bear with), if you call a method that returns an integer, the type of the frame would be an integer.

As of Java 6, you basically need to include this set of stack map frames which should indicate the state of frames before jump operations, and instructions that follow jump instructions. Trying to figure out what the type of frames are between jumps seems to be an incredibly complicated task - especially since I’m implementing a functional language which is wildly different to Java’s imperative structure.

Luckily, instead of trying to figure out all of the complicated structure of what state is what after what instruction, ASM includes this flag COMPUTE_FRAMES that simply computes the structure stuff for you.

Conclusions

Learning the ins and outs of ASM definitely took some time (a few days), but it was no insurmountable task. The well written guide definitely helps in making sure you know exactly what each method in the ASM library does and once you understand how the JVM works under the hood (as a stack machine), it is surprisingly simple to use. Of course, since you’re dealing with a much lower level of programming, trying to understand jumps (such as if statements) requires a bit of thinking, but this is what I love the most about programming.

I am still in the process of writing this new programming language. The syntax is still very on the fence, there are some things about it that I’m unhappy with. Overall, I’ve gained invaluable knowledge about how to use ASM and the Java bytecode instruction set. Even though the project is in its infancy, I believe it will be able to grow into a much more powerful programming language that I hope to use in future Java projects!

Making a better programming language

A new language design

Making the language

A new syntax

Creating the language from the syntax

Introducing ASM

Line numbers

Frames

Conclusions

Further Reading

Using GitHub Actions

Making the Elm-Brainfuck-IDE

Java Tips & Tricks (1/2)