Kotlin: Refactoring to a DSL
There’s a lot of interest in Kotlin right now, thanks to Google’s involvement, and rightfully so. My own voyage with Kotlin started a couple of years ago, when I tried it out by porting an old side project, SPIFF, the Simple Parser for Interesting File Formats. I was very quickly sold on Kotlin as whole chunks of code disappeared, and what was left became much more concise. How overjoyed I was, then, to discover over the last couple of months that for once I seem to have backed the right language horse. Blog posts about the killer features are ten-a-penny, so today we’re going to actually get hands-on, and see how we can refactor some imperative Kotlin code into something resembling a Domain Specific Language.
The aim of SPIFF is to provide a DSL for an executable specification of binary file formats. Examples of these sorts of files would be bitmaps, MP3s, or zip files. Parsing these files requires one to know the specification, e.g. the file starts with a literal string, followed by 2 bytes that represent the length of the header, followed by a byte for the length of the description, followed by a string of that length, etc. You often need to do some element of conditional computation, so the DSL needs to support some level of logic, branching and looping. In the original SPIFF, this DSL was implemented using JavaCC, and the specification file was “compiled” to a set of instruction objects that were executed at “runtime”. Here’s an example of reading ID3v1 tags from an MP3 file, using SPIFF’s DSL.
# fileLength is a special variable that is made available
# Jump to 128 bytes before the end of the file
.jump fileLength - 128
string('TAG') tagLiteral
string(30) title
string(30) artist
string(30) album
string(4) year
# Tricky! ID3v1.1 hijacks the last two bytes of the comment field, one for a null byte
# and one for the track number
# Assume it's v1.1
string(28) comment
byte zeroByte
# if the last byte was not a null byte, it must have been a v1.0 comment
.if(zeroByte != 0) {
# go back to where the comment started, and read 30 bytes instead
.jump &comment
string(30) comment
} .else {
# otherwise just carry on
byte trackNumber
}
byte genre
Hopefully it’s straightforward to understand. The only thing that is perhaps not immediately obvious is that prefixing a variable with an ampersand, such as in &comment, returns the “address” from which that variable was last read. This can be useful to jump back to a particular place in the file, or to calculate how many bytes have been read since a particular point, for instance if a section is padded to a particular length. For the record, SPIFF does not have any variable scope - if the same name is used for a variable in two places, or it’s used in a loop, it simply holds the last value that was read.
Our aim today is to implement the same code, but in Kotlin. Better than that, we’re going to start from a naïve piece of code for reading such files, and refactor our way to greatness! Let’s see how close we can get to the original SPIFF format.
Here’s that first implementation. Hopefully you’ll see at least one or two very obvious improvements we can make. If you’re following along, change the MP3 to a song of your choosing (but you won’t regret it if you seek out a version of Shaking Through).
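As a rough sketch of that starting point (a reconstruction, not necessarily the exact listing), a naive reader might look like the following. The function name printTagNaively, the path parameter, loading the file with File.readBytes() and decoding fields as ISO-8859-1 with the padding trimmed are all assumptions made for illustration.

```kotlin
import java.io.File
import java.nio.ByteBuffer

// Hypothetical naive ID3v1 reader; deliberately repetitive so there is something to refactor
fun printTagNaively(path: String) {
    // Set-up: load the whole file into a ByteBuffer
    val file = File(path)
    val buffer = ByteBuffer.wrap(file.readBytes())

    // Jump to 128 bytes before the end of the file
    buffer.position(file.length().toInt() - 128)

    // Read past the "TAG" literal (not validated at this stage)
    val tagBytes = ByteArray(3)
    buffer.get(tagBytes)

    val titleBytes = ByteArray(30)
    buffer.get(titleBytes)
    val title = String(titleBytes, Charsets.ISO_8859_1).trim('\u0000', ' ')

    val artistBytes = ByteArray(30)
    buffer.get(artistBytes)
    val artist = String(artistBytes, Charsets.ISO_8859_1).trim('\u0000', ' ')

    val albumBytes = ByteArray(30)
    buffer.get(albumBytes)
    val album = String(albumBytes, Charsets.ISO_8859_1).trim('\u0000', ' ')

    val yearBytes = ByteArray(4)
    buffer.get(yearBytes)
    val year = String(yearBytes, Charsets.ISO_8859_1).trim('\u0000', ' ')

    // Look 28 bytes ahead: a zero there suggests an ID3v1.1 tag
    buffer.mark()
    buffer.position(buffer.position() + 28)
    val zeroByte = buffer.get()
    buffer.reset()
    val isV11 = zeroByte == 0.toByte()

    // `if` yields a value, so `comment` can be an immutable val
    val comment = if (isV11) {
        val commentBytes = ByteArray(28)
        buffer.get(commentBytes)
        buffer.get() // consume the zero byte
        String(commentBytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
    } else {
        val commentBytes = ByteArray(30)
        buffer.get(commentBytes)
        String(commentBytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
    }
    val trackNumber = if (isV11) buffer.get().toInt() else null
    val genre = buffer.get().toInt()

    println("Title: $title")
    println("Artist: $artist")
    println("Album: $album")
    println("Year: $year")
    println("Comment: $comment")
    println("Track Number: $trackNumber")
    println("Genre: $genre")
}
```

Calling printTagNaively with the path to your MP3 from a main function should print one field per line.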
Run it, and you’ll get this (or something similar for your own file):
Title: Shaking Through
Artist: R.E.M.
Album: Murmur
Year: 1983
Comment:
Track Number: 10
Genre: 17
Hey, it works! You can sort of see how the code maps to the original SPIFF specification, with some slight changes when reading the comment so we can make use of the fact that, in Kotlin, if statements return values, which means we can make comment immutable.
If you’ve got any sort of spidey sense, you’ll already be itching to refactor that duplicated lump of code that reads a string from the buffer. ByteBuffer already has methods to read primitive datatypes - getLong(), getInt() etc. - but doesn’t have any methods for reading Strings, and by now you’re possibly yelling “Extension functions” at the screen. Let’s do it.
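A minimal sketch of such an extension function, assuming fixed-width fields encoded as ISO-8859-1 and padded with NUL bytes or spaces (the charset and the trimming are assumptions; the name getString follows the naming used later on):

```kotlin
import java.nio.ByteBuffer

// Hypothetical extension: read `length` bytes from the current position and decode them
fun ByteBuffer.getString(length: Int): String {
    val bytes = ByteArray(length)
    get(bytes) // implicit `this` is the ByteBuffer, so this is ByteBuffer.get(byte[])
    return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
}
```

With that in place, each three-line lump collapses to a single call such as buffer.getString(30).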
And we can use it in the main function:
So far, so un-DSL-y, but much more terse and readable. Next, we’ll attack that block of set up code at the top. We want our DSL to be the specification of the file, not the nuts and bolts of how to actually get the content from a file. Let’s hide it in another method.
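One way that helper might look, as a sketch: it loads the bytes and hands the buffer to whatever function the caller supplies. The parameter names and the use of File.readBytes() are assumptions.

```kotlin
import java.io.File
import java.nio.ByteBuffer

// Hypothetical helper: hides how the content is loaded, then passes the buffer to `callback`
fun binaryFile(path: String, callback: (ByteBuffer) -> Unit) {
    val buffer = ByteBuffer.wrap(File(path).readBytes())
    callback(buffer)
}
```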
and we’ll use it like this
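A sketch of the calling side, with the two helpers repeated so the snippet stands alone. The printTag wrapper and its path parameter are hypothetical, added only so the example can be run against any file.

```kotlin
import java.io.File
import java.nio.ByteBuffer

// Helpers repeated for self-containment (assumed shapes)
fun ByteBuffer.getString(length: Int): String {
    val bytes = ByteArray(length)
    get(bytes)
    return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
}

fun binaryFile(path: String, callback: (ByteBuffer) -> Unit) {
    callback(ByteBuffer.wrap(File(path).readBytes()))
}

fun printTag(path: String) {
    // The function literal is the last argument, so it sits outside the parentheses
    binaryFile(path) { buffer ->
        buffer.position(buffer.limit() - 128)
        buffer.getString(3) // the "TAG" literal
        println("Title: ${buffer.getString(30)}")
        println("Artist: ${buffer.getString(30)}")
        println("Album: ${buffer.getString(30)}")
        println("Year: ${buffer.getString(4)}")

        // Look 28 bytes ahead to tell v1.1 from v1.0
        buffer.mark()
        buffer.position(buffer.position() + 28)
        val zeroByte = buffer.get()
        buffer.reset()

        if (zeroByte == 0.toByte()) {
            println("Comment: ${buffer.getString(28)}")
            buffer.get() // the zero byte
            println("Track Number: ${buffer.get()}")
        } else {
            println("Comment: ${buffer.getString(30)}")
        }
        println("Genre: ${buffer.get()}")
    }
}
```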
In Java, you might have this method return the ByteBuffer, and then use it. In Kotlin, we can make use of the fact that functions are first-class, so we can pass them around. More than that, if a function literal is the last argument, you can move it outside the parentheses. So binaryFile takes a function, into which we pass the resulting ByteBuffer for it to work on. That function is really just the code that was already in our main function, only now it’s within the scope of the binaryFile that we’re working on, and that’s not a bad thing, right? Note that because we only have access to the buffer object, we need to use buffer.limit() instead of file.length(), but for our purposes they are the same thing.
If you’re playing along at home, you’ll see that the code in that block is starting to look a bit more like a specification. But we still have a lot of references to buffer everywhere, which is an implementation detail that the specification shouldn’t really care about. Functions with receivers to the rescue! A function with receiver is just a function in which this refers to an instance of the type preceding the dot. You already do this without knowing it when you use extension functions. So String.() -> Int is a function in which this will be a String, and which will return an Int. It doesn’t need to be a method already defined on that type - because you’re generally passing these as code blocks to another function, you can consider them as anonymous extension functions. Anywhere you pass a function with receiver, you could instead pass a function reference to a matching method already defined on the receiver type. In the case of String.() -> Int, you could pass String::toInt.
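To make that concrete, here is a tiny sketch. The helper applyTo and the wrapper receiverDemo are made-up names for illustration; String::toInt and length come from the standard library.

```kotlin
// `block` runs with `text` as its receiver, so inside it `this` is a String
fun applyTo(text: String, block: String.() -> Int): Int = text.block()

fun receiverDemo(): List<Int> = listOf(
    applyTo("1983") { length },    // code block with receiver: `length` means `this.length`
    applyTo("1983", String::toInt) // reference to a function that already exists for String
)
```

Here receiverDemo() evaluates to [4, 1983]: the first call measures the string, the second parses it.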
Instead of passing the ByteBuffer as a parameter to the code block, let’s make it the receiver. We just change the type signature of the callback parameter, and instead of callback(buffer), we do buffer.callback().
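In sketch form, assuming the helper loads the file with File.readBytes() as a stand-in for whatever set-up code you prefer, only the parameter type and the call change:

```kotlin
import java.io.File
import java.nio.ByteBuffer

// The callback now has ByteBuffer as its receiver rather than as a parameter
fun binaryFile(path: String, callback: ByteBuffer.() -> Unit) {
    val buffer = ByteBuffer.wrap(File(path).readBytes())
    buffer.callback() // was: callback(buffer)
}
```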
Because the ByteBuffer is now this in the code block, and this is implicit, we can just take away all the references to buffer.
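A sketch of how the block might read once the receiver is implicit. The helpers are repeated so the snippet stands alone, and the printTag wrapper is a hypothetical convenience.

```kotlin
import java.io.File
import java.nio.ByteBuffer

// Helpers repeated for self-containment (assumed shapes)
fun ByteBuffer.getString(length: Int): String {
    val bytes = ByteArray(length)
    get(bytes)
    return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
}

fun binaryFile(path: String, callback: ByteBuffer.() -> Unit) {
    ByteBuffer.wrap(File(path).readBytes()).callback()
}

fun printTag(path: String) {
    binaryFile(path) {
        // Every call below is made on the implicit `this`, the ByteBuffer
        position(limit() - 128)
        getString(3) // the "TAG" literal
        println("Title: ${getString(30)}")
        println("Artist: ${getString(30)}")
        println("Album: ${getString(30)}")
        println("Year: ${getString(4)}")

        mark()
        position(position() + 28)
        val zeroByte = get()
        reset()

        if (zeroByte == 0.toByte()) {
            println("Comment: ${getString(28)}")
            get() // the zero byte
            println("Track Number: ${get()}")
        } else {
            println("Comment: ${getString(30)}")
        }
        println("Genre: ${get()}")
    }
}
```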
Okay, now we’re really starting to get somewhere! What next? Well, that get() is a bit obscure now. It gets a single byte from the buffer. In the original SPIFF DSL, it’s a byte instruction. Methods that read from the buffer should represent the datatype you’re fetching, so get() becomes byte(), getString() becomes just string() etc. We can just define these as extensions on ByteBuffer. We’ll also add a skip() instruction to replace that unwieldy statement that moves 28 bytes ahead.
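A sketch of those three instructions as extensions. The decoding in string() (ISO-8859-1, padding trimmed) is an assumption carried over from the earlier helper.

```kotlin
import java.nio.ByteBuffer

// Read a single byte; named after the datatype rather than the mechanism
fun ByteBuffer.byte(): Byte = get()

// Read a fixed-width string field
fun ByteBuffer.string(length: Int): String {
    val bytes = ByteArray(length)
    get(bytes)
    return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
}

// Move the position forward by `count` bytes without reading anything
fun ByteBuffer.skip(count: Int) {
    position(position() + count)
}
```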
which gives us:
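Roughly, and with the helpers repeated so that the sketch compiles on its own (printTag is a hypothetical wrapper), the specification block might now read:

```kotlin
import java.io.File
import java.nio.ByteBuffer

// Helpers repeated for self-containment (assumed shapes)
fun ByteBuffer.byte(): Byte = get()
fun ByteBuffer.skip(count: Int) { position(position() + count) }
fun ByteBuffer.string(length: Int): String {
    val bytes = ByteArray(length)
    get(bytes)
    return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
}

fun binaryFile(path: String, callback: ByteBuffer.() -> Unit) {
    ByteBuffer.wrap(File(path).readBytes()).callback()
}

fun printTag(path: String) {
    binaryFile(path) {
        position(limit() - 128)
        string(3) // the "TAG" literal
        println("Title: ${string(30)}")
        println("Artist: ${string(30)}")
        println("Album: ${string(30)}")
        println("Year: ${string(4)}")

        // Still the mark-skip-read-reset dance, but skip() reads better
        mark()
        skip(28)
        val zeroByte = byte()
        reset()

        if (zeroByte == 0.toByte()) {
            println("Comment: ${string(28)}")
            skip(1) // the zero byte
            println("Track Number: ${byte()}")
        } else {
            println("Comment: ${string(30)}")
        }
        println("Genre: ${byte()}")
    }
}
```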
There are a couple more useful instructions we can introduce here. That mark-skip-read-reset lump of code is a bit ugly. What we’re really doing there is taking a peek a few bytes ahead. To make it interesting, we can genericise that to allow a caller to move in the stream, do anything they want (read a byte, a string, an int etc.), and then return a value, having reset the position to where it was before they started. That calls for passing another function with receiver. In that function, the caller should be able to write the same DSL, so the type signature for that function will stay the same as we use in the binaryFile method, except now that block will return an Any value instead of Unit.
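A minimal sketch of such a peek, assuming it is built on the buffer's own mark() and reset() (so a nested peek would overwrite the outer mark):

```kotlin
import java.nio.ByteBuffer

// Run `callback` against the buffer, then restore the position it started from
fun ByteBuffer.peek(callback: ByteBuffer.() -> Any): Any {
    mark()
    // `reset()` is not a member of Any, so it resolves to the outer ByteBuffer receiver
    return callback().apply { reset() }
}
```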
We use the apply idiom here. Without it, we would store the return value from the callback block in another variable, then do the reset(), then return the value. With apply, the reset() is performed before the value from callback is returned. Note that callback doesn’t have an explicit receiver - the receiver is this (the ByteBuffer), so it can be omitted to taste. Also note that inside the apply block, strictly this is the instance of Any returned from callback(), not the ByteBuffer. If we called this.reset(), the compiler would bork. But it’s also clever enough to infer that we’re calling reset() on this from the outer scope (the peek method). If you want to be explicit about it, you can use this@peek.reset(). Seeing as a large part of writing DSLs is controlling which receiver is in scope at any given point, it’s worth taking some time to ensure you understand these ideas.
We use it like this:
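In sketch form, with the helpers repeated for self-containment and printTag as a hypothetical wrapper, the four-line look-ahead shrinks to a single expression:

```kotlin
import java.io.File
import java.nio.ByteBuffer

// Helpers repeated for self-containment (assumed shapes)
fun ByteBuffer.byte(): Byte = get()
fun ByteBuffer.skip(count: Int) { position(position() + count) }
fun ByteBuffer.string(length: Int): String {
    val bytes = ByteArray(length)
    get(bytes)
    return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
}
fun ByteBuffer.peek(callback: ByteBuffer.() -> Any): Any {
    mark()
    return callback().apply { reset() }
}
fun binaryFile(path: String, callback: ByteBuffer.() -> Unit) {
    ByteBuffer.wrap(File(path).readBytes()).callback()
}

fun printTag(path: String) {
    binaryFile(path) {
        position(limit() - 128)
        string(3) // the "TAG" literal
        println("Title: ${string(30)}")
        println("Artist: ${string(30)}")
        println("Album: ${string(30)}")
        println("Year: ${string(4)}")

        // Look 28 bytes ahead; the position is restored afterwards
        val zeroByte = peek { skip(28); byte() }

        if (zeroByte == 0.toByte()) {
            println("Comment: ${string(28)}")
            skip(1) // the zero byte
            println("Track Number: ${byte()}")
        } else {
            println("Comment: ${string(30)}")
        }
        println("Genre: ${byte()}")
    }
}
```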
At this juncture, we’ll take a slight detour. So far, we’ve just been defining extensions on ByteBuffer. But soon we may need to store some state of our own, and perhaps methods that don’t really relate to the buffer directly. Also, our extension functions will be available outside of our DSL, so we’re leaking scope a bit. We’re going to define our own class that holds on to the instance of the buffer, and delegates calls accordingly. The binaryFile method changes to make the new BinaryFile class the receiver of the callback function, and the extension methods that we defined on ByteBuffer just become normal members of the BinaryFile class. This means that the BinaryFile class will encapsulate the methods of our DSL, which seems like The Right Thing.
By adding passthrough methods for position() and limit(), we can make this refactoring without needing to change the DSL in our main method. But because limit() isn’t very domain specific, we’ll rename it to fileLength(), and in the original SPIFF, position() was called .jump, so we’ll copy that.
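A sketch of what the class might look like after the renames. The member names follow the prose; the private buffer property, the string decoding and the way the file is loaded are assumptions.

```kotlin
import java.io.File
import java.nio.ByteBuffer

// The DSL's receiver: wraps the buffer and exposes only domain-specific instructions
class BinaryFile(private val buffer: ByteBuffer) {
    fun fileLength(): Int = buffer.limit()               // was limit()
    fun jump(position: Int) { buffer.position(position) } // was position(Int)
    fun skip(count: Int) = jump(buffer.position() + count)
    fun byte(): Byte = buffer.get()

    fun string(length: Int): String {
        val bytes = ByteArray(length)
        buffer.get(bytes)
        return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
    }

    // The block sees the same DSL, because its receiver is also a BinaryFile
    fun peek(callback: BinaryFile.() -> Any): Any {
        buffer.mark()
        return callback().apply { buffer.reset() }
    }
}

fun binaryFile(path: String, callback: BinaryFile.() -> Unit) {
    BinaryFile(ByteBuffer.wrap(File(path).readBytes())).callback()
}
```

Because the extension functions are gone, byte(), string() and friends are no longer visible on every ByteBuffer in the program, only inside a binaryFile block.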
The final thing we’ll do for today is add a datatype for a literal string. If the specification calls for a literal string at a position, and that string isn’t found, we should clearly stop parsing and throw an exception.
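A sketch of such an instruction, shown on a cut-down BinaryFile so it stands alone. The name literal and the choice of IllegalStateException (via check) are assumptions; any exception that halts parsing would serve.

```kotlin
import java.nio.ByteBuffer

// Cut-down receiver class with just enough to demonstrate the literal instruction
class BinaryFile(private val buffer: ByteBuffer) {
    fun string(length: Int): String {
        val bytes = ByteArray(length)
        buffer.get(bytes)
        return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
    }

    // Read exactly as many bytes as the expected text and fail if they differ
    fun literal(expected: String) {
        val position = buffer.position()
        val bytes = ByteArray(expected.length)
        buffer.get(bytes)
        val actual = String(bytes, Charsets.ISO_8859_1)
        check(actual == expected) {
            "Expected '$expected' at position $position but found '$actual'"
        }
    }
}
```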
which we can use thus:
Try changing “TAG” to something else, and you should see it fail.
Here’s our final DSL for now:
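Pulling the pieces together, a sketch of the whole thing might look like this. As before, it is a reconstruction under assumptions: the printTag wrapper, the string decoding and the exception type are illustrative choices rather than the definitive listing.

```kotlin
import java.io.File
import java.nio.ByteBuffer

class BinaryFile(private val buffer: ByteBuffer) {
    fun fileLength(): Int = buffer.limit()
    fun jump(position: Int) { buffer.position(position) }
    fun skip(count: Int) = jump(buffer.position() + count)
    fun byte(): Byte = buffer.get()

    fun string(length: Int): String {
        val bytes = ByteArray(length)
        buffer.get(bytes)
        return String(bytes, Charsets.ISO_8859_1).trim('\u0000', ' ')
    }

    fun literal(expected: String) {
        val bytes = ByteArray(expected.length)
        buffer.get(bytes)
        val actual = String(bytes, Charsets.ISO_8859_1)
        check(actual == expected) { "Expected '$expected' but found '$actual'" }
    }

    fun peek(callback: BinaryFile.() -> Any): Any {
        buffer.mark()
        return callback().apply { buffer.reset() }
    }
}

fun binaryFile(path: String, callback: BinaryFile.() -> Unit) {
    BinaryFile(ByteBuffer.wrap(File(path).readBytes())).callback()
}

fun printTag(path: String) {
    binaryFile(path) {
        jump(fileLength() - 128)
        literal("TAG")
        println("Title: ${string(30)}")
        println("Artist: ${string(30)}")
        println("Album: ${string(30)}")
        println("Year: ${string(4)}")

        // Assume v1.1 if the 29th comment byte is zero
        val zeroByte = peek { skip(28); byte() }
        if (zeroByte == 0.toByte()) {
            println("Comment: ${string(28)}")
            skip(1) // the zero byte
            println("Track Number: ${byte()}")
        } else {
            println("Comment: ${string(30)}")
        }
        println("Genre: ${byte()}")
    }
}
```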
Hopefully you’ll agree that it’s not a million miles away from the original SPIFF version. Bear in mind that we haven’t done any “DSL magic” here. Simply using common features of the language, namely extension functions and passing functions with receivers, we have achieved code that is concise and readable, which is really all a DSL is.
In part II, we’ll edge things a little bit closer to the SPIFF version, using some operator overloading.