Kotlin: Refactoring to a DSL

There’s a lot of interest in Kotlin right now, thanks to Google’s involvement, and rightfully so. My own voyage with Kotlin started a couple of years ago, when I tried it out by porting an old side project, SPIFF, the Simple Parser for Interesting File Formats. I was very quickly sold on Kotlin as whole chunks of code disappeared, and what was left became much more concise. How overjoyed I was, then, to discover over the last couple of months that for once I seem to have backed the right language horse. Blog posts about the killer features are ten-a-penny, so today we’re going to actually get hands-on, and see how we can refactor some imperative Kotlin code into something resembling a Domain Specific Language.

The aim of SPIFF is to provide a DSL for an executable specification of binary file formats. Examples of these sorts of files would be bitmaps, MP3s, or zip files. Parsing these files requires one to know the specification: for example, the file starts with a literal string, followed by 2 bytes that represent the length of the header, followed by a byte for the length of the description, followed by a string of that length, and so on. You often need some conditional computation, so the DSL needs to support some level of logic, branching and looping. In the original SPIFF, this DSL was implemented using JavaCC, and the specification file was “compiled” to a set of instruction objects that were executed at “runtime”. Here’s an example of reading ID3v1 tags from an MP3 file, using SPIFF’s DSL.

# fileLength is a special variable that is made available

# Jump to 128 bytes before the end of the file
.jump fileLength - 128

string('TAG')  tagLiteral
string(30) title
string(30) artist
string(30) album
string(4)  year

# Tricky! ID3v1.1 hijacks the last two bytes of the comment field, one for a null byte
# and one for the track number

# Assume it's v1.1
string(28) comment
byte       zeroByte

# if the last byte was not a null byte, it must have been a v1.0 comment
.if(zeroByte != 0) {
	# go back to where the comment started, and read 30 bytes instead
	.jump &comment
	string(30) comment
} .else {
	# otherwise just carry on
	byte		trackNumber
}
byte 		genre

Hopefully it’s straightforward to understand. The only thing that is perhaps not immediately obvious is that prefixing a variable with an ampersand, such as in &comment, returns the “address” from which that variable was last read. This can be useful to jump back to a particular place in the file, or to calculate how many bytes have been read since a particular point, for instance if a section is padded to a particular length. For the record, SPIFF does not have any variable scope - if the same name is used for a variable in two places, or it’s used in a loop, it simply holds the last value that was read.
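If the “address” idea is hard to picture, here is a rough Kotlin model of it. This is only a sketch to illustrate the semantics described above, not how SPIFF is actually implemented; the names Reading, Readings, addressOf and valueOf are all made up for this example.

```kotlin
import java.nio.ByteBuffer

// Hypothetical model: each named value remembers the position it was read from.
data class Reading(val value: Int, val address: Int)

class Readings(private val buffer: ByteBuffer) {
    private val byName = mutableMapOf<String, Reading>()

    // Read one byte and store it under `name`, overwriting any earlier reading
    // (there is no scope, so the last value read wins).
    fun byte(name: String): Int {
        val address = buffer.position()
        val value = buffer.get().toInt()
        byName[name] = Reading(value, address)
        return value
    }

    // Rough equivalent of SPIFF's &name: where was this variable last read from?
    fun addressOf(name: String): Int = byName.getValue(name).address

    // Rough equivalent of using the variable: the last value read under that name.
    fun valueOf(name: String): Int = byName.getValue(name).value
}
```

Reading the same name twice simply replaces the earlier entry, which matches the "last value wins" behaviour.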

Our aim today is to implement the same code, but in Kotlin. Better than that, we’re going to start from a naïve piece of code for reading such files, and refactor our way to greatness! Let’s see how close we can get to the original SPIFF format.

Here’s that first implementation. Hopefully you’ll see at least one or two very obvious improvements we can make. If you’re following along, change the MP3 to a song of your choosing (but you won’t regret it if you seek out a version of Shaking Through).

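import java.io.File
import java.io.FileInputStream
import java.nio.ByteBuffer
import kotlin.system.exitProcess

fun main(args: Array<String>) {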

    // get the file, and use NIO to read it into a ByteBuffer
    val file = File("ShakingThrough.mp3")
    val channel = FileInputStream(file).channel

    val buffer = ByteBuffer.allocate(file.length().toInt())
    channel.read(buffer)

    //flip() puts us back at the start of the buffer, ready to read
    buffer.flip()

    //ID3 tags occupy the last 128 bytes of the file
    buffer.position(file.length().toInt() - 128)

    val tagArray = ByteArray(3)
    buffer.get(tagArray)
    val tag = String(tagArray).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }
    if(tag != "TAG") exitProcess(1)

    val titleArray = ByteArray(30)
    buffer.get(titleArray)
    val title = String(titleArray).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }

    val artistArray = ByteArray(30)
    buffer.get(artistArray)
    val artist = String(artistArray).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }

    val albumArray = ByteArray(30)
    buffer.get(albumArray)
    val album = String(albumArray).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }

    val yearArray = ByteArray(4)
    buffer.get(yearArray)
    val year = String(yearArray).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }

    //mark() sets a point we can reset() to later
    buffer.mark()

    //skip ahead and get the 29th byte
    buffer.position(buffer.position() + 28)
    val zeroByte = buffer.get()

    //go back to where we were
    buffer.reset()

    //if it's zero, this is (or might be) ID3 v1.1, and the comment is 28 bytes
    val comment = if(zeroByte.toInt() == 0x00) {
        val commentArray = ByteArray(28)
        buffer.get(commentArray)
        String(commentArray).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }
    } else {
        //if it's not zero, it's definitely ID3 v1, and the comment is 30 bytes
        val commentArray = ByteArray(30)
        buffer.get(commentArray)
        String(commentArray).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }
    }

    //for ID3 v1.1, we get a byte for the track number after the comment
    val trackNumber = if(zeroByte.toInt() == 0x00) {
        buffer.position(buffer.position() + 1)
        buffer.get().toInt()
    } else {
        0
    }

    val genre = buffer.get().toInt()

    println("""
        Title: $title
        Artist: $artist
        Album: $album
        Year: $year

        Comment: $comment
        Track Number: $trackNumber
        Genre: $genre
    """)
}

Run it, and you’ll get this (or something similar for your own file):

        Title: Shaking Through
        Artist: R.E.M.
        Album: Murmur
        Year: 1983

        Comment: 
        Track Number: 10
        Genre: 17

Hey, it works! You can sort of see how the code maps to the original SPIFF specification, with some slight changes when reading the comment so that we can take advantage of the fact that, in Kotlin, if is an expression that returns a value, which means comment can be an immutable val.
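In case the if-as-expression point is new to you, here it is in isolation. The commentLength function is a hypothetical helper that isn't part of the code above; it just shows a value falling out of each branch.

```kotlin
// Hypothetical helper: pick the comment length from the byte at offset 28.
fun commentLength(zeroByte: Int): Int =
    if (zeroByte == 0) {
        28 // looks like ID3 v1.1: 28-byte comment, null byte, track number
    } else {
        30 // plain ID3 v1: the comment uses all 30 bytes
    }
```

The last expression in each branch becomes the value of the whole if, so there is no need for a var that gets assigned later.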

If you’ve got any sort of spidey sense, you’ll already be itching to refactor that duplicated lump of code that reads a string from the buffer. ByteBuffer already has methods to read primitive datatypes - getLong(), getInt() etc. - but doesn’t have any methods for reading Strings, and by now you’re possibly yelling “Extension functions” at the screen. Let’s do it.

fun ByteBuffer.getString(length: Int): String {
    val array = ByteArray(length)
    this.get(array)
    return String(array).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }
}

And we can use it in the main function:

    ...
    val tag = buffer.getString(3)
    if(tag != "TAG") exitProcess(1)

    val title = buffer.getString(30)
    val artist = buffer.getString(30)
    val album = buffer.getString(30)
    val year = buffer.getString(4)
    ...
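If you'd like to poke at the extension without an MP3 to hand, something like the following should do. It's a sketch: sampleBuffer and its contents are invented test data, and I've written the null check as '\u0000' rather than 0x00.toChar(), which should behave the same.

```kotlin
import java.nio.ByteBuffer

// Same behaviour as the extension above, repeated so this snippet stands alone.
fun ByteBuffer.getString(length: Int): String {
    val array = ByteArray(length)
    this.get(array)
    return String(array).trim { char -> char.isWhitespace() || char == '\u0000' }
}

// Hypothetical test data: "TAG" followed by a 10-byte, null-padded title.
fun sampleBuffer(): ByteBuffer {
    val bytes = "TAG".toByteArray(Charsets.US_ASCII) +
        "Murmur".toByteArray(Charsets.US_ASCII) +
        ByteArray(4) // four 0x00 padding bytes
    return ByteBuffer.wrap(bytes)
}
```

Calling getString(3) and then getString(10) on sampleBuffer() should give back "TAG" and "Murmur", with the padding trimmed away.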

So far, so un-DSL-y, but much more terse and readable. Next, we’ll attack that block of set up code at the top. We want our DSL to be the specification of the file, not the nuts and bolts of how to actually get the content from a file. Let’s hide it in another method.

fun binaryFile(fileName: String, callback: (ByteBuffer) -> Unit) {
    val file = File(fileName)
    val channel = FileInputStream(file).channel

    val buffer = ByteBuffer.allocate(file.length().toInt())
    channel.read(buffer)

    buffer.flip()

    callback(buffer)
}

and we’ll use it like this

binaryFile("ShakingThrough.mp3") { buffer ->
    buffer.position(buffer.limit() - 128)

    ... do your stuff ...
}

In Java, you might have this method return the ByteBuffer, and then use it. In Kotlin, functions are first-class, so we can pass them around. More than that, if the last parameter of a function is itself a function, the lambda you pass for it can sit outside the parentheses. So binaryFile takes a function, into which we pass the resulting ByteBuffer for it to work on. That function is really just the code that was already in our main function, only now it’s within the scope of the binaryFile that we’re working on, and that’s not a bad thing, right? Note that because we only have access to the buffer object, we need to use buffer.limit() instead of file.length(). After flip(), limit() is the number of bytes that were read into the buffer, so for our purposes they are the same thing.
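Here's the trailing lambda syntax side by side with the "normal" call. This is a sketch rather than the code from this post: withFileBytes, fileSizeVerbose and fileSizeTrailing are hypothetical names, and I've taken the liberty of closing the channel with use and looping on read(), neither of which the original does.

```kotlin
import java.io.File
import java.io.FileInputStream
import java.nio.ByteBuffer

// Hypothetical variant of binaryFile that also closes the channel when done.
fun withFileBytes(fileName: String, callback: (ByteBuffer) -> Unit) {
    val file = File(fileName)
    val buffer = ByteBuffer.allocate(file.length().toInt())
    FileInputStream(file).channel.use { channel ->
        // read() may return before the buffer is full, so keep going until
        // it is full or we hit end of file (signalled by -1).
        while (buffer.hasRemaining() && channel.read(buffer) != -1) { }
    }
    buffer.flip()
    callback(buffer)
}

// Lambda passed inside the parentheses, like any other argument.
fun fileSizeVerbose(fileName: String): Int {
    var size = 0
    withFileBytes(fileName, { buffer -> size = buffer.limit() })
    return size
}

// Exactly the same call, with the lambda moved outside the parentheses.
fun fileSizeTrailing(fileName: String): Int {
    var size = 0
    withFileBytes(fileName) { buffer -> size = buffer.limit() }
    return size
}
```

Both functions do the same thing; the second form is the one that starts to read like a block of a language rather than a method call.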

If you’re playing along at home, you’ll see that the code in that block is starting to look a bit more like a specification. But we still have a lot of references to buffer everywhere, which is an implementation detail that the specification shouldn’t really care about. Functions with receivers to the rescue! A function with receiver is just a function inside which this refers to an instance of the type preceding the dot. You already do this without knowing it when you use extension functions. So String.() -> Int is a function in which this will be a String, and which will return an Int. It doesn’t need to be a method already defined on that type - because you’re generally passing these as code blocks to another function, you can consider them as anonymous extension functions. Anywhere you pass a function with receiver, you could instead pass a function reference to a matching member or extension function of the receiver type. In the case of String.() -> Int, you could pass String::toInt.
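A tiny example of String.() -> Int in action might help. The measure function and its two callers are hypothetical, made up purely to show both ways of supplying the argument.

```kotlin
// Hypothetical helper: takes a function with receiver and runs it against
// a String. Inside `block`, `this` is the String.
fun measure(text: String, block: String.() -> Int): Int = text.block()

// A lambda: `length` here means this.length, with `this` left implicit.
fun lengthViaLambda(text: String): Int = measure(text) { length }

// A function reference with a matching shape works in the same position.
fun parsedViaReference(text: String): Int = measure(text, String::toInt)
```

Note how the lambda never mentions the String it is working on. That disappearing act is what we're about to exploit.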

Instead of passing the ByteBuffer as a parameter to the code block, let’s make it the receiver. We just change the type signature of the callback parameter, and instead of callback(buffer), we do buffer.callback().

fun binaryFile(fileName: String, callback: ByteBuffer.() -> Unit) {
    val file = File(fileName)
    val channel = FileInputStream(file).channel

    val buffer = ByteBuffer.allocate(file.length().toInt())
    channel.read(buffer)

    buffer.flip()

    buffer.callback()
}

Because the ByteBuffer is now this in the code block, and this is implicit, we can just take away all the references to buffer

binaryFile("ShakingThrough.mp3") {
    position(limit() - 128)

    val tag = getString(3)
    if(tag != "TAG") exitProcess(1)

    val title = getString(30)
    val artist = getString(30)
    val album = getString(30)
    val year = getString(4)

    mark()
    position(position() + 28)
    val zeroByte = get()
    reset()

    ...
}

Okay, now we’re really starting to get somewhere! What next? Well, that get() is a bit obscure now. It gets a single byte from the buffer. In the original SPIFF DSL, it’s a byte instruction. Methods that read from the buffer should represent the datatype you’re fetching, so get() becomes byte(), getString() becomes just string() etc. We can just define these as extensions on ByteBuffer. We’ll also add a skip() instruction to replace that unwieldy statement that moves 28 bytes ahead.

fun ByteBuffer.string(length: Int): String {
    val array = ByteArray(length)
    this.get(array)
    return String(array).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }
}

fun ByteBuffer.byte(): Int = get().toInt()

fun ByteBuffer.skip(length: Int) = position(position() + length)

which gives us:

...
val title = string(30)
val artist = string(30)
val album = string(30)
val year = string(4)

mark()
skip(28)
val zeroByte = byte()
reset()
...
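One thing to watch with byte(): Kotlin's Byte is signed, so toInt() turns any raw byte above 0x7F into a negative number. A genre or track number above 127 would come out negative. If that matters for your files, a masked read is one option. This is a sketch, and unsignedByte is a hypothetical name that isn't used elsewhere in this post.

```kotlin
import java.nio.ByteBuffer

// As defined above: sign-extends, so the raw byte 0xFF comes back as -1.
fun ByteBuffer.byte(): Int = get().toInt()

// Hypothetical alternative: mask off the sign extension to stay in 0..255.
fun ByteBuffer.unsignedByte(): Int = get().toInt() and 0xFF
```

For the values in our example file (track 10, genre 17) the two behave identically, so the rest of the post sticks with the simpler version.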

There are a couple more useful instructions we can introduce here. That mark-skip-read-reset lump of code is a bit ugly. What we’re really doing there is taking a peek a few bytes ahead. To make it interesting, we can genericise that to allow a caller to move ahead in the buffer, do anything they want (read a byte, a string, an int etc.), and then return a value, with the position reset to where it was before they started. That calls for passing another function with receiver. In that function, the caller should be able to write the same DSL, so the type signature for that function will stay the same as the one we use in the binaryFile method, except that now the block will return an Any value instead of Unit.

fun ByteBuffer.peek(length: Int, callback: ByteBuffer.() -> Any): Any {
    mark()
    skip(length)
    return callback().apply {
        reset()
    }
}

We use the apply idiom here. Without it, we would store the return value from the callback block in another variable, then do the reset(), then return the value. With apply, the reset() is performed before the value from callback is returned. Note that the call to callback doesn’t need an explicit receiver - the receiver is this (the ByteBuffer), so it can be omitted to taste. Also note that inside the apply block, strictly this is the instance of Any returned from callback(), not the ByteBuffer. If we called this.reset(), the compiler would complain, because Any has no reset() method. But it’s also clever enough to work out that an unqualified reset() must mean the one on this from the outer scope (the peek method). If you want to be explicit about it, you can use this@peek.reset(). Seeing as a large part of writing DSLs is controlling which receiver is in scope at any given point, it’s worth taking some time to ensure you understand these ideas.

We use it like this:

val zeroByte = peek(28) {
    byte()
}
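For comparison, here's roughly what peek looks like written out longhand, without apply. I've also made it generic in this sketch, so the caller gets back whatever type the block returns rather than Any. The name peekTyped and the type parameter are my own additions, not part of the code above.

```kotlin
import java.nio.ByteBuffer

// Hypothetical generic variant of peek: the block's result type T is kept,
// so peekTyped(2) { get() } is a Byte rather than an Any.
fun <T> ByteBuffer.peekTyped(length: Int, callback: ByteBuffer.() -> T): T {
    mark()                         // remember where we started
    position(position() + length)  // move ahead
    val result = callback()        // run the caller's block at the new position
    reset()                        // go back to the mark
    return result                  // then hand the value back
}
```

The three lines at the end are what apply squeezes into one expression. Which you prefer is largely a matter of taste.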

At this juncture, we’ll take a slight detour. So far, we’ve just been defining extensions on ByteBuffer. But soon we may need to store some state of our own, and perhaps methods that don’t really relate to the buffer directly. Also, our extension functions will be available outside of our DSL, so we’re leaking scope a bit. We’re going to define our own class that holds on to the instance of the buffer, and delegates calls accordingly. The binaryFile method changes to make the new BinaryFile class the receiver of the callback function, and the extension methods that we defined on ByteBuffer just become normal members of the BinaryFile class. This means that the BinaryFile class will encapsulate the methods of our DSL, which seems like The Right Thing.

fun binaryFile(fileName: String, callback: BinaryFile.() -> Unit) {
    val binaryFile = BinaryFile(fileName)
    binaryFile.callback()
}

class BinaryFile(fileName: String) {
    val buffer: ByteBuffer

    init {
        val file = File(fileName)
        val channel = FileInputStream(file).channel

        buffer = ByteBuffer.allocate(file.length().toInt())
        channel.read(buffer)

        buffer.flip()
    }

    fun string(length: Int): String {
        val array = ByteArray(length)
        buffer.get(array)
        return String(array).trim { char -> char.isWhitespace() or (char == 0x00.toChar()) }
    }

    fun byte(): Int = buffer.get().toInt()

    fun skip(length: Int) = buffer.position(buffer.position() + length)

    fun peek(length: Int, callback: BinaryFile.() -> Any): Any {
        buffer.mark()
        skip(length)
        return callback().apply {
            buffer.reset()
        }
    }

    fun jump(pos: Int) = buffer.position(pos)

    fun fileLength() = buffer.limit()
}

By adding passthrough methods for position() and limit(), we can make this refactoring without needing to change the DSL in our main method. But because limit() isn’t very domain specific, we’ll rename it to fileLength(), and in the original SPIFF, position() was called .jump, so we’ll copy that.

The final thing we’ll do for today is add a datatype for a literal string, as another method on the BinaryFile class. If the specification calls for a literal string at a position, and that string isn’t found, we should clearly stop parsing and throw an exception.

fun literalString(expected: String): String {
    val actual = string(expected.toByteArray().size)
    if(actual != expected) throw RuntimeException("Expected literal $expected, found $actual")
    return actual
}

which we can use thus:

binaryFile("ShakingThrough.mp3") {
    jump(fileLength() - 128)

    val tag = literalString("TAG")

    val title = string(30)
    ...
}

Try changing “TAG” to something else, and you should see it fail.
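If you want to see the failure without editing your spec, here's a cut-down, in-memory stand-in you can feed arbitrary bytes. It's a sketch only: InMemoryFile, inMemoryFile and failureMessage are hypothetical names, and the class carries just enough of BinaryFile to exercise the literal check.

```kotlin
import java.nio.ByteBuffer

// Hypothetical in-memory stand-in for BinaryFile, so the failure path can be
// exercised without an MP3 on disk.
class InMemoryFile(bytes: ByteArray) {
    private val buffer: ByteBuffer = ByteBuffer.wrap(bytes)

    fun string(length: Int): String {
        val array = ByteArray(length)
        buffer.get(array)
        return String(array).trim { char -> char.isWhitespace() || char == '\u0000' }
    }

    fun literalString(expected: String): String {
        val actual = string(expected.toByteArray().size)
        if (actual != expected) throw RuntimeException("Expected literal $expected, found $actual")
        return actual
    }
}

// Same shape as binaryFile: the block runs with the InMemoryFile as receiver.
fun inMemoryFile(bytes: ByteArray, callback: InMemoryFile.() -> Unit) {
    InMemoryFile(bytes).callback()
}

// Returns the exception message if the spec fails, or null if it passes.
fun failureMessage(bytes: ByteArray, literal: String): String? =
    try {
        inMemoryFile(bytes) { literalString(literal) }
        null
    } catch (e: RuntimeException) {
        e.message
    }
```

Given bytes that start with "TAG", asking for the literal "TAG" should return null, while asking for anything else should return the "Expected literal ..." message.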

Here’s our final DSL for now:

binaryFile("ShakingThrough.mp3") {
    jump(fileLength() - 128)

    val tag = literalString("TAG")

    val title = string(30)
    val artist = string(30)
    val album = string(30)
    val year = string(4)

    val zeroByte = peek(28) {
        byte()
    }

    val comment = if(zeroByte == 0x00) {
        string(28)
    } else {
        string(30)
    }

    val trackNumber = if(zeroByte == 0x00) {
        skip(1)
        byte()
    } else {
        0
    }

    val genre = byte()

    println("""
        Title: $title
        Artist: $artist
        Album: $album
        Year: $year

        Comment: $comment
        Track Number: $trackNumber
        Genre: $genre
    """)
}

Hopefully you’ll agree that it’s not a million miles away from the original SPIFF version. Bear in mind that we haven’t done any “DSL magic” here. Simply using common features of the language, namely extension functions and passing functions with receivers, we have achieved code that is concise and readable, which is really all a DSL is.

In part II, we’ll edge things a little bit closer to the SPIFF version, using some operator overloading.