neversleeps.moscow

by Nickita Biryulin

I wrote a parser for Redis protocol so you don't have to

There are a lot of "Write your own Redis" guides online. The general idea is neat: you get to write a simple KV storage and a network interface for it in a guided fashion. You will work with network and file I/O, do some bytes encoding to support Redis wire protocol, write a simple data structure to hold your data. It's a nice way to get acquainted with a new language.

In this spirit, I decided to write my own Redis. To be honest, I just wanted to implement a Log-Structured Merge Tree, but it occurred to me that it would be nicer with an interactive interface, so I could use it from other software. My plan was simple: Redis is an established product, it's protocol (also called RESP - Redis Serialization Protocol) is well-documented and has to be relatively simple. Considering amount of "Redis-like" software out there, I expected to find a dozen different parsers to choose from.

However, I soon found out that this is not the case. At least for Go, there are a couple projects on Github, but they all share some of these properties:

Integrated into existing databases, so you can't just use them as a library
Archived 7 years ago
Have little to none testing
Don't implement half the specification

In the end, I shelved the whole LSM Tree idea and decided to write a RESP parser/encoder. Should be easy, after all.

The good

RESP basis

RESP is a text-based protocol. It's very easy to understand what is happening under the hood, and interacting with your server doesn't require any tools more complicated then netcat. It also doesn't have any complicated semantics. All types start with a single byte identifying a type, and end with a CRLF terminator (\r\n).

For instance, encoding of string MEOW looks like this: +MEOW\r\n. + means that all text until the next \r\n is a string. Simple enough. There also similar representations for errors (-ERROR\r\n) and integers (+42\r\n).

Some types have information about their length encoded. If you want to send some binary data, you have to use a "Bulk string" - you essentially tell your server that "Next n bytes of the message are data". Looks like $8\r\nDEADBEEF\r\n.

Finally, if you want to send an actual command to the server, you use an array. These are just a bunch of other RESP objects, along with their count. Simple Redis request can look like this: (in this and other examples, newlines, tabs and comments are added for readability)

*2\r\n         // Array of size 2
    +GET\r\n   // Simple string
    +MYKEY\r\n // Simple string

At first glance, all these objects are straightforward to implement. Using them you can easily build a nice KV interface. They are all part of RESP2 protocol, which has been supported by Redis and it's clients for a while. But there are some nice-to-haves missing: there are no defined ways to send a floating-point number, or a map type. Thankfully, people at Redis thought about that and gave us RESP3. It's an extension of RESP2 which should solve most of it's drawbacks.

RESP3

For one, now you have proper maps. They are pretty much as a sequence of Key-Value pairs. Simple and effective, I like it:

%2\r\n        // Map with 2 Key-Value pairs
    +foo\r\n  // Key 1, Simple string
    :0451\r\n // Value 1, Integer
    +bar\r\n  // Key 2, Simple string
    :4212\r\n // Value 2, Integer

You also get a double type, which you use to encode floating-point numbers. Nice stuff: ,33.77\r\n. You also have all your usual shenanigans, like ,-inf\r\n and ,nan\r\n.

However, RESP3 also added a bunch of other types. Some of those are not very elegant, while others highlight all the weird crutches you have to implement when you are developing a backwards-compatible database for 16 years straight.

The bad

Bulk Errors

Remember Bulk Strings? You should use those in two cases: either your string contains raw bytes, or has a length big enough to warrant knowing it's size before parsing. RESP3 introduces Bulk Errors: !3\r\nERR\r\n. Official docs describe it as "combining purpose of simple errors with the expressive power of bulk strings". To me, it looks like a solution for a problem which shouldn't exist: errors are here to inform user about something going wrong, they shouldn't contain bytes or be excessive in size.

Bools

You also get a proper Bool type. Looks nice: #t\r\n. The only problem is, Redis needed to return True/False long before RESP3, so there was already an obvious solution to a lack of Boolean: integers. One is true, zero is false, simple enough. It's not even some kind of optimisation: "True" is 4 bytes as RESP Boolean, and 4 bytes as RESP Integer (:1\r\n). So now you have two ways to express a boolean type, and for no good reason.

Integers

Speaking of integers: you now have "Big number" type. Same thing as "Integer", but now it can contain numbers bigger then int64: (3492890328409238509324850943850943825024385\r\n. Understandable solution to preserving backwards compatibility, but now you have two ways to send an integer.

Attributes

RESP3 also adds "Attributes". These are essentially a Map type, except that those contain additional properties of an object. Here is an official example (!):

|1\r\n                   // An Attribute map with 1 Key-Value Pair
    +key-popularity\r\n  // Key: Simple String
    %2\r\n               // Value: Map with 2 Key-Value pairs
        $1\r\na\r\n      // Key 1: Bulk string "a"
        ,0.1923\r\n      // Value 1: Float
        $1\r\nb\r\n      // Key 2: Bulk string "b"
        ,0.0012\r\n      // Value 2: Float
*2\r\n                   // An Array of size 2, to which Attributes are applied
    :2039123\r\n         // Integer
    :9543892\r\n         // Integer

You have an array of two numbers, and this array has an attribute "key-popularity", which describes popularity of it's contents (why are those attributes attached to the array instead of integers itself? I have no idea, but this is not related to RESP-itself, so we will ignore it).

Can you notice the issue? Normal RESP is parsed in a recursive fashion. You usually have an Array at the top, which contains some objects, which might contain more objects, and so on. Attributes suddenly add an exception to this rule: they don't "contain" the object they are describing, instead they are just attached before it. This is not that big of a deal, but it seems like a strange design choice.

Verbatim strings

Another addition are "Verbatim strings". They are described as "similar to the bulk string, with the addition of providing a hint about the data's encoding". This sounds very nice: I've worked with old projects, where UTF-8 is mixed with other weird encodings, and having an ability to mark the encoding of a certain string would be very useful.

Let's look at the official example: =15\r\ntxt:Some string\r\n. The txt part represents the text encoding. There are two problems with this approach:

Standard requires encoding be expressed in exactly 3 bytes. Encoding names are usually not 3 bytes (see: utf8, win1251).
Apparently, even if you were to cram your encoding name in this field, those are not actual encodings. Instead, this is a bit of information passed to the client to let it know how to present the data to the user.

So now you have another type of bulk string, which is supposed to have text encoding embedded. But instead of proper encoding, they are just putting an attribute describing how to best present this text. If only there were a special data type which would allow us to add additional attributes to objects! We could even call it attributes.

Sets and Pushes

Another structures introduced are "Set" and "Push". Sets are, as expected, a collection of elements with no particular order which is guaranteed to only have unique elements. Pushes are a way to sidestep traditional request-response model and push some data to the client. Let's compare representations of these data types with some test data:

Data Type	Representation
Array	`*2\r\n+FOO\r\n+BAR\r\n`
Set	`~2\r\n+FOO\r\n+BAR\r\n`
Push	`>2\r\n+FOO\r\n+BAR\r\n`

Those are all arrays, just with different first byte! I can get the excuse for Pushes, but I can't fathom why would you use Sets. They express exactly the same information as Arrays. Protocol declares them to be unordered and unique, but this should be a property of the function which returns a set, and not one of the protocol itself. You can send non-ordered, unique data with arrays just fine. You can also try to mess the "set" ordering and see which applications break first.

To sum up, RESP has 2 ways to express a string, 2 ways to express a error (one of which is cursed), 3 ways to express an integer (Integer, Bug Number and Double) and two ways to express a Bool (Integer and Boolean). You also get two types which are maps, but for different purposes (Map and Attribute), one of which is also cursed, and three types of Arrays (Array, Set and Push), one of which is fully redundant. However, that's not all!

The ugly

Nulls

Redis needed a way to express a "Nothingness", i.e. Null. When you fetch a key which doesn't exist, you shouldn't get an empty string as response (how would you differentiate that from an existing empty value?), but they also didn't want to send a error. Problem is, original RESP didn't have a type for Nulls, so they decided to do the simplest thing possible and use an existing type.

That is how "Null bulk string" was created: a string of size -1 which represents null ($-1\r\n). But one type of null was not enough, so some commands might return "Null Array" instead, as you probably guessed, it's an array of size -1: *-1\r\n.

RESP3 aimed to fix this mistake, and added a separate "Null" type. Simple enough: _\r\n. Problem is, other two types are still a part of the standard. Because one way to express nothingness is not enough, RESP has 3.

Streamed data types

RESP3 introduces the concept of "Streamed type". Let's say you are doing some slow search, and want to stream the data to the user in real time. You don't know the amount of items to be returned yet, to you can't use normal arrays or strings. Instead, you use aggregate types: these are essentially just arrays with ? as size. Their end is instead marked by a special object of type "END" (.\r\n). Looks simple enough, and can be useful in some applications. Example:

*?\r\n      // Array of undefined size
    :1\r\n  // Integer
    :2\r\n  // Integer
    .\r\n   // END Type

Problem is, it's not properly documented. They are mentioned in official Redis RESP description, but they are not described there. In fact, I only learned about those when writing this text, because I stumbled on some docs published on GitHub which describe it.

Conclusion

RESP is not hard to work with. However, it's not that well documented, has some weird data types and can represent the same information in a multitude of different ways. It works for Redis, but if you are building your own thing, whatever this thing may be, I suggest that you stay away from it.

If you still want to build something using RESP, and you are building it in Go, I did end up writing a parser/encoder for it: keddad/kresp. It is decently tested, and supports all of the shenanigans described above (except for streamed types, because they are not part of the spec at Redis website).