Labeled Data: Records and Variants

Records and variants are basic labeled data types. A record might be expressed as `{x:12.1, y:13.7}`, having a value for each label. A variant might be `Damage:11.3` or `Say:”Hello”`, and involves selection of one label and an associated value. We can generally combine these to have a variant of records or a record of variants. Labeled data is convenient insofar as they are self-documenting (human meaningful) and extensible. Most modern programming languages support labeled data types as a built-in feature. A few have explored fluent extensibility under the names “row polymorphism” or “polymorphic variants”.

But assume we have a much simpler language. This language readily supports simple binary products (a*b) and sums (a+b). A binary product is a simple pair of values like `{fst:12.1, snd:13.7}` while a binary sum is a simple choice of two like `Left:11.3` or `Right:”Hello”`. Effectively, we have spatial labels that are neither human meaningful nor extensible. For my Awelon project, this isn’t a hypothetical scenario. Awelon can readily Church encode binary products and sums, and can even provide a conventional syntactic projection, but has no built-in type-level support for labeled data.

Given simple binary products and sums, can we recover the benefits of built-in labeled data types?

I think we can do that and better. But first a short detour. Type `0` (aka void) has no values (perhaps excepting `error`). Type 0 is the identity type for sums, in the sense that type `(0+a)` or `(a+0)` is equivalent in many ways to just `a`. Similarly, type `1` (aka unit) is an identity type for products, and `(a*1)` or `(1*a)` is equivalent in many ways to just `a`. But there are some important *structural* differences. Any non-error value from `(0+a)` will be encoded as `Right:_`. With deep sum type structures such as `(0+(0+((0+(a+0))+0)))` we end up encoding values as `Right:Right:Left:Right:Left_`.

One idea, then, is to encode our label within this Left-Right-Left sequence. One viable encoding is `(RL|RR)*LL` where `RL` means `Right:Left:` and encodes a zero-bit, `RR` encodes a one-bit, and `LL` encodes sequence stop. (This particular encoding has a convenient property of being simple and self synchronizing.) Given our string of ones and zeroes, we may subsequently encode labels using ASCII or UTF-8. This gives us a basis for variants. Similarly, records can be represented by encoding a trie on the label we use for variants. For the earlier example, directly encoding the path would give us a corresponding structure `(1*(1*((1*(a*1))*1)))`. (We might explore alternative encodings to make it easier to extend or enumerate a record.)

Those 0 and 1 types essentially represent the ‘extensibility’ options for our variants and records. We get more choices for variants, more slots for records, while the label encoding itself ensures unbounded extensibility. Conveniently, we can also use a variant label to access a record, which provides a simple basis for ‘case’ or ‘switch’ expressions. We do additionally need syntactic sugar for working with these otherwise onerous structures, and likely a compiler that accelerates them. But there is little need for special type level support. And by modeling these records and variants, we can more readily get various polymorphic features, and the ability to work with labels themselves as first-class, typeable structures.

Anyhow, while I found this idea interesting, and I haven’t seen other people approaching labeled data this way, I wouldn’t be surprised if it’s mentioned in papers older than I am. I’ll be pursuing this further in Awelon project.

This entry was posted in Language Design, Types and tagged , , , , , , . Bookmark the permalink.

Leave a comment