Structure Editors for Wikilon

I have a new design that vastly improves upon my older extensible syntax and embedded literal objects approaches, and brings Wikilon many steps closer to my long term HCI goals for Awelon project. And it’s surprisingly simple!

Every word in the Wikilon dictionary is now defined by a pair: a compiler function, and a structured value, such that compiling the value returns a block – a first-class function – which is accepted as the meaning of the word. Both elements of this pair are encoded in Awelon Bytecode (ABC), albeit with access to the dictionary via unforgeable `{%foo}` tokens.

Structured values in ABC are constructed of only a few basic primitives:

  • numbers – rational, precise (unless weakened)
  • products (a*b) – pairs of values
  • sums (a+b) – ~ Haskell Either type
  • unit (1) and void (0) – structured identities
  • blocks [a→b] – first class functions
  • discretionary sealers {:fooType} – ~ Haskell newtype

These are the basic types. Floating point numbers might be requested via annotations, e.g. {&float}. Strong, cryptographically sealed values may be accessible via side-effects. But we don’t really need to consider anything beyond the basics for our dictionary and compilers.
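
For intuition, here is a minimal Haskell sketch of these value primitives – my own illustration, not part of the design – with constructor names chosen for readability:

    -- A minimal sketch of ABC's structured values; names are illustrative only.
    data Value
      = Number Rational          -- rational, precise unless weakened
      | Pair Value Value         -- products (a*b)
      | SumL Value | SumR Value  -- sums (a+b), ~ Haskell Either
      | Unit                     -- unit (1); void (0) has no values at all
      | Block [Op]               -- first-class function: a sequence of bytecodes
      | Sealed String Value      -- discretionary sealer, e.g. Sealed ":md" someText
      deriving (Show, Eq)

    data Op
      = Prim Char                -- a primitive bytecode such as 'v','r','w','l','c'
      | Tok String               -- a token such as {%foo}, {&float}, or {:md}
      | Lit Value                -- an embedded literal introduced by the block
      deriving (Show, Eq)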

The simplest possible compiler function is the identity function, which works when the structured value for the definition is just a block of code.

:swap [rwrwzwlwl]
:dup [r^zlwl]
:swapd [rw {%swap} wl]
:rot [{%swapd} {%swap}]
:dupd [rw {%dup} wl]
:over [{%dupd} {%swap}]

Conveniently, this trivial ‘identity function’ compiler corresponds closely to the (now deprecated) Awelon Object (AO) code. Compare AO code as it exists today:

@swap %rwrwzwlwl
@dup %r^zlwl
@swapd %rw swap %wl
@rot swapd swap
@dupd %rw dup %wl
@over dupd swap

Directly reading and writing this raw ABC in place of AO code is doable. Indeed, I’ll probably do some of that when getting started in this new model. However, the noise from `{%foo}` wrappers and inconvenient number encodings (e.g. `#1234#1000/*l` is the ABC equivalent of `1.234` in AO) would quickly grow wearisome. Fortunately, in the context of a structure editor, I think we could present blocks to the user as an AO variant, rendering `1.234` as appropriate, and perhaps fading out raw data plumbing code. And hyperlinking words, of course. :)

But we aren’t limited to the trivial compiler. We can use arbitrary, ad-hoc structures to represent our word definitions… and we can render the definitions appropriately, i.e. using lists and dictionaries and so on.

The `{:fooType}` discretionary value sealers provide a convenient mechanism to denote that a sealed substructure might benefit from a special rendering and editing widget. This can be very ad-hoc, driven by convention and reflection. E.g. if we want markdown text to use special syntax highlighting and editing tools, we could model this as a text literal followed by `{:md}`, and allow our editors to know what is expected in this case or reflect back into the dictionary searching for words named something like `edit:md`. This should achieve most of the advantages of embedded literal objects, while avoiding most of the weaknesses.

This new approach has many other important advantages:

First, the use of `{%foo}` tokens to access words makes global renaming of words trivial. Simple renaming is actually what led me to this design. My earlier approach to extensible syntax left word recognition to a user-defined parser function, which made it too difficult to recognize, refactor, or replace words in a generic way.

Second, definitions modeled this way are very scalable, because it’s easy to apply generic mechanisms for compression, linking, automatic refactoring, and structure sharing.

Third, it’s easy to save arbitrary data structures into the dictionary to simplify the flow of application-generated data (or even whole application-snapshots) back into reusable code.

Fourth, it’s easy to reuse compiler functions, or even embed them (e.g. constructing a pair of a foo value and the foo compiler), when modeling higher level languages. Thus, languages are readily compositional.

Fifth, with respect to my long term HCI goals, it’s easy to model user input as streaming code, functions that operate on the structured value. There is no need to repeatedly parse and pretty-print. We can abstract a stream into new user macros and editing tools, or as ad-hoc reusable functions.

This technique is, in a broad sense, similar to storing an AST in the database. But it also differs in ways that I consider interesting: opacity of block structures to the compiler function, easy support for user-macros to construct and manipulate the structured values, absence of predefined structure, and easy structured reuse of languages. I believe this new approach to extensible syntax and structured editing will prove a good long-term fit for Awelon project.

Posted in Language Design | 3 Comments

Haskell VCache 0.1

The first version of Haskell VCache is now available on Hackage. Before installing VCache, you may need to `sudo apt-get install lmdb-dev` on your Linux system, or install LMDB and its header files by some other means.

At this point, the primary features of VCache seem solid. PVars and VRefs are each first-class. The reference counting garbage collection is very simple. Structure sharing for values works. The concurrent aspects are kept simple, albeit bottlenecked at the moment by a single MVar. The batching mechanisms work very well. It seems easy to use, at least in ghci. But testing isn’t very rigorous.
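
For orientation, here is a small usage sketch – assuming the Database.VCache API roughly as documented on Hackage (`openVCache`, `loadRootPVar`, `runVTx`, `readPVar`, `writePVar`, `markDurable`); the file name and sizes are arbitrary, so consult the package docs for exact signatures:

    {-# LANGUAGE OverloadedStrings #-}
    -- A rough usage sketch, not authoritative documentation.
    import Database.VCache

    main :: IO ()
    main = do
      vc <- openVCache 100 "example.vcache"                  -- LMDB-backed file, ~100 MB limit
      let counter = loadRootPVar vc "counter" (0 :: Integer) -- named persistent variable
      n <- runVTx (vcache_space vc) $ do
             n <- readPVar counter
             writePVar counter (n + 1)
             markDurable                                     -- wait for sync before returning
             return n
      putStrLn ("previous value: " ++ show n)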

There are performance improvements to be had, especially with respect to cache management.

But if you’re interested in an alternative to acid-state that can support very large values and fine-grained, first-class mutable variables, then give VCache a try. Also, let me know where the documentation can be improved.

Posted in Language Design | 3 Comments

Extensible Syntax for Awelon Project

I’ve been exploring extensible language for almost as long as I’ve been interested in PL. A few years ago, I posted some constraints for user-defined syntax that I had developed c.2007. More recently, I’ve described an alternative to DSLs based on embedding applet-like objects as literals in source code. The latter is especially interesting because it might enable ad-hoc graphical languages for problems like building game worlds, composing music, or expressing graph literals.

Despite my interest, it remains unclear to me whether this is a “good idea”. Extensible syntax presents significant challenges for tooling (e.g. reporting errors, syntax highlighting), and can result in a steep learning curve for programmers. So, when I started Awelon project, I chose to table efforts on this subject.

With the ongoing development of Wikilon, I feel it’s time to explore extensible syntax more seriously. And I’m also interested in the possibility of unifying user-defined syntax with embedded literal objects.

Summary of constraints for extensible syntax:

  • limit syntax to boundaries of one module
  • use one short word to indicate syntax
  • avoid modifying syntax within a module
  • syntax should be modular, composable
  • syntax should be a library, not external tool
  • trace errors to whatever the programmer sees

The motivation for these constraints is primarily to minimize boilerplate, support “small” modules, simplify incremental compilation, and support tooling.

I plan to base Wikilon on Awelon Object (AO) code. Each AO word is a function, a module, and a page in the wiki. AO has a simplistic Forth-like syntax. A typical module is between five and fifteen words and literals. Many modules are just one line. A wiki with very small pages is viable if we allow the browser to open multiple pages on a single canvas, e.g. similar to CodeBubbles.

With DSLs, data modules, etc. some of these pages would become much larger than the typical AO word. But we should favor a low overhead so we can still use small modules even with alternative syntax. Using just one word per module to indicate language seems a nice target. If a language needs additional parameters, it may always parse them.

Module text would then be processed by the language. The exact nature of this processing is still an open question: we might translate the input to AO code, or to an AST, or leverage a monadic API to build some object representing the code’s meaning. I favor targeting an API: it better fits the ‘code as material’ metaphor, readily supports staged computation (i.e. using behaviors defined by other words), and can remain extensible and composable if we choose a good general model for languages (e.g. machines rather than parser combinators).

A concrete import/export format like the following might work well enough:

#AO
:benchmark 0 [4 +] 1000 1000 * repeat
:repeat assertNatural .rw [step.repeat] bind fixpoint .wl swap inline
:step.repeat rot dup 0 gt [drop3] [action.repeat] if_
:action.repeat dec unrot dip2 inline
:fixpoint .prebind_l .fixfirst
:.prebind_l %r [%l] %rol
:.fixfirst %r' [%^'ow^'zowvr$c] %r^owol
#COW
:fibonacci 
 MoO moO MoO mOo MOO OOM MMM moO moO
 MMM mOo mOo moO MMM mOo MMM moO moO
 MOO MOo mOo MoO moO moo mOo mOo moo

I favor flat, streamable formats like this because they’re good not just for import and export, but also for continuous updates and synchronization. This format could be easily extended with new toplevel operations, e.g. to add temporal information or author metadata, or to support diffs. I’m assuming that newlines in content are escaped by trivially following them with a space.
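
As a tiny illustration of that escaping rule, here is a Haskell sketch (mine, not part of the format) that splits such a stream into toplevel entries and unescapes the continuation lines:

    -- Split the streamable format into toplevel entries: a line beginning with
    -- a space continues the previous entry, and the marker space is dropped.
    splitEntries :: String -> [String]
    splitEntries = foldr join [] . lines
      where
        join ln []               = [ln]
        join ln (next : rest)
          | take 1 next == " "   = (ln ++ "\n" ++ drop 1 next) : rest
          | otherwise            = ln : next : rest

Entries beginning with `#` would then select a language, while entries beginning with `:` define words in that language.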

The above seems a good start, though I might favor plain English ‘define’ and ‘using’ instead of ‘:’ and ‘#’.

But can we also support nice graphical languages?

I’ve had some difficulty adapting ‘embedded literal objects’ to Wikilon. One of the big problems was identity: when objects are embedded at arbitrary locations in source code, it becomes difficult to name those objects, i.e. you’re stuck saying something like “the third embedded literal in function foo” and that name might not be very stable.

Naming objects is convenient for many reasons. For example, it simplifies recording ‘view’ information, e.g. such that clients might render multiple views of the same 3D scenegraph, each with a different rotation or camera position. Naming also simplifies streaming updates. For example, the browser might stream “I performed update XYZZY with parameters 1,2,3,4” updates to the named object, and the Wikilon server might stream rendering updates back to multiple views of that object. The ability to stream updates is valuable because embedded literal objects may grow large, and we’d rather avoid the costs of shuttling them between client and server or repeatedly recompiling them.

To support names, I must abandon the ‘literal’ aspect of embedded literal objects.

Instead, I would align each object with an AO word. This word becomes a simple, stable identity for the object. AO words serve yet another role: module, function, Wikilon page, and now interactive object or applet. The Wikilon page for each of these objects might present itself as an interactive web application with web sockets to maintain updates. However, unlike general purpose web apps, these objects would be self-contained, purely functional, highly modular. It would generally be easy to peek under the hood to see the code, or to ‘clone’ an object by creating a new word with the same initial code. Stable (bookmarkable) views of the object might be represented with a few query parameters in a URL (enabling some server-side rendering), or a little information in the anchor.

We need logic to render these objects, to receive updates, to generate AO code, to trace errors.

The language word seems a fine place to specify this additional logic. Rather than a function that simply compiles text, the language function must accept multiple methods, or alternatively construct an object (based on an initial text) that can handle various update methods and pretty-print an updated version of the source in the end. (The latter could be a lot more efficient when processing lots of updates.)

I’m a bit wary because it’s difficult to structurally guarantee or prove useful properties for languages like: “pretty printing then reconstructing is observationally equivalent to the identity function”. Fortunately, Wikilon does make it feasible to test properties like this against historical definitions and updates, and various dynamic tests can be applied systematically (e.g. does `parse print parse print` result in the same printed content and AO behavior each time?). I believe we can achieve a reasonable degree of confidence in our DSLs and embedded objects.

There are still many details to hammer out, such as what is a good type for the language function. But I think extensible syntax will be worth pursuing, especially if I can support graphical languages and flexible editors.
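
To make that open question concrete, one candidate shape for the language function might look like the following Haskell sketch – purely illustrative, with placeholder types:

    type AO     = String   -- placeholder: generated AO code
    type ErrMsg = String
    type Update = String   -- placeholder: an encoded update message from the editor

    -- A language turns initial module text into an object that can render
    -- itself, accept updates, compile to AO, and report errors.
    data LangObject = LangObject
      { render  :: String                              -- pretty-print the current source
      , update  :: Update -> Either ErrMsg LangObject  -- apply one update method
      , compile :: Either ErrMsg AO                    -- the word's compiled meaning
      }

    type Language = String -> Either ErrMsg LangObject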

Posted in Language Design | 3 Comments

VCache, an Acid-State Killer

Haskell has a convenient package called acid-state, which enables a Haskell value to be utilized as a database, essentially via the command pattern: updates are logged, and programmers can perform the occasional checkpoint; on restart, the checkpoint is loaded and updates are applied. In Haskell, treating the database as a value is convenient because it allows developers to build the domain model and operations upon it from pure functions, which are easy to reason about.

However, acid-state doesn’t scale nicely. Large checkpoints take a long time to load at startup. Large values in memory create a large burden for the Haskell garbage collector, which must occasionally perform a full sweep. Also, acid-state doesn’t compose very nicely. If you have two acid-state resources, there is no way to update both of them atomically. If you have some volatile storage for the particular process, you cannot easily keep it consistent with the acid-state resource.

Haskell has plenty of support for other databases, of course. So we can scale by switching to one of those. But this means accepting a lot of incidental complexity and semantic noise, giving up much of acid-state’s simplicity.

I want the simplicity of acid-state AND scalability. Oh, and composition would be nice, too.

To that end, I am currently developing VCache. The primary innovation of VCache is to support a clean separation of value references of type VRef vs. persistent variables of type PVar.

A VRef is a handle to an immutable value that is primarily represented in an auxiliary address space. Importantly:

  1. VRef’d values may contain VRefs, supporting tree structures and DAGs.
  2. Construction of a VRef and dereferencing values are pure computations.
  3. Dereferenced values are usually cached to reduce serialization overheads.
  4. VRefs are tracked via weak refs, and aren’t GC’d from the auxiliary space while the program still holds them.

VRefs can represent very large domain models without overly burdening system memory. Note that these models don’t need to be persistent. Use of VRefs to create very large, volatile models may be useful for some applications, e.g. for the same reasons you might want a temporary filesystem. Volatile values arise naturally with lazy evaluation, optimistic concurrency, constraint solving search, and so on.

A PVar is a variable named by bytestring, and may be loaded from one execution to another. The content of a PVar is the same class of values as a VRef, and thus PVars can contain very large domain models – much larger than system memory – assuming developers structure them appropriately.

PVars are implemented as thin wrappers around STM TVars, and may be updated transactionally with other PVars and TVars (arbitrary STM operations may be lifted into a PVar transaction). The resulting transactions are atomic, consistent, isolated, and optionally durable. (Durability simply involves waiting for a synchronization after running the transaction.)

Supporting STM transactions together with persistent variables is very convenient for ensuring the volatile elements of the application are consistent with the persistent elements. The transactions also support composition and extension – i.e. it’s easy to add new persistent features without trying to modify a global datatype. Optimistic concurrency does come with a few weaknesses of its own, of course. Large transactions may ‘starve’ if small, fast transactions repeatedly manipulate variables used by the larger transaction. But I believe this concern is easy to address at the application layer, e.g. by constructing queues and cooperation patterns around the more contentious resources.

The VCache auxiliary address space is backed by LMDB. LMDB is a read-optimized embedded key-value store that utilizes MVCC to ensure readers never wait on writers. LMDB is limited to one writer at a time… but, within that limit, it at least writes efficiently. Usefully, VCache mitigates LMDB’s few weaknesses. The Haskell STM layer easily supports concurrent read-write transactions. VCache will amortize synchronization costs by combining transactions that occur around the same time. Lazy construction of VRefs may avoid some writes entirely.

The VCache garbage collector uses reference counting at the LMDB layer, and operates incrementally and concurrently in a background thread. As our LMDB file scales to gigabytes or terabytes or larger, we never need to sweep the whole thing. When the application is restarted, the garbage collector can eliminate values that were held only by volatile VRefs on the previous run. Usefully, LMDB covers the whole compaction issue as part of maintaining the underlying B-trees.

With regards to value caching, programmers will have some heuristic control over cache behavior on a per-value basis, and the ability to indicate roughly how much memory to use (albeit based on the LMDB encoding rather than actual Haskell memory consumption). Values are also organized in VCache such that, at least for smaller values, structures allocated at around the same time will typically land on the same pages in LMDB. This allows effective use of smaller values.

VCache performs structure sharing at the VRef layer. This is useful because it saves space, because it may save some writes to disk, because it simplifies some forms of reasoning (e.g. about space costs for idempotent updates to a historical database), and because it allows comparisons of VRefs without loading the underlying values – e.g. such that we can efficiently perform a ‘diff’ on large trees that are mostly the same. When not used, the performance and space overhead should be minor. I feel that structure sharing is one of those features that’s easiest to leverage when it’s pervasive.

My plan is to use VCache to support my Wikilon project.

Wikilon has an unusual ‘file’ system: files consist of bytecode and represent purely functional objects. That bytecode may contain values. But, for very large values – e.g. a tree map with a thousand elements – I’d rather avoid loading and parsing the bytecode to construct that value in memory. Instead, I’ll include a simple escape in the bytecode to access the value as a VRef. This allows me to preserve the O(lg(N)) updates or queries on the map, as opposed to the O(N) serialization costs for the full map. Further, that map could easily be part of a query response or function input. VCache will make it possible to treat persistent objects in a first-class manner.
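
To illustrate why such an escape preserves O(lg(N)) updates, here is a standalone Haskell sketch of a tree whose children live behind references; with VCache one would instantiate the reference constructors with `vref` and `deref`. Names are illustrative:

    -- A search tree whose children sit behind refs: an update rebuilds (and
    -- re-serializes) only the O(lg N) path from root to leaf, not the whole map.
    data Tree r k v
      = Leaf
      | Node k v (r (Tree r k v)) (r (Tree r k v))

    insert :: Ord k
           => (Tree r k v -> r (Tree r k v))   -- mkRef, e.g. vref someVSpace
           -> (r (Tree r k v) -> Tree r k v)   -- getRef, e.g. deref
           -> k -> v -> Tree r k v -> Tree r k v
    insert mkRef getRef k v = go
      where
        go Leaf = Node k v (mkRef Leaf) (mkRef Leaf)
        go (Node k' v' l r)
          | k < k'    = Node k' v' (mkRef (go (getRef l))) r
          | k > k'    = Node k' v' l (mkRef (go (getRef r)))
          | otherwise = Node k v l r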

VCache is not ready for use. And it won’t see much work over the next couple weeks (I’m on vacation). But I’ve hammered out most of the details, I’ve written the Haskell bindings to LMDB, and I’ve started implementation. I’m excited about this. I think, if you’re a Haskell developer or are interested in creating very large Haskell apps, that maybe you should be excited about it, too.

Posted in Concurrency, Modularity, State | 4 Comments

Wikilon Filesystem

Wikilon isn’t just an online dictionary for AO code. It will offer a simple filesystem-like abstraction, allowing use as a software platform for creating interesting applications and services. Simple isn’t easy. I’ve spent a great many hours designing, refining, and backtracking on the subject over the last ten days.

A few of the design points I find interesting:

  • Stateful files are replaced by purely functional objects, i.e. closures with fixpoint behavior. An unsophisticated purely functional object is `µO.[arg→(O*result)]`. I plan to use the same model as for embedded literal objects, which cleanly separates queries from updates (see the sketch after this list).
  • Basic authorities aren’t just “read” and “write”, but also “update” and “query”. Read and write are still available, but operate at source level (reflection, direct manipulation), whereas update and query authorities use message passing and thus allow the object in question to protect its own invariants. This should simplify collaboration between software agents.
  • Usefully, state resources for RDP can essentially use the same model as state resources for imperative code, albeit with a small tweak: the update method takes a set of demands rather than a singular update message, and I must compute to a fixpoint.
  • Scripts can also be transparently used as resources, allowing ad-hoc adaptability and attenuation of the resource model. By ‘script’, I mean bytecode that is run when a query or update hits that resource, with the possibility of querying or updating other objects.
  • Updates are typically transactional (for imperative) or logically timed (for RDP) so this filesystem is a great deal easier to reason about than conventional filesystems, even with high levels of concurrency.
  • I’m making extensive use of logarithmic history. Users will be able to step back in time and briefly interact with historical snapshots. Even better, this will support retroactive testing, i.e. applying new tests on all available past states, and rather flexible debug modes where you can operate on past states.
  • Directory structure, relative to convention, is inverted. Every user has their own ‘root’, at least so far as they can tell. Public or shared directories might be mounted as subdirectories into each user’s workspace.
  • Directories have both internal ‘pet’ names, e.g. a file or directory named “bob”, and stable, opaque, external names based on, e.g. `HMAC(serverSecret,parent++"bob")`.
  • Capability secure. Capability strings provide authority to directories or objects, and may be attenuated and shared. In addition to an opaque GUID, each capability includes an authority code and an HMAC to guard the capability against forgery.
  • Simple tactic to mitigate supercalifragilisticexpialidociousID problem. See below.
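
As promised in the first bullet, a minimal Haskell sketch of the purely functional object model – the `µO.[arg→(O*result)]` shape, plus the query/update separation. Names are illustrative only:

    -- The unsophisticated form: one method, returning the next object and a result.
    newtype Obj arg result = Obj { step :: arg -> (Obj arg result, result) }

    -- With queries separated from updates, queries cannot change state by construction.
    data Obj2 qry upd ans = Obj2
      { query  :: qry -> ans                -- pure observation
      , update :: upd -> Obj2 qry upd ans   -- state transition to the next object
      }

    -- Example: a counter. Queries report the count; updates add to it.
    counter :: Integer -> Obj2 () Integer Integer
    counter n = Obj2 { query = \() -> n, update = \k -> counter (n + k) }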

The capabilities aspect has received much of my attention recently. I’ve determined that I can’t use ABC’s `[{tokens}]` because it creates aliasing problems for imperative code (resulting in race-conditions), and because it also muddles up cross-model queries between imperative transactions and RDP behaviors. Instead of tokens, I’m using strings together with a few annotations and sealers.

The capability strings look like `Mbdfghjkmnpqstxyz…` where `M` is an authority code, and the mess that follows is a bunch of base16 including GUID, HMAC, and potentially some path information. My goal is that this string should look exactly the same in your browser URL as within Wikilon.

Instead of permission flags, which allow invalid combinations, I’ll just use a short enumeration of valid codes: M: query+update, P: update, Q: query, R: read, S: read+update, T: read+write. (I’ve rejected a write-only option as unwise, though it could be modeled indirectly via script. In most cases, what you’d really want is update-only.)

Unfortunately, a conventional representation of GUID+HMAC gets to be pretty bulky.

I’ve found a simple tactic to mitigate this: GUID and HMAC can overlap a fair amount so long as I expose enough of the GUID (e.g. 32 bits) to filter down candidates. I can then XOR the candidate GUID with the string to recover the HMAC, then test the HMAC. Thus, a 192 bit GUID plus a 160 bit HMAC could share a space. (I wouldn’t be surprised if other people have been using this idea since before I was born. But it’s the first time I’ve thought of it.)
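
A Haskell sketch of the tactic, assuming a 192-bit GUID and a 160-bit HMAC; `lookupByPrefix` and `hmacOf` stand in for the real database lookup and HMAC-with-server-secret, and all sizes and names are illustrative:

    import qualified Data.ByteString as B
    import Data.Bits (xor)

    type GUID = B.ByteString  -- 24 bytes (192 bits)
    type HMAC = B.ByteString  -- 20 bytes (160 bits)

    -- Encode: expose the first 4 GUID bytes, XOR the HMAC into the remaining 20.
    encodeCap :: GUID -> HMAC -> B.ByteString
    encodeCap guid mac =
      let (exposed, rest) = B.splitAt 4 guid
      in exposed `B.append` B.pack (B.zipWith xor rest mac)

    -- Decode: use the exposed bytes to find candidate GUIDs, XOR each candidate
    -- against the string to recover an HMAC, and keep the candidate that verifies.
    decodeCap :: (B.ByteString -> [GUID])   -- lookupByPrefix (placeholder)
              -> (GUID -> HMAC)             -- hmacOf, using the server secret (placeholder)
              -> B.ByteString -> Maybe GUID
    decodeCap lookupByPrefix hmacOf cap =
      let (exposed, mixed) = B.splitAt 4 cap
          verifies guid = B.pack (B.zipWith xor (B.drop 4 guid) mixed) == hmacOf guid
      in case filter verifies (lookupByPrefix exposed) of
           (g : _) -> Just g
           []      -> Nothing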

A simple annotation like {&rsc} could greatly improve performance and ability to locate dead links in scripts or objects.

I’m still hammering out many details.

In addition to GUID and HMAC, I’m thinking I might want to keep some path information in some manner so as to support transitive revocations. One candidate mechanism is to list the first sixteen bits in each GUID in a path from root to the target GUID. If any of the elements on the path (most likely symbolic links) are disabled or now point elsewhere, we could disable the capability. Again, path information could partially overlap the GUID, and perhaps even a little with itself, while still supporting an efficient search. But it couldn’t overlap with the HMAC.

I’m also starting to think that acid-state isn’t going to scale (not with long logarithmic histories and goals like game development), and its interface doesn’t fit right for my use case. I’m now checking out TX, TCache, and Perdure. I’m hoping I won’t need to write my own.

Addendum: I speak more about the resource model on the captalk mailing list.

Posted in Language Design, Security, Stability | 2 Comments

Logarithmic History v3

Exponential decay, resulting in log-scale lists, is an excellent way to work with an unbounded amount of history in a bounded amount of space. I’ve described this in two previous articles [1][2]. In my first article, I modeled logarithmic decay via a sequence of ring buffers: each ring would lose (or merge) one of its items rather than passing it forward. In my second article, I expressed a concern that ring buffers might lose information that occurs at a regular frequency, and modeled the decay probabilistically instead. The latter ‘improved’ form of exponential decay is certainly easier to implement, and could be generalized in interesting ways (such as time-based collapse). But the original design had a nice property too: it was extremely predictable.

In this article, I describe an even simpler model:

  1. Choose a maximum number of entries for your history buffer.
  2. Whenever you reach the maximum, decimate your history.
  3. To decimate: Take groups of ten entries. From each group, kill one.

Trivial, eh?

The word ‘decimate’ means to kill one man in every ten to punish or subdue a group. This results in a 90% survival rate. Repeatedly decimating our history will give us exponential decay. Operating on groups of ten keeps the operation simple, and ensures the decay process is relatively uniform and predictable. Of course, the particular choice of tenths is arbitrary and could be adjusted without loss of generality.

The decision, then, is how to choose one element from each group to destroy.

The probabilistic approach is to roll a die.

The probabilistic approach can be deterministic in a simple way: use a hash function on each group of entries to roll a die for that group. This sort of technique would be convenient when working with replicated histories, distributed databases, and so on.
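
A small Haskell sketch of this deterministic decimation, where `hashGroup` stands in for any stable hash over a group of entries:

    -- Decimate: in each full group of ten, drop the entry selected by the hash.
    decimate :: ([a] -> Int) -> [a] -> [a]
    decimate hashGroup = concatMap dropOne . groupsOf 10
      where
        dropOne grp
          | length grp < 10 = grp              -- leave a partial final group alone
          | otherwise       = let i = hashGroup grp `mod` 10
                              in take i grp ++ drop (i + 1) grp
        groupsOf _ [] = []
        groupsOf n xs = let (g, rest) = splitAt n xs in g : groupsOf n rest

    -- Append an entry, decimating whenever the buffer would exceed its maximum.
    addEntry :: ([a] -> Int) -> Int -> a -> [a] -> [a]
    addEntry hashGroup maxLen x hist
      | length hist >= maxLen = decimate hashGroup (hist ++ [x])
      | otherwise             = hist ++ [x]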

A non-probabilistic model might heuristically select one entry to erase based on estimated loss of information. For example, if we’re erasing camera frames, we might favor deleting a frame that looks most similar to another frame in the group. Or if our history consists of a path of a robot, we might favor deleting entries where the robot was basically sitting still or moving in a straight line.

Of course, merging is also a fine idea. We could choose two items to merge, ideally in an associative manner such that merge(a,merge(b,c)) = merge(merge(a,b),c); this can be done with weighted averages, keeping counts and sums and sums-of-squares, and generally developing a monoidal type for history entries.
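
For example, a merge based on counts, sums, and sums of squares is associative, which is all we need for a well-behaved monoid of history entries – a quick Haskell sketch:

    -- A monoidal summary: merging is associative, and mean/variance are recoverable.
    data Summary = Summary { count :: !Double, total :: !Double, totalSq :: !Double }

    instance Semigroup Summary where
      Summary c1 s1 q1 <> Summary c2 s2 q2 = Summary (c1 + c2) (s1 + s2) (q1 + q2)

    instance Monoid Summary where
      mempty = Summary 0 0 0

    fromSample :: Double -> Summary
    fromSample x = Summary 1 x (x * x)

    mean :: Summary -> Double
    mean (Summary c s _) = s / c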

I developed this particular approach to logarithmic history to support Wikilon. Logarithmic history offers a happy medium between keeping a complete history and keeping only the most recent version of each page/word. It does have a few disadvantages. For example, you can’t get a “permanent link” to a historical version of a word. But you can at least browse historical versions of the wiki/dictionary, get a story for how a system evolved, and access old versions of ideas. And the long-term space requirements are very predictable.

Posted in Language Design | 1 Comment

Introducing Wikilon

I have decided to move forward with a wiki-based development environment for Awelon project, under the name Wikilon. Wikilon will be implemented as a Haskell warp web service using acid-state for persistence and websockets for liveness and reactivity.

Wikilon can overcome several limitations that chafe in the current text-editor + console REPL environment. Because Wikilon will be a persistent web service, it can take much more advantage of incremental and persistent computation – e.g. recompute test results only when the definition of a word used by that test has changed. Wikilon can also serve as a much better platform for long-running behaviors, thus allowing me to move forward with reactive demand programming. Finally, a web browser provides a richer canvas than a console REPL, enabling me to pursue graphical outputs, animation of code execution, interactive programs, and embedded literal objects.

Before now, I’ve held off on developing a wiki-based development environment.

My hope was to bootstrap the language, and implement the web services in my own language. But it seems the further I get along with developing the AO dictionary, the further I have to go. There is an enormous amount of logic involved in warp, acid-state, filesystem IO, and the like. Rather than slogging through all that, I now feel Awelon project and my mental health would be better off were I to jump straight to visually entertaining applications, perhaps in the style of tangible functions or Logo writer.

Bootstrapping can happen later.

Wikilon will be user-extensible, in the sense that developers can add new web services or web applications simply by defining new words in the AO dictionary. AO leverages ad-hoc naming conventions, e.g. words with prefix `test.` indicate tests that we should run. In this case, new web services might be installed by defining a word like `wikilon/foo` to process HTTP requests on the /foo path. It’s essentially an in-language variation of CGI. However, AO is much easier to secure than CGI, and also subject to partial evaluation (e.g. for a given set of query parameters). Eventually, these web services will provide upgraded interfaces for editing the wiki… and cross-compilation to stand-alone services or applications.

Wikilon shall still provide a REPL-like service, perhaps via websockets.

But I’d like to shift towards a more persistent, reactive testing environment, in a style akin to iPython notebook. As a developer edits definitions for words, they should be able to immediately observe the effects on whichever notes they’re viewing. A given note shall simply be another word in the dictionary, hanging around for anyone who wishes to view it. Similarly, all those `test.` words should be automatic without pressing any buttons, and ideally we should be able to visualize how test and note inputs interact with whichever words we’re editing.

(It could be useful to automatically generate and propose tests, e.g. based on observed failure cases.)

Wikilon’s persistence will make it a lot more scalable than my current approach of parsing the entire dictionary every time I start up the REPL. Granted, this isn’t difficult. My current approach would easily scale to 10k words; even naively implemented, Wikilon should easily scale to a million. A single Wikilon instance might host thousands of projects and users. But rather than a ‘global’ wiki, I expect we’ll want private forks and DVCS-like push/pull relationships, with public wikis serving a role similar to hackage.

The idea of user programmable wikis isn’t new. But I believe AO is better suited for this role, e.g. with respect to security, portability, concurrency, and composition. AO was designed with wiki-based development in mind.

Posted in Language Design, Live Programming, Open Systems Programming, User Interface | 2 Comments

Awelon Progress Report VIII

In the seven weeks since my last report, I’ve primarily focused on growing my AO dictionaries. This was my intention.

Seven weeks ago, I had 1298 words. I’m now up to 2395 words (an 85% increase), including 605 documentation words, 134 test words, and 310 Lisp-based `car/cdr/cdaddr` words (in five variations, including a functional `setcadadr!`). Many of these are undocumented intra-functional words, e.g. for the word `foo` that processes a tree I might implement multiple words of the form `tree.foo`, `leaf.foo`, and `node.foo`. The word `foo` then arranges inputs on the stack and applies the fixpoint.

In addition to the trivial Lisp-based words, development was in four main areas: streams, processes, pseudo-random number generation, and binary search trees.

I’m not entirely satisfied. I had hoped at this point to be finished with finger tree sequences and maps based on self-balancing binary search trees. But I’ve only gotten so far as prototyping maps with a plain old imbalanced tree.

In addition to my primary efforts, my secondary efforts include:

  • Shift my ad-hoc document-based ‘TODO’ lists into the Github issue tracking system. (incrementally ongoing!)
  • Efficient binary representations in ABC, at least for storage and transmission. This issue kept distracting me when I was working on words for bit-stream processing.
  • Candidate algorithms for compression for ABC resources. I’m leaning towards a variation on LZSS at this point that guarantees 1:64 worst-case expansion (in addition to the 1:128 overhead for large binaries). Most of ABC should compress very well with LZSS.
  • Developed a fixpoint combinator with static transport overhead. Before, fixpoint required 2N+8 bytes to fixpoint an N-byte block. It now requires N+30 bytes and is more amenable to optimization and compression. This avoids exponential expansion for layers of recursive processes, and should result in more efficient storage and transport of ABC.
  • Attempting to more clearly express my ideas, especially why I’m developing a new bytecode to start with. I’ve expressed some of them in the code as material post, though that only hits one aspect and doesn’t really touch the importance of streamable code to my vision for Awelon project. I’ve also tried to tweak the AboutABC opening a bit.
  • Developed a model for cryptographic value sealing that is simple in both concept and implementation, and that I feel is promising enough to prototype. Progress here was triggered largely by the development of efficient binary representations sufficient for cryptographic values. Value sealing is a much more accessible model of cryptography than experienced in most conventional languages.

While I might wish for faster progress, I’m certainly feeling the difference these last couple months have made.

What’s Next?

I shall continue developing the dictionary as my primary effort. Despite my progress, I’m still a few layers of abstraction below implementation of type systems, grammars, constraint models, optimizing compilers, web services, abstract runtimes and VMs, and so on.

Eventually, around 5k-10k words, I’ll start to feel the limits of my current implementation. Every 1k words takes about 0.1s to load into memory, more if using the rather ad-hoc prototype of separate compilation.

When it becomes intolerable, or (more likely) if I feel enough progress has been achieved to develop web-apps and an ABC-to-JavaScript compiler in AO, I’ll shift my efforts towards re-implementing AO and ABC development environments as web services, with a more persistent database that can be incrementally updated, tested, etc. and never needs to be reloaded. I have many ideas for a more visual ‘live programming’ development environment that could be implemented in a browser, in addition to a wiki-like experience.

Compilation to native code, C, OpenCL, LLVM, or similar will still be an important target. Kyle Blake is developing abcc, an ABC to C compiler (at the moment), as part of a graduate project.

Posted in Language Design | Leave a comment

Binaries in ABC (or UTF-8)

Awelon Bytecode (ABC) doesn’t have good direct support for binary data. Binaries are not uncommon: cipher texts, secure hash values, compressed visual or audio data, raw sensory data, and so on. Efficient storage and communication of binaries is a desirable feature.

This morning, I had a spark of an idea: a compression algorithm can be specialized to recognize a large sequence of base64 data and simply re-encode it as a run length of binary. This simple technique cleanly separates the concerns of simple representation at the language layer vs. efficient storage and transmission of binary data.

But base64 is hardly necessary in this case, is it? We could use the simpler base16 and reap similar benefits.

Awelon Bytecode is encoded in UTF-8, and may use the full UTF-8 via embedded text. A useful property of UTF-8 is that it has thirteen unused bytes: 0xC0, 0xC1, 0xF5..0xFF. It is feasible to simply usurp one of these bytes to perform a compression pass specialized for UTF-8 that is expected to embed base16. Proposed format:

  • header byte: 0xF8
  • single length byte L, encoding 3..256 bytes (0xFE,0xFF not used)
  • thus encoding 6..512 base16 characters (always an even number)

Thus, we have UTF-8 with efficiently embedded base16 binaries. The overhead for large binaries is less than 1%. The break-even point with base64 is 6 bytes. The encoder requires 512 bytes lookahead, and the decoder requires no buffer. I can easily imagine applying this idea to HTML+JavaScript. Further compression remains quite viable.

For ABC in particular, I’m considering use of a specialized base16 alphabet `bdfghjkmnpqstxyz`. This is mostly to avoid spelling offensive words (no vowels) and avoid interference with numbers (0-9) or ABC data plumbing (mostly `vrwlc`).
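
A Haskell sketch of the decoder side of this proposal – header byte 0xF8, one length byte (0 encodes 3 bytes, 253 encodes 256), then the raw bytes, expanded back into the base16 alphabet. This is my illustration of the format described above, not a finished specification:

    import Data.Word (Word8)
    import Data.Char (ord)
    import Data.Bits (shiftR, (.&.))

    alphabet :: [Word8]
    alphabet = map (fromIntegral . ord) "bdfghjkmnpqstxyz"

    expandByte :: Word8 -> [Word8]
    expandByte b = [ alphabet !! fromIntegral (b `shiftR` 4)
                   , alphabet !! fromIntegral (b .&. 0x0F) ]

    -- Expand embedded binaries back to base16 text; all other bytes pass through.
    decode :: [Word8] -> [Word8]
    decode (0xF8 : l : rest) =
      let n = fromIntegral l + 3              -- length byte encodes 3..256 bytes
          (bin, more) = splitAt n rest
      in concatMap expandByte bin ++ decode more
    decode (b : rest) = b : decode rest
    decode []         = []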

Posted in Distributed Programming, Language Design | 2 Comments

Code is Material

‘Code is data’ is a metaphor around which the Lisp communities (and REBOL, and others) have built programming languages. The metaphor is effective in these languages because code is presented and represented as a simple structure which may easily be manipulated and processed as data, or evaluated; this is leveraged for development of macros, fexprs, and DSLs. In context of other languages, e.g. C++ or Java, the ‘code is data’ metaphor is much less effective. Metaphors are useful insofar as they shape our tools, our intuitions, our communities.

‘Code is material’ is a similar metaphor, an alternative to ‘code is data’. I’m not aware of any community that consciously uses the ‘code is material’ metaphor, but it does seem applicable in retrospect to flow-based programming and some visual dataflow languages, and perhaps a few esoteric languages like Snusp. By ‘material’ I mean to invoke the intuitions surrounding physical substances used to build other products – e.g. wood, metal, glass, rope. We have many intuitions surrounding materials:

  • materials are portable and fungible by nature (and often available in bulk)
  • behavior of a material is intrinsic or based on universal physical laws
  • materials can be composed or shaped into new materials (molecules, fabrics, bricks)
  • materials vary widely in quality, robustness, ease of use, etc.
  • we don’t ask whether materials are meaningful, only whether they are useful

Of course, code does have an advantage over physical materials: code can be cheaply and precisely replicated.

‘Code is data’ and ‘code is material’ stand in opposition with respect to the presence of an ‘interpreter’. Data is interpreted. It is possible to have multiple interpretations of the same data. Materials aren’t interpreted. The behavior of materials is intrinsic or based on universal laws.

For distributed programming or mobile code systems, I like the idea that my code should have the same behavior everywhere. For reasoning, refactoring, and optimization purposes, I like the ability to transform code in systematic ways while preserving behavior. For modularity and security purposes, I like the possibility of hiding implementation details or information within code and closures. These desiderata are better supported if the behavior of code is universal, independent of interpreter.

I imagine, if we take the ‘code is material’ metaphor far enough, that we would be working with composable bricks or structures of code in a virtual reality or augmented reality programming environment, replicating them as needed. Occasionally we might disassemble and reshape a structure for use in a particular problem (perhaps manipulating embedded literal widgets). There would be very little need for a conventional programming language; programming could be approached more as mechanical tinkering. The resulting code-is-material structures may be given part numbers based on secure hash, or directly shared with friends and communities. Applications would consist of devices or tools, themselves constructed of code, of an appropriate type to be ‘installed’ or ‘equipped’ in the programmer’s environment.

Awelon project is aiming towards such a future. The Awelon Bytecode (ABC) is an excellent foundation for the ‘code is material’ metaphor. Awelon Object (AO) is a Forth-like language, allowing users to define words. But those words do not remain entangled with the environment: every AO word is compiled into a finite and environment-independent sequence of ABC, and thus is easily treated as a named ‘brick’ of material, readily ported to other programming environments.

ABC has many useful properties that make it especially suitable for the ‘code is material’ metaphor:

  1. ABC is highly compositional. Concatenation is sequential composition. Concurrent composition is possible due to the causal commutativity requirement. Composition of conditional behaviors is available because ABC favors sum types instead of closed if/then/else behaviors.
  2. ABC code interacts with its environment locally. Every operator manipulates a local value. ABC lacks the conventional heap, pointers, and remote references. Thus, ABC’s behavior is similar to interactions between physical materials.
  3. ABC has atomic and molecular structure: at the lowest level, there are about forty bytecodes that can be understood as atoms. A small, reusable sequence of bytecodes can be understood as a molecule.
  4. ABC’s molecular structure is very linear, similar to genetic code. It is quite feasible to search for new ABC materials by genetic programming, substituting and mixing reusable ‘genes’ (of appropriate type) in a larger program, which might then construct other objects. We can deeply leverage ABC with biological material metaphors (wood, thatch, etc.).
  5. ABC is streamable, in contrast to the common stored-program concept. A streamable program cannot jump backwards. ABC cannot jump at all. (Loops are modeled using fixpoint combinators; no jumping required.) This allows ABC to model not just ‘solid’ materials such as bricks, but also unbounded ‘fluid’ materials, such as command streams or user input.

One of the underlying ideas for Awelon project is the ability to extract reusable code – e.g. user macros and tools – from the user input stream. This is a form of programming by example. And the envisioned Awelon project environment is certainly leaning towards the VR/AR system described earlier. I’m happy to have found a simple metaphor to describe almost every idea underlying Awelon project: code is material.

Aside: A related metaphor is that ‘programming languages are materials’, occasionally presented in opposition to ‘programming languages are tools’ [1]. But I feel it isn’t just the language that is the material; it’s also the frameworks, the libraries… the code in general.

Posted in Distributed Programming, Language Design, Modularity, Open Systems Programming, User Interface | 14 Comments