I’ve been exploring extensible language for almost as long as I’ve been interested in PL. A few years ago, I posted some constraints for user-defined syntax that I had developed c.2007. More recently, I’ve described a alternative to DSLs based on embedding applet-like objects as literals in source code. The latter is especially interesting because it might enable ad-hoc graphical languages for problems like building game worlds, music, graph literals.
Despite my interest, it remains unclear to me whether this is a “good idea”. Extensible syntax presents significant challenges for tooling (e.g. reporting errors, syntax highlighting), and can result in a steep learning curve for programmers. So, when I started Awelon project, I chose to table efforts on this subject.
With the ongoing development of Wikilon, I feel it’s time to explore extensible syntax more seriously. And I’m also interested in the possibility of unifying user-defined syntax with embedded literal objects.
Summary of constraints for extensible syntax:
- limit syntax to boundaries of one module
- use one short word to indicate syntax
- avoid modifying syntax within a module
- syntax should be modular, composable
- syntax should be a library, not external tool
- trace errors to whatever the programmer sees
The motivation for these constraints is primarily to minimize boiler-plate, support “small” modules, simplify incremental compilation, support tooling.
I plan to base Wikilon on Awelon Object (AO) code. Each AO word is a function, a module, and a page in the wiki. AO has a simplistic Forth-like syntax. A typical module is between five and fifteen words and literals. Many modules are just one line. A wiki with very small pages is viable if we allow the browser to open multiple pages on a single canvas, e.g. similar to CodeBubbles.
With DSLs, data modules, etc. some of these pages would become much larger than the typical AO word. But we should favor a low overhead so we can still use small modules even with alternative syntax. Using just one word per module to indicate language seems a nice target. If a language needs additional parameters, it may always parse them.
Module text would then be processed by the language. The exact nature of this processing is still an open question: we might translate the input to AO code, or to an AST, or leverage a monadic API to build some object representing the code’s meaning. I favor targeting an API because it better fits the ‘code as material’ metaphor and readily supports staged computation (i.e. use behaviors defined by other words) and may be readily extensible and composable if we choose a good general model for languages (e.g. machines rather than parser combinators).
A concrete import/export format like the following might work well enough:
:benchmark 0 [4 +] 1000 1000 * repeat
:repeat assertNatural .rw [step.repeat] bind fixpoint .wl swap inline
:step.repeat rot dup 0 gt [drop3] [action.repeat] if_
:action.repeat dec unrot dip2 inline
:fixpoint .prebind_l .fixfirst
:.prebind_l %r [%l] %rol
:.fixfirst %r' [%^'ow^'zowvr$c] %r^owol
MoO moO MoO mOo MOO OOM MMM moO moO
MMM mOo mOo moO MMM mOo MMM moO moO
MOO MOo mOo MoO moO moo mOo mOo moo
I favor flat, streamable formats like this because they’re good not just for import and export, but also for continuous updates and synchronization. This format could be easily extended with new toplevel operations, e.g. to add temporal information or author metadata, or to support diffs. I’m assuming that newlines in content are escaped by trivially following them with a space.
The above seems a good start, though I might favor plain English ‘define’ and ‘using’ instead of ‘:’ and ‘#’.
But can we also support nice graphical languages?
I’ve had some difficulty adapting ‘embedded literal objects‘ to Wikilon. One of the big problems was identity: when objects are embedded at arbitrary locations in source code, it becomes difficult to name those objects, i.e. you’re stuck saying something like “the third embedded literal in function foo” and that name might not be very stable.
Naming objects is convenient for many reasons. For example, it simplifies recording ‘view’ information, e.g. such that clients might render multiple views of the same 3D scenegraph, each with a different rotation or camera position. Naming also simplifies streaming updates. For example, the browser might stream “I performed update XYZZY with parameters 1,2,3,4″ updates to the named object, and the Wikilon server might stream rendering updates back to multiple views of that object. The ability to stream updates is valuable because embedded literal objects may grow large, and we’d rather avoid the costs of shuttling them between client and server or repeatedly recompiling them.
To support names, I must abandon the ‘literal’ aspect of embedded literal objects.
Instead, I would align each object with an AO word. This word becomes a simple, stable identity for the object. AO words serve yet another role: module, function, Wikilon page, and now interactive object or applet. The Wikilon page for each of these objects might present itself as an interactive web application with web sockets to maintain updates. However, unlike general purpose web apps, these objects would be self-contained, purely functional, highly modular. It would generally be easy to peek under the hood to see the code, or to ‘clone’ an object by creating a new word with the same initial code. Stable (bookmarkable) views of the object might be represented with a few query parameters in a URL (enabling some server-side rendering), or a little information in the anchor.
We need logic to render these objects, to receive updates, to generate AO code, to trace errors.
The language word seems a fine place to specify this additional logic. Rather than a function that simply compiles text, the language function must accepts multiple methods, or alternatively constructs an object (based on an initial text) that can handle various update methods and pretty-print an updated version of the source in the end. (The latter could be a lot more efficient when processing lots of updates.)
I’m a bit wary because It’s difficult to structurally guarantee or prove useful properties for languages like: “pretty printing then reconstructing is observationally equivalent to the identity function”. Fortunately, Wikilon does make it feasible to test properties like this against historical definitions and updates, and various dynamic tests can be applied systematically (e.g. does `parse print parse print` result in the same printed content and AO behavior each time?). I believe we can achieve a reasonable degree of confidence in our DSLs and embedded objects.
There are still many details to hammer out, such as what is a good type for the language function. But I think extensible syntax will be worth pursuing, especially if I can support graphical languages and flexible editors.