Program Rewriting

Rewriting is not an unusual basis for programming language formalization. Lambda calculus is frequently expressed in terms of rewriting, as are combinator logics. However, rewriting seems to be unusual at the implementation layer. Instead, we model environments, stacks, none of which are represented in the original program. We abort computation upon encountering free variables or stack underflow, neither of which are errors when the same program is placed in a different context. We worry over control flows and eager vs. lazy evaluation and parallel strategies, all the while struggling with constant propagation and partial evaluations.

Suppose we implement a language primarily by rewriting. We can feed our program to our evaluator, together with a space and effort quota. It chugs and churns and rewrites for a while, then outputs a different representation of the same program – something we can view and edit just like the input program. (To clarify, this is not the same as generalized term rewriting. The evaluator needs only one specific set of rewrite rules, those for our language semantics.) What have we gained?

  • Simplified resource management. Even upon hitting quota limits, we receive a useful, partially evaluated result. This is similar to a memory dump, but is wholly independent from the evaluator. We can continue evaluation, perhaps using a larger space quota or slicing a large computation into small, independent pieces.
  • Simplified debugging on error. If our language doesn’t know what to do with (“cat”+3), then we’ll simply carry that expression into the output, giving developers a concrete view of the error. Because there is no context in which that is okay, our evaluator may even reify it as `error(“cat”+3)` so errors are easily counted, highlighted in red, discovered by text search, etc… Our program output on error isn’t a horrifying, illegible memory dump, but rather is the same language as our input and may be rendered and studied using the same tools and text editors.
  • Improved debugging HCI. All program ‘state’ is explicitly modeled as part of “the program”. Thus, there is never any need to hover over or query a variable named `x` to read its value. Instead, you’ll plainly have an `x = 1234` somewhere in the program. Similarly, if a program is evaluating multiple instances of a subprogram (e.g. due to recursive loops or parallelism), you’ll be able to see those instances, which effectively gives you a rendering of the full stack and every parallel computation. Of course, we’ll need to use progressive disclosure techniques for rendering of very large structures. But that can be achieved by simple heuristics or data type annotations, is entirely data oriented, and is a much easier problem to solve.
  • Simplified breakpoint based debugging. We can reify breakpoints within our language as `@foo(P)`, configured by name as an additional evaluator input. If inactive, @foo(P) returns P. If active, our evaluator rewrites to `@foo:123(P)` (with `123` unique in the output) and treats the breakpoint as a barrier to use of P, but may evaluation of independent subprograms. Our output – the partially evaluated program – can be rendered as plain text (or using our normal program editor), and active breakpoints are easily highlighted, found by searched, and deleted by hand or by pattern.
  • Simplified log based debugging and profiling. We can trivially configure some breakpoints instead as ‘log points’. For example, @console(P) might both return P and print P to the console. A simple convention would an extensible set of names default to logging (e.g. @log:foo). Of course, logging doesn’t provide much context. We can do better with history based debugging.
  • Simplified history based debugging and program ‘animation’. We can animate on arbitrary breakpoints like `@foo` by simply evaluating as far as we can go, taking a snapshot, dropping all the `@foo:*` breakpoints, then repeating. We can animate on multiple breakpoints choosing a strategy: pararallel (@foo and @bar release together), hierarchical (as many @foo frames as we can manage, then one @bar frame, repeat), interleaved (@foo @bar @foo @bar), etc.. Because we’re halting on developer selected breakpoints (rather than time quotas) the intermediate programs will be predictably structured, easy to analyze and debug. Ultimately, we have many megabytes or gigabytes of highly compressible program-snapshots that could easily be rendered into a GIF image.
  • Simplified partial evaluation and staging. With rewriting, partial evaluation is the normal case, i.e. we can take any subprogram and evaluate it as far as it goes, independently of input. With a few annotations like `asap(P)`, we can tell an evaluator to prioritize evaluation deep within a program, giving us effective control of staged computing, no need to juggle anything.
  • Supports expressive spreadsheet HCI. In a typical spreadsheet, cells are inexpressive programs that reduce to an anemic data model. With program to program rewriting, cells can express anything our language can express – e.g. functions, objects, databases. Even when a cell contains a function, it’s the partially evaluated function, which allows for convenient inspection. Given expressive types, we can support have flexible rendering options. Using the ‘animation’ idea, spreadsheet cells could additionally render their own evaluations in a loop for anyone who is interested.
  • Simplified integration of DSLs. In context of rewriting, it is easy to leverage partial applications into partial evaluations. Thus, DSLs need be no worse than the host language for performance, not even for external DSLs. This does leave some interesting challenges, such as tracing errors back to lines in the original DSL code. (I do have some fledgling ideas about annotations for tracing, e.g. configuring a @gate to add hidden, contagious markers to data, perhaps with a limited generation count. Such a technique would be useful for active debugging in general.)
  • Simplified linking. Programs are their own representations, so we only need to deal with the problem of linking values. Linking values is a much easier problem compared to linking of mutable things, e.g. leveraging secure hash URLs. With program rewriting, a runtime link failure isn’t a huge problem. The only consequence is evaluation stalls, much like failing on an effort quota. The output remains a valid program and may easily be continued later, e.g. when the network is up.
  • Simplified big data handling. Because runtime linking is lightweight, secure, and fail-safe, we can feel free to tuck things behind links at runtime. For example, if we don’t expect to need P any time soon, we could say `stow(P)` and we get back a link reference to P that may later be loaded. This technique can be directly applied in the construction of data structures – log structured merge trees, hitchhiker trees, finger-tree deques – that may be vastly larger than working memory.
  • Simplified compilation model. When linking subprograms by secure hash, we are not restricted to loading the plain text representation. We may instead load a cached representation that is heavily optimized for the runtime, where the binaries are compact and the functions run fast. We may use annotations like `jit(P)` or `optimize(P)` to encourage the runtime to perform compilation. Interestingly, linking subprograms by secure hash gives us shared objects by default.
  • Simplified IO. Assume we model a `type Process = Request → (Response, Process)` and hook it up to a network socket. Request, Response, and Process are programs, because all values in program rewriting are represented by programs. They come with plain-text representation, maybe even a type checker. This eliminates a lot of ad-hoc make work surrounding serialization and input validation. Depending on their types, Request and Response may be very expressive, e.g. containing processes themselves, linked references to other data, and procedurally generated structure. Or they might be plain old data types. Injection and extraction of data by other languages needn’t be any more difficult than working with JSON or SQL.
  • Simplified transparent persistence. In the aforementioned process model, the Process is easily checkpointed to plain text, and completed Requests are easily recorded to a log. So we can easily use a simple checkpoint persistence model, heuristically updating the checkpoint and clearing the log. This, again, follows from everything being an independently serializable program.
  • Simplified parallelism. Rewriting multiple parts of a program is the default case rather than the a special case that requires a lot of effort to make happen. Because even the most lightweight of parallelism tends to have some overhead, we might still use explicit annotations like `par(P)` to indicate parallel evaluation. However, we don’t need to make special efforts to ‘touch’ that subprogram with the a control-flow cursor. Even when running sequentially, a good evaluator will occasionally yield work on one part of the program to go work on another, both to avoid burning a limited effort quota on an infinite loop and to improve the debugging experience.
  • Simplified distributed computing. Evaluation of a very large program can be divided among multiple processors in a cloud or mesh network. If a node fails, the worst that happens is we need to recompute some subprograms, and we can limit rework by performing occasional checkpoints. Links to big data or parallel/lazy computations can be communicated cheaply, with the data being serialized only when necessary, which makes it easier to pipeline data through subprograms that are mostly doing data-plumbing work. With annotations like `remote(Node,P)` we can potentially support declarative distribution.

A lot of difficult or annoying problems are simplified upon fully embracing `Program → Program` rewriting as our primary evaluation model. Also, I do not believe performance will suffer, not if we dedicate even a moderate fraction as much time to taking program rewriting seriously as we have historically dedicated to imperative programming.

Has the mainstream computing industry hobbled itself by sneering at rewriting? I am starting to thing so.

And I am now giving serious consideration to replacing my project’s existing bytecode with a variant more friendly to rewriting.

Posted in Language Design, User Interface | Tagged , | 7 Comments

Profiling Pure Code

Side effects make profiling simple: at any time, you can just ask for the time now, ask for the time later, and compute the difference. For pure code, this option isn’t readily available. Fortunately, there are other methods for profiling. Performing periodic stack traces can offer a rough but representative view of where a program spends its time. However, I’m not aware of any profiling method that beats the ad-hoc flexibility of stopwatch style profiling with side effects.

Today I had an interesting, simple idea to support stopwatch style timers fully within pure code. The idea relies on opaque ‘timer’ values. Like a stopwatch, our timer can be active or paused. In either case, I can use a simple integer to represent the time values, perhaps with microsecond precision. Here’s a rough sketch of the code (in pseudo Haskell):

newtype ActiveTimer = Active !Int64 -- microseconds since epoch
newtype PausedTimer = Paused !Int64 -- active microseconds elapsed

newTimer :: PausedTimer
newTimer = Paused 0

startTimer :: PausedTimer → ActiveTimer
startTimer = unsafePerformIO . startTimerIO

stopTimer :: ActiveTimer → PausedTimer
stopTimer = unsafePerformIO . stopTimerIO

startTimerIO :: PausedTimer → IO ActiveTimer
startTimerIO (Paused n) = Active . (- n) <$> getTime

stopTimerIO :: ActiveTimer → IO PausedTimer
stopTimerIO (Active n) = Paused . (- n) <$> getTime

unsafeReadTimerIO :: PausedTimer → IO Int64
unsafeReadTimerIO (Paused n) = pure n

getTime :: IO Int64 -- microseconds since epoch

The timers are first-class plain old immutable values. But we’re restricted on how we may safely observe them. The main thing we can safely do with timers is print them as part of a Debug `trace` message.

The purely functional language I’m developing does not have unsafePerformIO. But I do have annotations to address debugging and performance issues, and support for explicit stopwatch style profiling seems a reasonable application. It is trivial for a runtime to provide a local equivalent to newTimer, startTimer, and stopTimer. Rendering resulting timers would require routing through an IO model. Or printing a debug ‘trace’ message of some sort.

Addendum: I am concerned that these timers may hinder caching and structure sharing. I could potentially handle this by simply forbidding timers from contexts where they cause trouble. But I’m reluctant to do so. I might end up favoring second-class named timers, manipulated by annotation.

Posted in Language Design | Tagged , , | 2 Comments

Pure Process Parallelism

For purely functional programming, Haskell-style par/seq parallelism is has a simple API and is flexible. But it assumes you have lightweight threads and scheduling. Unfortunately, par/seq does not work nicely if the overhead for forking a thread or communication between threads rises.

Parallelism may be expensive for a number of reasons. If we want our software to scale to large distributed networks or clouds, for example, there will be some overhead for creating remote threads and any communications between them. My current motivation is lower level: simple memory management. I favor bump-pointer allocation and a compacting collector (a modified Cheney’s algorithm) for reasons of simplicity, sequential performance, robustness and predictability (no fragmentation issues). If I want compacting GC together with parallelism, the easiest solution is to give each thread its own ‘heap’ and to copy data as needed between threads.

Aside: The usual issues with a Cheney’s collector – the doubled address space and stop-the-world latencies – are a non-issue with 64-bit systems and no side-effects with which to observe latency. I also have a way for programmers to tell the runtime to cold-store data that isn’t getting much attention, so we can keep the heap small even when working with big data.

With expensive parallelism, the implicit goal becomes to control creation of and communication between ‘threads’. Existing models have this property – actors, CSP, flow-based programming, RPC, Erlang processes, etc. tend to involve a stable mass of code that receives data and either sends more messages or returns some other form of result. Of course, most existing models are also complected with notions of identity and concurrency.

For purely functional programming, I will extract just the essential feature: the partition between the computation and communication. This is what I came up with:

A process function (PF) might be modeled as an anonymous function more or less of affine type `PF = arg → (result, PF)`.

A process function is just a normal anonymous function, and its results won’t vary based on parallel evaluation. But we are free to annotate it for parallel evaluation. The behavior of this annotation is runtime dependent. For a Haskell system, this might just involve a lightweight par/seq wrapper. For the runtime I’m developing, it might involve lazily allocating a thread with its own mmap’d heap. For a distributed system, it might involve clouds and virtual machines.

When we call our process function with an argument, that may involves communicating the argument to a separate process. We immediately get our handles to the result and the returned process function. Ownership of the allocated heap (if any) is shifted to the returned process function. Thus, the process function becomes a meaningful ‘process’ – able to hold onto data within its own private heap and control how much data is copied between threads, and with an implicit queue of incoming arguments.

The immediate result is a lightweight placeholder. This might be represented typefully (e.g. a `Future result`), but it could be implicit like lazy or parallel values in Haskell. Accessing the actual result will require synchronization. But if we delay synchronization, we’re free to pass that placeholder around and activate other processes and thus achieve parallelism.

These process functions are very flexible. Pipeline parallelism is natural if we orchestrate static networks between parallel processes. Arguments and results, and internal state of processes, can be flexibly incremental. Of course, we’re still limited to purely functional computations – side-effects or networking must be modeled explicitly – but we can achieve high-performance side-effects together with parallelism by batching techniques or separating parallel computation of a large message payload from its header. While we cannot directly alias processes – they have no identity – we can always model process identity and networking indirectly in terms of routing through lookup tables. The runtime implementation is also flexible. An M:N work-stealing model between processes and CPUs/heaps is an obvious choice. A distributed runtime could feasibly migrate processes based on various heuristics – where synchronization is occurring, cost of serializing messages and results, load balancing, etc..

Process functions generalize on par/seq parallelism, in the sense that par/seq essentially would create a fresh process for each evaluation then drop the returned function.

Expressing parallel programs with process functions instead of par/seq should improve scalability to alternative runtime implementations and distributed systems, and give more strategic control over parallelism in general.

Posted in Concurrency, Language Design | Tagged , , , , | Leave a comment

Yield for Effects

The functional programming communities have largely settled on monad interpreters as a basis for modeling effects. However, monadic effects are synchronous by default, which is a disadvantage in many contexts. I would like to encourage exploration of an even simpler effects model in a purely functional context: shared memory.

There has been a recent movement towards “free” monads, which enable separation of the interpreter. The ability to interpret free monads in many different contexts results in more compositional, extensible programs. Also, in my subjective opinion, free monads just feel right – simple, clean, less entangled with their environment, a better fit for the functional programming experience than the opaque monads of the past. In any given step, a free monad either:

  • yields with a requested effect and continuation
  • returns with a final result

That is, free monads are variants of the following:

data Free eff a 
  = forall x . Yield (eff x) (x -> Free eff a) 
  | Return a

Aside: The above representation has horrible performance properties for left-associative monadic binding, fmap, etc.. Fortunately, this is is a solved problem. We need only change our representation of continuations to a tree-structured queue or similar. See Oleg Kiselyov’s code for an example.

A natural consequence is that monadic effects are synchronous by default. Our interpreter must acquire a response of type `x` before continuing. Further, fine grained effects tend to require yielding frequently. This hinders efficient parallelism and batch processing. Of course, we can explicitly model asynchronous effects.

A simple alternative is to model a shared memory as a basis for effects. As a refinement of our free monad:

data Prog shm a
  = Yield shm (shm -> Prog shm a) 
  | Return shm a

Shared memory is a general, securable approach to effects. Our shared memory may consist of stream buffers, message queues, publish-subscribe topic lists, tuple spaces, databases, component entity systems, or something application specific. Upon yielding, our interpreter has opportunity to fill input buffers, flush output buffers, etc.. Buffering is common and explicit within the `shm` type.

Subprograms are easily constrained to operate on fragments of the shared memory. Subprograms that operate on different `shm` types can frequently be integrated with simple lensing techniques. (Adapting monadic effects, which are more eventful in nature, is more challenging in general.) In any case, a lot of the nice properties of free monads are still accessible with shared memory effects.

Shared memory effects are asynchronous by nature, and amenable to efficient batch processing. Yielding introduces an opportunity for effects, not a requirement for them.

Of course, we can explicitly model synchronous effects, where our interpreter knows to wait for certain data to become available before continuing. We can also mix, match, and layer our effects models, e.g. representing Free within Prog or vice versa. So, this isn’t a committed choice.

Some readers might assume shared memory effects are similar to `World -> World` functions, e.g. as used in Racket’s How to Design Programs. However, `World -> World` functions are driven externally, whereas yielding for effects on shared memory gives much greater control to the program.

I have enjoyed use of publish subscribe models, tuple spaces, blackboard metaphor, content addressable networking, explicit message queues, and otherwise more REST-ful approaches to effects common in procedural programming. I haven’t thoroughly explored this idea in context of purely functional programming, where we have constraints against aliasing and greater control over the sharing of memory… but I believe those properties will only improve the experience. I will continue to explore this idea in Awelon project.

Posted in Language Design | 2 Comments

Out of the Tarpit

I recently re-read Out of the Tarpit by Ben Moseley and Peter Marks. This paper describes a strategy for eliminating accidental complexity from software. Some of their conclusions:

  1. input is modeled by outside agents updating relvars
  2. output is modeled by outside agents observing relations
  3. logic is expressed by deriving relations and data
  4. most state is cached computations for performance
  5. no need for control flow, but performance hints OK
  6. avoid structured data abstraction and data hiding

IO is agent driven. Human users are included among these agents. But we may also have software agents that, for example, inject a twitter feed into a relvar, or observe a relation to determine feedback to twitter. Similarly we can represent posting to a forum, uploading a binary, moving a joystick, etc.. This is a very REST-ful approach to software architecture. Latency for output can be mitigated via Comet-style long polling or subscription techniques for the observers.

Modulo caching for performance, there is no hidden state. All inputs to the system are represented in non-derived relvars. Those relvars aren’t necessarily monotonic: we could represent spreadsheet-like behaviors by modifying relations in ad-hoc ways. But many use cases involve time-series data, which allows us to represent more conventional streams and logs and is especially cache friendly.

I find dubious the conclusions regarding structured data abstractions. The authors argue that when unneeded, irrelevant data is provided to a function, it becomes difficult to reason about what changes affect the output. I grant this. However the same argument applies equally to logic that, for example, drops columns or filters rows (aka project or select) from relation arguments.

Interestingly, I feel the authors only accidentally struck a critical point regarding the nature of complexity and reasoning, that algebraic composition is critical for reasoning at very large scales. By algebraic composition I mean a system that includes:

  • a set of composable elements, components
  • a uniform set of composition operators
  • algebraic closure when applying operators
  • compositional properties: invariant or inductive

Compositional properties allow us to reason (both intuitively and formally) about how systems will behave as we grow them, without requiring deep knowledge of the internal structure of the components. Functions and relations both exhibit algebraic composition.

Of course there are many more models out there with useful compositional properties: grammars, object capabilities, algebraic data types, CRDTs, free monads, etc.. By restricting structured data abstraction, it becomes difficult to represent what might be a more appropriate abstraction for a particular use case. 

Contrasting with Awelon project and Wikilon, my dictionary applications rely in a very similar manner on external agents for IO and cached computations as a basis for most application state. But I don’t provide any equivalent to the proposed separation between relvars and derived content. One motive for mixing is: transparent procedural generation and semantic compression, the ability to transparently template data or represent macros and elevate the level of user input without tweaking downstream data processing to receive a new type or source of input.

Also, one of my hypotheses is that computation is fundamentally mechanical in nature, that software is a physical medium like shaping clay or stone. Awelon project takes structure and flow of data to be essential complexity, important to make explicit, though we’re free to abstract it via yet another computation. I’m interested in relational programming within a functional system, but I gravitate towards solutions like µKanren as an EDSL for this purpose. Pure `state→state` functions, especially in context of substructural types and tacit actions seem a good fit for modeling human inputs, which are both mechanical and meaningful.

It will be interesting to see whether the ad-hoc purely functional structure and data abstractions of Awelon project and Wikilon fall afoul of the complexity predicted in this paper, or how much we use relational EDSLs in practice.


Posted in Language Design | Leave a comment

Wikilon Performance Strategy

Performance is a major concern I’ve struggled with over the last year. Peformance is important for a lot of applications, and to a lot of people. Without performance, Wikilon would never mature, never receive the users, attention, and funding it needs to mature into a productive tool.

Without realizing it, I had made an assumption: that I would implement my entire runtime in Haskell. And so I’ve been fighting with the Haskell language and garbage collector, wondering how to implement an efficient JIT when all my data structures are inaccessible in Haskell or VCache.

I’ve recently broken free of this assumption, and now I have a much better strategy for achieving performance. Distilled to a list of bullet points:

  1. Reimplement the core of my runtime in C.
  2. Implement a high performance single-threaded interpreter.
  3. Shared-nothing memory-regions per thread, and thread-local GC.
  4. Efficient copies: coarse-grained reference counting.
  5. Reimplement VCache for larger-than-memory values, specialized for runtime.
  6. Represent dictionaries as ABC values held within this new VCache.
  7. Support for Haskell-style par/seq parallelism.
  8. Leverage LLVM as a basis for compiling code (annotation driven or AOT).
  9. Develop accelerators: recognize subprograms and data structures, optimize.
  10. Focus on accelerators for collections-oriented programming and linear updates.
  11. Leverage structure sharing from VCache to support cached computations.
  12. Support gateway analysis of arguments to statically typed interpreter and compiler.

The potential exists to achieve world class performance while preserving the extremely simple semantics of Awelon Bytecode. Relevant ideas:

  • I can annotate that a list should be represented internally as a vector or a chunked list. I could do this without accelerators, but it wouldn’t offer much benefit. (Maybe it would improve memory locality.) With accelerators, many functions on lists – indexing a list, reversing a list, list length, appending two lists, splitting a list – can be reduced to a single opcode. Alternative list representations, then, would allow far more efficient implementations of these generic list functions: logical reversals, O(1) length, cheap splitting. This is incredibly useful for collections-oriented programming. (Specialized vectors for binaries would be an obvious extension.)
  • With reference counted vectors, or ‘affine’ vectors (from lists terminating with `[]f`), I can recognize vectors having just one reference. I can easily accelerate a “swap at list index” function. When performed on a vector with just one reference, I can transparently implement this as a O(1) update instead of copy-on-write. Similarly, I can optimize “split” and “append” operations such that when I ‘append’ on the same lines that I previously ‘split’ on, I can seamlessly rejoin the result into one larger vector. O(1) vector update is convenient for a lot of algorithms like union-find. O(1) split and rejoin is useful for many parallelism strategies. Developers may use ‘affine’ properties to enforce this discipline, but they aren’t required to do so.
  • If a subprogram is mostly written in terms of list processing and variants (e.g. matrix processing), then it may potentially be implemented on a GPGPU. We could annotate these subprograms, and automatically compile them for use on the GPGPU (e.g. via OpenCL or CUDA), or raise an error if the program isn’t understood. This technique would allow us to semi-transparently switch from CPU vector processing to GPU vector processing. Besides GPGPUs, we might get weaker benefits leveraging SIMD instructions on a CPU.
  • Coarse-grained reference counting means I don’t have a reference-per-value. Instead, I inject reference counts as distinct nodes in a tree-structured value. Whenever I try to read through such a node, I copy the values down to the next reference count nodes, which I increment. I can also heuristically inject more reference count nodes. Anyhow, this means the interpreter can treat most values as though it has exclusive access, modify things in place, and immediately mark garbage.
  • The specialized re-implementation of VCache promises to be both simpler and more efficient than my original. I implicitly batch my writes by having multiple latent requests in a memory region. Actual storage can wait until memory is tight, or a computation finishes. Transient values – e.g. those introduced while applying a stream of updates to a database-value – will not be written. Because storage is entirely transparent (no `VRef` type), I can reject storage of small values and obtain coarse-grained values that are good compression targets. (I plan to use Zstandard compression for large nodes.) If memory is full and there are no values indicated for storage, I can heuristically pick a few rather than resize memory.
  • Memory limits have been a major hindrance on performance scalability of purely functional systems. The ability to transparently model larger-than-memory values means we don’t need to escape pure functions to interact with an external database. We don’t need separate parallelism strategies for the IO. We can perform partial evaluations and constant propagation on a much grander scale. Databases become plain old values, no separate ‘management system’ required. To fully leverage these large values – e.g. so we can incrementally update and query a database as part of a command pattern dictionary application – we will need to leverage caching. We can cache partially evaluated code or common computations. Structure sharing aides in caching or memoizing computations on very large values: they’ll produce the same identifier when recomputed.

I am excited about the possibilities here. With access to GPGPUs and efficient collections processing, it is feasible to model physics simulations, machine learning, and more. The ability to do this within the framework of Wikilon gives us convenient spreadsheet-like characteristics, the ability to tweak parameters and recompute, the ability to fork a simulation and observe multiple alternative progressions side-by-side. With this performance, we can use Wikilon for real-time tasks by pumping time-series data into the dictionary and computing a proper set of actions.

This strategy does face some challenges.

For effective GPU accelerated computations, we’ll need to recognize functions not just for list processing, but also for a model of IEEE floating point numbers and fixed-width integers (arithmetic modulo 2^32). Floating point numbers are an awful model, so this will be painful. Developing accelerators isn’t too difficult in a technical sense (just recognize substrings of bytecode, substitute for an internal opcode and hand-written function) but it would be very nice to *prove* correctness of the substitution, and that’s probably a lot more difficult. And recognizing subprograms that can be compiled to the GPU can be done in a simplistic way at first (failing except for programs that obviously fit), but there will be pressure to accept the widest possible set of programs.

Further, it’s important to limit how much complexity we accept into our interpreters. Representing lists as vectors or even chunked lists thereof is reasonable and simple enough to implement. Representing lists as finger-tree ropes, OTOH, is far more dubious: it requires a significant amount of logic. Also, it wouldn’t offer significant benefits compared to modeling finger-trees above lists-as-vectors. An ABC runtime written in C has the advantage that we can easily reuse it outside of Wikilon, but we don’t want it to become bloated or difficult to re-implement.

But these challenges are surmountable: a lot of work, details to hammer out, but nothing new. I will probably study Haskell’s ‘accelerate’ for GPGPU computing, for example.

Posted in Language Design | 3 Comments

Wikilon Dictionary Applications

A repeating cycle of simplification and refinement has fine-tuned Wikilon since I introduced it over a year ago. One area where Wikilon has changed significantly is its application model, that is: how I expect to host server-side applications and debugging. The idea now is to use AO dictionary applications for everything, including edit sessions and issue trackers.

Dictionary Applications

An ‘AO dictionary application’ is any application whose state is represented within a healthy AO dictionary. An AO dictionary is a simple association of words to definitions. A ‘healthy’ AO dictionary is acyclic, fully defined, and well typed. AO is a purely functional subset of Awelon Bytecode (ABC) that uses of `{%word}` tokens to link definitions of other words. Linking a word is trivially equivalent to inlining its definition.

Humans usually don’t mess directly with AO bytecode. Instead, they leverage editable views such as Claw, a Forth-like command language, or a more structured variant.

Dictionary applications take this ‘editable views’ idea to a further extreme. Application views may freely involve multiple words and definitions from the dictionary, including evaluation of functions. One very useful pattern for dictionary applications that leverages evaluation is the command pattern. Represented in an AO dictionary, the command pattern looks a lot like:

@foo.v0 (creates initial foo)
@foo.v1 {%foo.v0}(command to update foo)
@foo.v2 {%foo.v1}(another command)
@foo.v3 {%foo.v2}(yet another command)
@foo {%foo.v3}

Here, the word `foo` represents an application object, perhaps a database or a canvas or a collection of communicating processes. With each new command, we add a new definition (e.g. `foo.v4`) then update `foo` to point to the new ‘head’. To obtain the current state of `foo`, we naively would evaluate `foo` from scratch in a provided context. Fortunately, the command pattern is cache-friendly: if we have already computed `foo.v2` in the same context, we can reuse this and evaluate just the last two commands. While command pattern would conventionally be used to present only the ‘current’ state of an object, some applications like REPLs (or a variant with constructive graphics, perhaps inspired from iPython-notebooks) might present each command together with some computed output.

Between hosting state in source, and the command pattern, dictionary applications have many nice features:

  • applications are persistent, portable, versioned like source
  • live coding and continuous testing become implicit
  • abstraction, compression, indexing of state and commands
  • excellent fit with HTTP and REST architectural styles
  • cache-friendly patterns for common updates
  • unlimited undo of commands and dictionary manipulations
  • back-in-time debugging via command or dictionary histories
  • trivial forking, cloning, snapshots of applications

Dictionary applications aren’t limited to explicit ‘refresh’. Trivially, an application could naively poll the dictionary to maintain a current view. But real-time feedback is feasible with Comet technologies like AJAX long-polling or subscription. Real-time views offer a convenient basis for both live coding and communication between agents that share the dictionary.

A dictionary is effectively a filesystem with a rich, computational semantics. Unlike conventional filesystems, a dictionary requires no special consideration for redirects, ad-hoc composition of content, or dictionary compression and procedural generation.

Note: Dictionary applications subsume and simplify a lot of my older ideas, including Wikilon filesystem, abstract virtual machines, extensible syntax via Claw-like views, and embedded literal objects via command patterns.

Real World Effects

Dictionary applications, in their natural environment, are implicitly multi-agent systems.

If our only agent is a human, we can model calculators, REPLs, interactive fictions, and other applications that remain ‘paused’ while waiting on the next human action. If we assume multiple humans, we can suddenly model more interesting applications like forums, shared chatrooms, chess, multi-user dungeons. Communication between humans becomes an implicit effect: humans observe updates caused by other humans. With software agents, or ‘bots’, we can support ad-hoc effects. For example, a human might add a request to a shared chatroom that requires querying an external database. A bot sharing the same chatroom then queries the database and injects the result back into the dictionary to fulfill the request. With software agents, we can model command shells, or time-dependent systems that we iteratively ‘step’ forward until a stable state is reached.

For software agents, effective control becomes important. This is achievable with attributes, i.e. annotations that have no runtime semantics. In the Claw view, I express attributes with parentheses around a list of words, e.g. `(license:BSD3 author:dmbarbour category:math)`. This desugars to a dropped block of annotations `[{&license:BSD3}{&author:dmbarbour}{&category:math}]%`. We can use attributes for all sorts of things: deprecations, todos, indexing, authorship or licensing where it matters, etc.. For bots in particular, we can use opt-in attributes like (allow:myChatBot) or opt-out attributes like `(noindex)`. I generally favor opt-in attributes in cases where a bot will modify the dictionary.

Concurrent Applications

Evaluation within a dictionary is deterministic. Despite this, it is frequently to our advantage to explicitly model concurrent systems, e.g. communicating processes or actors systems. A ‘concurrent system’ is reified as a dictionary object that we can ‘run’ a few steps at a time or interfere with in simple ways (e.g. injecting messages or twiddling state). Our applications are then modeled within these concurrent systems. This pattern buys us several features:

  • ability to iteratively ‘run’ the system a few steps at a time
  • implicit step-based debugging below granularity of ‘commands’
  • ability to interrupt or interfere with applications after any step
  • ability to extend applications by injecting new computations
  • animate, graph, or scrub rendered views of application state
  • easy integration of multiple agents with active application

Debugging concurrent applications hosted in a dictionary is much less unpleasant than doing the same for applications hosted on hardware. For example, there is no risk of heisenbugs. We’re free to step back in time via the command pattern. We can explore multiple valid progressions of non-deterministic concurrent applications (e.g. by using different PRNG seeds to shuffle messages).

The main feature we lose is hardware races for evaluation. We cannot rely on timeouts as a basis for hand-wavy soft real-time behavior. If we need well-timed applications, we’ll need to use explicitly timed programming models (that allow us to express how much time a computation should take, e.g. in terms of seconds or musical beats).

Managed Dictionaries

Preserving a long history for command pattern objects isn’t necessarily a problem. In cases where ‘commands’ are due to discrete and infrequent human actions, such as mailing lists and forums, it is entirely feasible to keep a complete history. Terabytes of storage are quite affordable these days. However, in the general case we might want something like garbage collection for our application histories. Besides GC, I also imagine wanting to automatically refactor, compress, and optimize code in our dictionaries.

As with real-world effects, automatic management of our dictionaries is feasible if we leverage software agents guided by attributes. In particular, I envision tagging words with three attributes to guide automatic management:

  • hidden – avoiding external references; may delete this word if unused
  • opaque – only care about behavior of this definition, not structure
  • frozen – committed to this definition; may inline into opaque definition

Opaque words allow for automatic refactoring, inlining, etc.. Frozen words may be inlined. Hidden words may be GC’d. We can model words with all three attributes, easily enough. Additional attributes might be leveraged to implement ‘gradual’ decay models (e.g. exponential decay). For example, an attribute could represent an expiration condition managed by an independent software agent.

Regarding ‘Big Data’ Applications

It is feasible for application objects to contain multi-gigabyte databases, FTP services, and the like. In conventional programming systems, we’d be forced to shove data out to an external database or filesystem to prevent memory overload. However, Wikilon is designed to address this problem without external dependencies. Instead, we annotate large values for external storage via `{&stow}`. Stowed values are serialized to cold storage but will implicitly be loaded and cached in main memory when necessary. My current implementation uses VCache, which uses a LMDB.

By stowing nodes in a large trie, we can easily model filesystems and databases without loading them into memory all at once. Using finger-trees, we can model larger-than-memory logs and queues. These large values are held by the cache. If we update something far upstream of our cached values we may need to recompute our application states. However, the common case would involve incremental updates and viewing only ‘final’ states.

Parallelism also helps with scalability because it can help quickly recompute a large application state in cases where our updates do not align nicely with our incremental models or existing cache. AO code can easily leverage Haskell’s par/seq parallelism with a few annotations. Long term, it is also feasible to optimize specialized subprograms akin to Haskell’s accelerate package.

Extracting Applications

As Wikilon is a development platform, we’ll sometimes want something closer to the ‘conventional’ edit-compile-test style. For example, we might want to provide a binary for use on an Android phone. This is achieved through an extraction process. The extracted binary is essentially a view on the dictionary, though not a particularly editable one. Conveniently, extraction could be presented to users as a simple GET on an HTTP link, or via a form that constructs an appropriate link (with various compilation flags).

Our extractor must know how to compile a popular ‘type’ of applications for an appropriate back-end. This is easiest for applications modeled as objects with the command pattern, and especially those modeling a concurrent system. Command pattern objects are easy to separate from the dictionary because their ‘state’ doesn’t involve parsing bytecode, and they clearly have ‘updates’. A self-describing object would be even better, i.e. one that can be queried for available commands. Concurrent systems enable us to have some commands take time while permitting interruption or interference, and thus are a good fit for many target systems.

The ability to extract applications in medias res is valuable. This allows us to easily ‘create’ new applications through incremental interactions with existing applications, like working with a spreadsheet or preparing a form so it is mostly filled. So, I think dictionary applications will help a lot even in these cases where we plan to extract a separate binary.

Posted in Language Design, Live Programming, Modularity, State, User Interface | 2 Comments

Command Language for Awelon

Claw (command language for awelon) is an editable view and syntactic sugar for AO (awelon object). AO is a subset of Awelon Bytecode (ABC) that limits tokens and texts. Tokens of the form {%foo} link definitions of other words in a dictionary. Dependencies within the dictionary form a directed acyclic graph. (Recursion is anonymous, via Z-combinator.)

Claw provides a Forth-like experience, suitable for simple REPLs and command shells, albeit with better support for higher order programming and weaker support for ad-hoc side-effects. AO is purely functional, but we can model effects monadically.

Claw is interesting because it’s extremely simple while remaining very effective.

Claw Semantics

Claw semantics is a trivial multi-step expansion, ultimately into bytecode. For example:

Code Expansion
2/3 2 3 ratio
4/10 4 10 ratio
3.141 3141 3 decimal
-1.20 -120 2 decimal
6.02e23 6.02 23 exp10
42 \#42 integer
-7 \#7- integer
[inc mul] \[inc mul] block
[1,2,3] [\[2,3] after 1]
inc mul \{%inc} \{%mul}
“foo” \”foo
~ literal

The lowest level of expansions are the escaped forms, indicated by prefix `\`.

  • bytecode escape, e.g. `\vrwlc`
  • token escape, e.g. `\{foo}`
  • text escape, e.g. `\”foo…\n~`
  • block escape, e.g. `\[inc mul]`

These expand almost directly into bytecode, excepting blocks which must be recursively expanded. Escaped forms shouldn’t see much use at a command line. Users should be encouraged to define words rather than issue commands at this level.

The claw expansion has two important properties.

First, the expansion is context-free. That is, the expansion of a number, text, block is completely independent of its surrounding context. Context free expansion has valuable locality properties, and guarantees that claw code preserves the same tacit concatenative composition and refactoring properties as the bytecode.

Second, this expansion is reversible. Bytecode may be ‘parsed’ back into the forms that generate them, e.g. starting with escaped forms at the bottom level, reducing some tokens back to words, and reducing sequences such as `2 3 ratio` to `2/3`. This reversibility property allows us to understand and treat claw code as a lens or editable view of the underlying bytecode.

Claw code (command histories, etc.) should always be stored in the canonical, fully expanded bytecode format. The burden of providing an editable claw ‘view’ then shifts to either the editor or an intermediate layer (e.g. a FUSE-based filesystem view, or a web server). Doing this improves claw’s portability and extensibility. Bytecode can easily be stored and processed by backends that know nothing of claw syntax. We could keep command sessions in an AO dictionary for regression testing and refactoring purposes. If an editor didn’t know about rationals it could still preserve meaning and present `2 3 ratio` to a user.

Command Sequences

Command sequences enable concise and aesthetic expression of streaming data, monadic DSLs and effects, incremental computation, and continuation passing styles. As such, they are essential to the expressiveness of Claw.

Trivially, [foo, bar, baz] expands to [\[bar, baz] after foo], with recursive expansion to [\[\[baz] after bar] after foo]. Each command separator – the comma – might informally be understood as a yield action to enable a client to perform intermediate work.

Claw Namespaces

In practice, we will develop alternative command environments (shells, REPLs, etc.) for different use cases. I imagine these developments will share metaphors and structures with existing environments, so we’ll need overlap in the command words used, albeit often with a slightly different implementation. Namespaces provide a simple means to reuse words in different contexts.

Claw has exactly one namespace for any given region of code.

The namespace is added as a prefix in the expansion of every word, including words integer, decimal, block, etc. used in code expansions. For example, word `integer` will expand to \{%bar:integer}` if we are in namespace `bar:`. Namespaces are the exception to claw’s context-free expansion constraint. The singular namespace is simple enough to avoid compromising locality and composition, though a little care is needed to preserve namespace information if refactoring at the claw code layer.

A namespace annotation may appear at any point in a Claw command stream. A namespace annotation has the form `#bar:`, meaning `switch to namespace `bar:`. Claw reserves the `#` prefix for this purpose. To guarantee reversibility, a namespace annotation expands into the annotation `{&ns:bar}` at the bytecode layer.

A change in namespace lasts until we either reach the end of the current block or we again change namespaces. Blocks become implicit regional delimiters for namespaces, e.g. for `[#bar: 42 foo]` the `#bar:` annotation only applies within the block. This supports a common use case where we apply a block to a different context than where we constructed it.

Given the restriction to a single namespace, there is never any ambiguity about which foo you’re invoking. There is no need to consult a dictionary (e.g. to search for a defined word) to determine how words link to each other. There are no boiler-plate import lists. This restriction contributes wonderfully to claw’s simplicity and concision.

OTOH, every namespace must define its own version of every common word. If performed by hand, this would be a huge investment of a developer’s time and energy just to catch up and keep up. This cost might be mitigated by developing a few tools or software agents to automate the process.

Claw Extensions

Claw is extremely extensible.

While my immediate focus is on simple numbers and small texts (the original ‘AO’ feature set), it is feasible to support matrices, tables, and other structured data. Structured programming is also viable, e.g. conventional loops and if-then-else expressions converting to use of blocks and keywords (e.g. expansion `\[cond] \[body] while_do_` could correspond to a conventional while loop). Similarly, sugar for monadic expression could simplify integration of ad-hoc sequential effects models.

Any extensions must preserve the properties that expansion is context free (hence local) and reversible. But those are not difficult constraints to work with.

Of course, there are limits on what we can effectively support in context of line-oriented command text. Some syntax extensions, such as matrices or structured programming, would require a page-oriented programming environment. Something akin to iPython-notebook environments, wikis, or structure editors, might be appropriate. With structure editors, an interesting possibility is interactive extensions – e.g. checkboxes, sliders, color pickers, canvases.

Aside: This claw-structured bytecode (based on parsing bytecode into a structured editable view) is independent of the AO structured definitions (based on evaluating bytecode into a manipulable value structure). I hope these orthogonal kinds of structure will augment each other, or at least address different niches. But, with extensions, there is some overlap in potential utility. I’ll need to see how this works in practice.

Command Shells and Effects Models

Commands and sequences thereof can manipulate an external state resource, e.g. a model of a REPL. With command sequences, we can also model a multi-step system with many intermediate outputs, which is convenient for animated renders, e.g. Turtle graphics.

See Wikilon Dictionary Applications for a more up-to-date view.


Claw is simple, transparent, composeable, and extensible. These are excellent properties for my long term goals.

Edits: Modified model for command sequences.

Posted in Grammars, Language Design, User Interface | 1 Comment

Purely Functional Performance

An argument I’ve encountered umpteen times, most recently on LtU, is: imperative programming performs better than functional programming because it has O(1) updates, while updating functional structures is typically O(lg(N)). I’m now weary of repeating the counterpoints so, like a good programmer, I’ve decided to refactor into a blog for future linking. Here goes:

In any physical system, memory is laid out in a 3-dimensional physical space. Consequently, speed-of-light delays between memory and any centralized processing unit increase by at least a factor of N^(1/3). Further, there is a lg(N) factor if we give every location in the space a unique address. Thus, if we account for known physical constraints, the very best we can theoretically do for centralized processing of a data structure is O(lg(N) * N^(1/3)). (Or maybe that `*` should be a `+`.) In practice, memory tends to be laid out in two dimensions (surface of a board, surface of the Earth) so the factor is actually N^(1/2).

Usually, we ignore these physical factors. However, they’re more difficult to ignore today than they were historically. Today, we have three levels of cache, NUMA, distributed shared memory models, using memory itself as cache for the disk, and so on.

If we do account for these physical factors, then functional data structures have the same asymptotic properties as imperative ones.

Conversely, functional structures can easily model physical memory – e.g. use a trie where every key is a 64-bit number. In this case, we can guarantee that the access and update is O(1) in the same asymptotic sense as is the case for physical RAM. Any update to the trie requires creating at most 64 new nodes.

So, no matter which perspective you take – physical or logical – functional code isn’t asymptotically worse than imperative code.

However, FP does tend to be less efficient in every relevant absolute sense. Allocating 64 nodes is a lot more expensive than in-place mutation of RAM in a 64-bit address space. So, I do acknowledge that performance is a concern.

Performance can be divided into a few different concerns, such as efficiency, utilization, scalability. Efficiency: do we use our resources well? Utilization: do we use all of our resources? Scalability: can we easily throw more resources at the problem? (or gracefully take some away?)

While FP isn’t doing so well with absolute efficiency, it can utilize resources and perhaps has greater potential for scalability (due to ease of parallelization, and location-independence of computations).

Further, due to structure sharing, FP performance tends to be more stable under external scenarios that might benefit from acquiring atomic snapshots of machine state: orthogonal persistence, time travel debugging, state visualization, live programming, rollbacks for exception-safe consistency, parallel reads and mirrored servers, etc..

So, while performance certainly is a concern, I believe FP fares very well – both asymptotically and absolutely – after we step back and look at the big picture.

Posted in Language Design | Leave a comment

Abstract Awelon Virtual Machines

I’m still working towards Wikilon, a wiki-based development environment and initial software platform for the Awelon project. The design is still fluid, and has taken me in some interesting and unconventional directions on the ‘development environment’ side.

On the ‘software platform’ side, I was contemplating a filesystem model, for long-lived data and applications. But I was concerned: how do developers disentangle their app from the filesystem? So I shifted to Wikilon hosting abstract virtual machines (AVMs) that host applications. AVMs provide boundaries for entanglement, forcing developers to address latency, concurrency, and disruption. This week, I decided I don’t want the heavy weight of a filesystem in every AVM. AVM state is now just a value, and a trie value is a decent model of a filesystem if developers want to use one. The development of VCache makes it entirely feasible to model filesystem-sized values containing gigabytes of data.

I selected message passing as a basis for communication between AVMs. I’m not fond of message passing, but ease of implementation and integration with existing systems sold it to me anyway.

Some interesting differences between AVMs and other message passing systems:

  • AVMs cleanly separate state from behavior. The behavior of an AVM is a pure function of the form `(InMessage, State) → (OutMessages, State)`. All messages travel through this function. This separation makes live programming easy: bind behavior to a Wikilon dictionary, such that edits affect behaviors in real-time. Similarly, we can apply a structure editor to browse and directly manipulate state.
  • Every message carries a capability-secure context: every InMessage is a (Context, Content) pair while every OutMessage is a (Capability, Content) pair, where the capability is an ADT that keeps context opaque to the sender. An AVM is (usually) given a pure function `self :: Context → Capability` so that it can construct capabilities needed for query-response and other feedback patterns. An AVM constructs ad-hoc, fine-grained capabilities on the fly.
  • AVMs cannot create/spawn/fork/etc. new AVMs as a primitive. However, AVMs may host other AVMs hierarchically, using context to route messages. AVMs tend to centralize a lot of state under just a few objects, which has many advantages for debugging. That said, spawning can be modeled in terms of a capability to ask another machine to host an AVM.
  • A friendly scheduler is specified. Messages between AVMs are implicitly grouped into batches. Output messages resulting from a batch are sorted into a new batch for each target AVM. Batches are processed atomically, and messages within each batch preserve order. Batches between machines also preserve order, using timestamps. Thus, developers don’t need to think about worst cases for non-determinism.
  • A ‘common capabilities’ concept is also supported. For example, if access to UTC time is provided by a capability, and we send this capability to a remote host, said host may recognize this capability as a common one and provide a local implementation. This is discretionary; the host may instead send the message across the network to get UTC time at the origin. This mechanism is highly extensible: time, random numbers, high performance matrix multiplication, HTTP access, multi-homing, etc. can be modeled this way.
  • I’ll make a reasonable effort at timely, reliable delivery of messages, but I won’t guarantee delivery. If a message cannot be delivered easily, I’ll drop it. If a message has a much larger latency than physically expected, I’ll drop it. Dropping messages allows a simple and scalable implementation in the face of varying connectivity, congestion, migration, administration that might take a machine offline, etc.. AVMs can react intelligently to network conditions via special capabilities that can report delivery status or provide alternate behaviors. Developers are encouraged to explicitly model mailboxes and tuple spaces and other robust offline delivery mechanisms where needed.
  • In addition to state and behavior, every AVM has a ‘signal :: Context’ value. This value is used by the hosting environment to get an AVM running (an ‘init’ signal) or support graceful shutdown (a ‘stop’ signal) or indicate other concerns (e.g. low power) or to inject information about the environment or the Self function. Basically, it’s an initial capability. It provides an extensible set of generic AVM methods.
  • At the toplevel, AVMs will be given global names – a secure hash of a public key – with routing based on a distributed hashtable. This makes it easy to create tons of fine-grained AVMs, to move them across physical systems, and to secure them against man-in-the-middle attacks without relying on a central naming service. If developers want common names, e.g. similar to domain names, these might be supported via NameCoin and similar protocols. (I’d rather avoid depending on ICANN or CAs.)

As with other message-passing systems, effects are modeled in terms of sending messages to resources masquerading as machines. Access to Redis, S3, AMQP, XMLHttpRequest, sockets, user interfaces, etc. could be modeled this way. Though I’ll initially focusing on just whatever I need for web services and web sockets.

Addendum: to emphasize a few points

I really love the simple, pure, non-monadic nature of the step function.

This sort of straight-line code is easy to express in a first-order manner. It is abundantly clear to developers that nothing happens – no output, no new input, no progress – unless the function terminates. It is clear that developers cannot wait on a message in response to a send unless they model waiting in a higher layer abstraction. The loop itself is clearly separated from state, which simplifies both live programming and visualization or direct manipulation of machine state.

I believe AVMs will prove highly scalable, and this is greatly aided by their purely functional nature and clear simplicity. Read-only messages are easy to parallelize. If we notice (by comparing before and after states) that many messages are read-only, we can heuristically switch to parallel modes and use compare-and-swap or other rework models to resolve read-write conflicts. Developers can leverage this by dividing AVMs for read-mostly loads or arranging for a single writer to avoid read-write conflicts. This sort of parallelism is highly scalable, e.g. can be implemented through replicating the entire abstract machine. Read-write messages are more difficult to parallelize. But if we have expensive computations distributed across small partitions of state, we can use pure parallelism to be processing many updates at one time. Normal sharding techniques also apply.

The real challenge will be to make these AVMs competitively efficient, i.e. by developing an effective compilers or JIT for Awelon Bytecode, leveraging annotations for memoization, etc..

Posted in Concurrency, Modularity, Open Systems Programming, Security, State | 10 Comments