Labeled Data: Records and Variants

Records and variants are basic labeled data types. A record might be expressed as `{x:12.1, y:13.7}`, having a value for each label. A variant might be `Damage:11.3` or `Say:"Hello"`, and involves selection of one label and an associated value. We can generally combine these to have a variant of records or a record of variants. Labeled data is convenient insofar as it is self-documenting (human meaningful) and extensible. Most modern programming languages support labeled data types as a built-in feature. A few have explored fluent extensibility under the names “row polymorphism” or “polymorphic variants”.

But assume we have a much simpler language. This language readily supports simple binary products (a*b) and sums (a+b). A binary product is a simple pair of values like `{fst:12.1, snd:13.7}` while a binary sum is a simple choice of two like `Left:11.3` or `Right:"Hello"`. Effectively, we have spatial labels that are neither human meaningful nor extensible. For my Awelon project, this isn’t a hypothetical scenario. Awelon can readily Church encode binary products and sums, and can even provide a conventional syntactic projection, but has no built-in type-level support for labeled data.

Given simple binary products and sums, can we recover the benefits of built-in labeled data types?

I think we can do that and better. But first a short detour. Type `0` (aka void) has no values (perhaps excepting `error`). Type 0 is the identity type for sums, in the sense that type `(0+a)` or `(a+0)` is equivalent in many ways to just `a`. Similarly, type `1` (aka unit) is an identity type for products, and `(a*1)` or `(1*a)` is equivalent in many ways to just `a`. But there are some important *structural* differences. Any non-error value from `(0+a)` will be encoded as `Right:_`. With deep sum type structures such as `(0+(0+((0+(a+0))+0)))` we end up encoding values as `Right:Right:Left:Right:Left:_`.

One idea, then, is to encode our label within this Left-Right-Left sequence. One viable encoding is `(RL|RR)*LL` where `RL` means `Right:Left:` and encodes a zero-bit, `RR` encodes a one-bit, and `LL` encodes sequence stop. (This particular encoding has a convenient property of being simple and self synchronizing.) Given our string of ones and zeroes, we may subsequently encode labels using ASCII or UTF-8. This gives us a basis for variants. Similarly, records can be represented by encoding a trie over the labels we use for variants. For the earlier example, directly encoding the path would give us a corresponding structure `(1*(1*((1*(a*1))*1)))`. (We might explore alternative encodings to make it easier to extend or enumerate a record.)
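
For concreteness, here is a sketch of that label encoding in Python (function names invented for illustration): a UTF-8 label maps to its Right/Left path under `(RL|RR)*LL`, and back.

```python
def encode_label(label: str) -> str:
    """Encode a label as a path of L/R steps: RL = 0-bit, RR = 1-bit, LL = stop."""
    bits = ''.join(format(byte, '08b') for byte in label.encode('utf-8'))
    path = ''.join('RL' if b == '0' else 'RR' for b in bits)
    return path + 'LL'

def decode_label(path: str) -> str:
    """Invert encode_label, consuming two steps at a time until the LL stop."""
    bits = []
    for i in range(0, len(path), 2):
        pair = path[i:i + 2]
        if pair == 'LL':
            break
        bits.append('0' if pair == 'RL' else '1')
    data = bytes(int(''.join(bits[i:i + 8]), 2) for i in range(0, len(bits), 8))
    return data.decode('utf-8')

# e.g. encode_label('A') == 'RLRRRLRLRLRLRLRRLL'  (0x41 = 01000001, then stop)
```

Because every code point is read two steps at a time and `LL` only ever means stop, a reader that joins mid-stream can resynchronize, which is the self-synchronizing property mentioned above.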

Those 0 and 1 types essentially represent the ‘extensibility’ options for our variants and records. We get more choices for variants, more slots for records, while the label encoding itself ensures unbounded extensibility. Conveniently, we can also use a variant label to access a record, which provides a simple basis for ‘case’ or ‘switch’ expressions. We do additionally need syntactic sugar for working with these otherwise onerous structures, and likely a compiler that accelerates them. But there is little need for special type level support. And by modeling these records and variants, we can more readily get various polymorphic features, and the ability to work with labels themselves as first-class, typeable structures.

Anyhow, while I found this idea interesting, and I haven’t seen other people approaching labeled data this way, I wouldn’t be surprised if it’s mentioned in papers older than I am. I’ll be pursuing this further in the Awelon project.

Posted in Language Design, Types

Reactive Process Networks

Kahn Process Networks occupy a very nice sweet spot between functional purity and concurrency, and are widely applicable. I’ve recently considered them as a viable alternative to monadic effects models. However, KPNs are missing an important feature: the ability to merge asynchronous inputs from different channels. There is no consistent sense of time for messages within a KPN.

Fortunately, it is easy to add time to KPNs, and thus have reactive process networks:

  • Every process and message is stamped with an implicit time value.
  • Channels have logical latency, adding that latency to time stamps.
  • READ may return Nothing if no messages are available at the process time.
  • WRITE will stamp the outgoing message with the process’s current time.
  • In addition to READ and WRITE, a process may WAIT on a set of channels.
  • WAIT will advance the process time to the earliest message in the set.
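
Under these rules the timing logic is small. A minimal sketch in Python (class and function names invented here), with channels carrying (timestamp, payload) pairs:

```python
from collections import deque

class Channel:
    """A channel with logical latency; messages are (timestamp, payload) pairs."""
    def __init__(self, latency=0):
        self.latency = latency
        self.queue = deque()

    def write(self, process_time, payload):
        # WRITE stamps the message with the writer's current time; the
        # channel's logical latency is added to the time stamp.
        self.queue.append((process_time + self.latency, payload))

    def read(self, process_time):
        # READ returns Nothing (None here) if no message is available
        # at the process's current time.
        if self.queue and self.queue[0][0] <= process_time:
            return self.queue.popleft()[1]
        return None

def wait(process_time, channels):
    """WAIT advances the process time to the earliest pending message in the set
    (never backwards). A full implementation would block here until either a
    message or a logical-time synchronization update arrives."""
    stamps = [ch.queue[0][0] for ch in channels if ch.queue]
    return max(process_time, min(stamps)) if stamps else process_time
```

For example, a message written at time 1 into a latency-2 channel becomes visible at time 3, and a WAIT over that channel plus a quieter one advances the reader to exactly time 3.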

For efficient distributed implementation, we’d implicitly use synchronization messages to report the advance of logical time for a waiting process. Intriguingly, it is feasible to perform some optimistic concurrency: determine likely future outputs on an assumption that no last-second messages on open input ports will change things. And of course, we’ve also solved the expressive weakness of KPNs. We can merge asynchronous messages into a stream, or represent clock-like behaviors – loops with latency.

Another interesting point is that basic KPNs don’t support bounded-buffer ‘push back’ because it can result in deadlock. We can implement them that way and accept the risk, but it isn’t part of Kahn’s original model. But with reactive KPNs, we don’t need bounded buffers as much, because we can bound computations in ‘time’, e.g. by evaluating a KPN up to a given time or controlling the advance of time for open input channels.

The time stamps are logical, and are never directly observed by a process. Time stamps and latencies may use simple natural numbers. We can ‘normalize’ time stamps within a network, e.g. by subtracting the smallest time stamp in the network from all time stamps in the network. For integration with real-time systems, our logical time stamps may have a simple mapping to wall-clock times, such as logical microseconds. We can also use logical latencies to guide physical distribution when evaluating large process networks.

The original KPN behavior is supported. We can always WAIT before READ. Or we can simply use zero-latency wiring and never advance time on open input channels. While READ does not logically WAIT, it must still implicitly wait until we determine whether or not more messages will arrive at a process’s current time.

Reactive process networks resolve my few remaining concerns with using KPNs as both a primary effects model and as a foundation for first-class high-performance distributed programming within purely functional computations. Reactive KPNs could also serve as an excellent substrate for my reactive demand programming model.

Posted in Concurrency, Distributed Programming, Language Design, Reactive Demand Programming

Lambdas in Tacit Code

While tacit programming is an excellent default, it is not suitable for all use cases. Attempting to use tacit programming everywhere results in pointlessly onerous data shuffling. A few local names, in the form of lambdas and let expressions and lexical closures, can simplify a useful subset of code, making it easier to read and write. Many tacit languages, including Factor and Kitten, provide an ‘escape’ from the tacit style. Factor uses a macro rewrite, while Kitten treats this as a primitive feature with support from the syntax and runtime.

I am developing a tacit concatenative language, Awelon. Unlike conventional programming languages, Awelon takes an approach of storing code with a simple, uniform syntax in a database, and minimizing the core language. Awelon has become considerably simpler since last time I discussed it on this blog. If you’re interested, I encourage you to peruse the linked documentation. The core of Awelon is just four primitive combinators:

Awelon Primitive Combinators
[B][A]a == A[B]           (apply)
[B][A]b == [[B]A]         (bind)
   [A]c == [A][A]         (copy)
   [A]d ==                (drop) 

Awelon additionally has syntactic sugar for natural numbers and text literals, and parenthetical annotations like `(nat)`, `(memo)`, or `(par)` for performance and debugging. But all observable behavior and data is ultimately defined in terms of the four combinators. All data is logically Church or Scott encoded as functions, even if accelerated under the hood. Because Awelon’s syntax and semantics are very simple at the storage layer, I leverage editable views for presenting a more sophisticated view to the programmer. For example, a proposed view might expand decimal number `3.141` to `[[3141 0 integer] 3 decimal]`. And `-4/6` to `[[0 4 integer] 6 rational]`.
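
The expansion itself is mechanical. Here is a hypothetical sketch in Python, with nested lists standing in for Awelon blocks; the reading of `[P N integer]` as encoding P minus N is my assumption, chosen to be consistent with the `-4/6` example above.

```python
def decimal_view(text: str):
    """Expand a decimal literal like '3.141' into the proposed
    [[P N integer] D decimal] structure, where [P N integer] is read
    as P - N (an assumption) and D counts decimal places."""
    neg = text.startswith('-')
    digits, _, frac = text.lstrip('-').partition('.')
    n = int(digits + frac)
    integer = [0, n, 'integer'] if neg else [n, 0, 'integer']
    return [integer, len(frac), 'decimal']
```

For example, `decimal_view('3.141')` yields `[[3141, 0, 'integer'], 3, 'decimal']`, matching the proposed view, and the same scheme extends to negative literals.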

For Awelon, an escape from the tacit style should also be an editable view – a bit like Factor’s macro expansion, but reversible because I’ll be storing the tacit representation. How to accomplish this, exactly, didn’t really click for me until yesterday when I was reviewing the Wikipedia article on combinatory logic and specifically the translation from lambda calculus to SKI combinators. I can apply a similar translation to extract lambda variables from Awelon code:

Extract to Argument
   T(X,E) extracts X from E
   ∀X,E. [X] T(X,E) == E
   T(X,E) does not contain X

T(X, E) | E does not contain X => d E
T(X, X) => i  (where [A]i == A)
T(X, [X]) =>    (empty)
T(X, [E]) => [T(X,E)] b
T(X, F G)
  | only F contains X => T(X,F) G
  | only G contains X => [F] a T(X,G)
  | otherwise  => c [T(X,F)] a T(X,G)

We can implement i = [][]b a a d

This is a simple algorithm that generates efficient code. The translation is ‘reversible’ by partial evaluation of `[X] T(X,E)` to propagate the variable back into the rendered view. The rules can be simplified a bit (no need for `i`) if we assume X is a value and instead extract to `X T(X,E)`.
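
The ‘extract to argument’ rules transcribe almost directly into code. Here is a sketch in Python (words as strings, blocks as nested lists; all names invented for illustration), together with a naive rewriting evaluator for `a`, `b`, `c`, `d`, `i` that checks the `[X] T(X,E) == E` property. Note it emits the unsimplified form (e.g. `c [i] a` rather than `c a`), which is behaviorally equivalent.

```python
def occurs(x, e):
    """Does variable x occur anywhere in code e (a list of words and blocks)?"""
    return any(t == x or (isinstance(t, list) and occurs(x, t)) for t in e)

def extract(x, e):
    """T(x, e): code such that [x] ++ T(x, e) evaluates to e, with x removed."""
    if not occurs(x, e):
        return ['d'] + e                     # T(X,E) => d E
    if len(e) == 1:
        t = e[0]
        if t == x:
            return ['i']                     # T(X,X) => i
        if t == [x]:
            return []                        # T(X,[X]) => (empty)
        return [extract(x, t), 'b']          # T(X,[E]) => [T(X,E)] b
    f, g = e[:1], e[1:]                      # split E into first item F, rest G
    if occurs(x, f) and occurs(x, g):
        return ['c', extract(x, f), 'a'] + extract(x, g)
    if occurs(x, f):
        return extract(x, f) + g             # only F contains X
    return [f, 'a'] + extract(x, g)          # only G contains X; f is the block [F]

def is_block(t):
    return isinstance(t, list)

def evaluate(prog):
    """Naive leftmost rewriting for [B][A]a == A[B], [B][A]b == [[B]A],
    [A]c == [A][A], [A]d == (nothing), [A]i == A. Other words are inert.
    Blocks are normalized recursively at the end."""
    prog = list(prog)
    again = True
    while again:
        again = False
        for i, t in enumerate(prog):
            two = i >= 2 and is_block(prog[i - 2]) and is_block(prog[i - 1])
            one = i >= 1 and is_block(prog[i - 1])
            if t == 'a' and two:
                prog[i - 2:i + 1] = list(prog[i - 1]) + [prog[i - 2]]
            elif t == 'b' and two:
                prog[i - 2:i + 1] = [[prog[i - 2]] + list(prog[i - 1])]
            elif t == 'c' and one:
                prog[i - 1:i + 1] = [prog[i - 1], prog[i - 1]]
            elif t == 'd' and one:
                prog[i - 1:i + 1] = []
            elif t == 'i' and one:
                prog[i - 1:i + 1] = list(prog[i - 1])
            else:
                continue
            again = True
            break
    return [evaluate(t) if is_block(t) else t for t in prog]
```

Checking the defining property on the post’s third example, `evaluate([['X']] + extract('X', ['X', 'X', '*', 'X', '-']))` recovers `['X', 'X', '*', 'X', '-']`, and likewise for code containing nested blocks.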

I find Factor’s syntax for named locals to be headache inducing. But Kitten’s `→ X;` assignment syntax is reasonably aesthetic and effectively supports both lambdas and let expressions. Consider an adaptation of Kitten’s syntax for named locals to Awelon:

View: → X; X
Source: (λ X) i

View: → X Y Z; X foo [Y] bar
Source: (λ X Y Z) d [i foo] a bar

View: 7 → X; X X * X -
Source: 7 (λ X) c a c a [*] a i -

Here the `(λ X)` is an annotation or comment that tells our editable view that we intend to present the code as a lambda expression with a specific variable name ‘X’. The details of its representation aren’t essential. What does matter is that we can use the comment to systematically recover the view from source:

Source: (λ X) i
View: → X; [X] i  
  => → X; X

Source: (λ X Y Z) d [i foo] a bar
View: → X Y Z; [X][Y][Z] d [i foo] a bar
  => → X Y Z; [X][Y][i foo] a bar      (dropped Z)
  => → X Y Z; [X]i foo [Y] bar         (apply a)
  => → X Y Z; X foo [Y] bar            (apply i)

Source: 7 (λ X) c a c a [*] a i -
View: 7 → X; [X] c a c a [*] a i -
  => 7 → X; [X] [X] a c a [*] a i -
  => 7 → X; X [X] c a [*] a i -
  => 7 → X; X X [X] [*] a i -
  => 7 → X; X X * [X] i -
  => 7 → X; X X * X -

It seems to work, though we may need to limit partial evaluations based on the patterns in ‘extract to argument’ to avoid evaluating parts of code irrelevant to the lambda.

Awelon programmers will have named locals where convenient without entangling names with syntax or semantics. This has me feeling more confident in my decision to favor concatenative combinators as a simpler base for computation than lambda calculus.

Addendum: To avoid unnecessary copying, we can introduce specialized rewrite rules for conditional behaviors of general form `T(X, [onT][onF]if) => [T(X, onT)][T(X, onF)]if`. I assume `if` will drop one branch and inline the other based on a boolean on the stack. Thus, I leave `X` on the stack rather than copy it into both branches.

Posted in Language Design, User Interface

KPNs as an Effects Model

In this article, I motivate a variant of Kahn Process Networks (KPNs) with bounded channels as a better – more efficient, scalable, composable, securable, serializable, comprehensible, and debuggable – approach to modeling effectful application behavior, especially in the context of purely functional languages and conventions today.


In context of purely functional programming, Haskell pioneered a simple and effective approach to effects by simulation of imperative programming: model a data structure that returns either a final value, or yields with a (request, continuation) pair. By making this data structure opaque, inaccessible to user code, a compiler aware of the structure can optimize and inline requests heavily to achieve acceptable performance. In Haskell, this opaque data structure is called `IO`. (This trivial pattern is also called by obscure and scary names from category theory, but that hasn’t done Haskell any favors.)

One fundamental weakness of this design, however, is the singular nature of that `(request, continuation)` pair. Implicitly, this becomes a performance bottleneck: we’re performing only one ‘request’ at a time. Fortunately, we can work around this limitation in a few ways:

  • Asynchronous requests. Rather than awaiting a result immediately, we may continue with a ‘future’ reference for the result that can be used in a future request. The request itself may then be processed in parallel with ongoing computation.
  • Multi-threading. We can support a request to fork a thread that itself can make more requests. Further, we can support requests for communication between threads, e.g. via channels or shared mutable variables.

And this is essentially what Haskell does. We have our `forkIO` request, and a panoply of communication methods (IORefs, MVars, TVars, channels modeled with them, etc..). Asynchronous futures are frequently implicit to Haskell’s lazy evaluation model.

Unfortunately, these workarounds come at significant cost. Those futures and variables deeply entangle the application with its environment, which hinders serialization, persistence, and debugging. Multi-threading introduces non-deterministic behavior, which can hinder robust testing or reproduction of error conditions.

A second major weakness of this design regards the opacity of that `(request, continuation)` pair.

Assume two applications with opaque effects – `foo.main()` and `bar.main()`. We could compose these sequentially, or in parallel, or invoke one from within the other. However, we cannot control the interleave of requests. We cannot intercept requests in a mockup environment for testing, security, or precise embedding. It is not possible to construct wrappers around an application to tune requests and responses for a different environment. It is not possible to cleanly halt an application. Ultimately, this hurts composition, testing, security, portability, and process control.

Again, there are workarounds – e.g. sandboxes, ad-hoc dependency injection, object capability security. But again, there is a cost – this time to those performance benefits we originally received from compiling with opaque effects.

Kahn Process Networks

KPNs are a simple model consisting of processes and channels. A process is limited to two actions: read from a channel, or write to a channel. A channel carries a FIFO queue of messages, and may have at most one reader and at most one writer. Reading from a channel will block if there is no message available. Generally, writing to a KPN channel does not block. But I believe a variant of KPNs where each channel is annotated with a maximum capacity is superior for every practical use case.

Construction and composition of KPNs is readily achieved in the style of Flow-Based Programming (FBP). Instead of directly naming a channel, a process will read or write a port. Separately, a configuration wires output ports from one process to input ports on another.
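
To illustrate, a toy deterministic KPN evaluator fits in a few dozen lines of Python (all names invented here). Processes are generators yielding read/write commands on ports; a dangling output wire plays the role of an open ‘effects’ channel discussed below.

```python
from collections import deque

def eval_kpn(processes, wires):
    """Run processes (name -> generator yielding ('read', port) or
    ('write', port, msg)) until every remaining process is blocked on an
    empty channel. `wires` maps (proc, out_port) -> (proc, in_port).
    Reads are the only blocking operation and each channel has one reader
    and one writer, so the result is deterministic (Kahn's property)."""
    chans = {dst: deque() for dst in wires.values()}
    pending = {}                          # proc name -> command it is blocked on
    for name, p in processes.items():
        try:
            pending[name] = next(p)       # run each process to its first command
        except StopIteration:
            pass
    progress = True
    while progress:
        progress = False
        for name, cmd in list(pending.items()):
            p = processes[name]
            try:
                if cmd[0] == 'write':     # writes never block in a basic KPN
                    _, port, msg = cmd
                    chans[wires[(name, port)]].append(msg)
                    pending[name] = p.send(None)
                    progress = True
                elif cmd[0] == 'read':    # reads block until a message arrives
                    q = chans.get((name, cmd[1]), deque())
                    if q:
                        pending[name] = p.send(q.popleft())
                        progress = True
            except StopIteration:
                del pending[name]
                progress = True
    return chans

def source(xs):
    for x in xs:
        yield ('write', 'out', x)

def doubler():
    while True:
        x = yield ('read', 'in')
        yield ('write', 'out', 2 * x)

procs = {'src': source([1, 2, 3]), 'dbl': doubler()}
wires = {('src', 'out'): ('dbl', 'in'), ('dbl', 'out'): ('sink', 'in')}
out = eval_kpn(procs, wires)
# out[('sink', 'in')] now holds the doubled stream: 2, 4, 6
```

Here `('sink', 'in')` is wired to no process, so its queue accumulates the network’s output – exactly the kind of open channel an interpreter would drain for requests.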

Aside: KPNs and FBP are similar. Some important differences: FBP has bounded channels by default, ad-hoc side-effects, and non-deterministic ‘merge’ for multiple writers into a shared wire.

KPNs can be used as an effects model.

Open (aka dangling) output channels can be filled with requests to be read by the KPN’s interpreter. Responses can be written to open input channels. A lightweight, declarative configuration can tell our interpreter from which channels it should read requests, and in each case to which channel it should write responses. This might be informally understood in terms of declaring a subset of channels as ‘effectful’ when wiring up our KPN processes.

These ‘effectful’ channels have some nice features:

  • Multiple effectful channels represent concurrent requests.
  • Multiple responses can easily be injected concurrently.
  • An individual effects channel naturally sequences its effects.
  • Requests and responses are buffered and batched up to channel capacity.
  • Internal process model supports parallelism and distribution.

This offers a significant improvement over the singular `(request, continuation)` pair implicit to imperative programming.

Instead of a peephole view of our computation, our effects interpreter may handle many concurrent requests and provide many concurrent responses, and our KPN’s continuation is inherent to the unblocking of processes when our environment drains full request channels or fills empty input channels. Because buffering, parallelism, and concurrency are pervasive, we don’t need workarounds like futures, threads, or shared variables for communication between threads.

By avoiding those workarounds that entangle evaluation with the environment, the resulting system becomes much simpler, easier to serialize and debug.

Note: As an alternative to effects channels, we could augment our process type to support requests on the environment. However, this can trivially be represented by adding two ports to each process (one to write requests, one to receive responses), then wiring an effectful channel between them, then immediately reading a response after each request. Effectful processes would be less expressive for asynchronous behavior, offer less opportunity to intercept and translate requests for a different environment, and cannot buffer more than one pending request per process. Effectful channels are by far the superior design.

KPNs in context of Purely Functional Programming

In context of purely functional programming, KPNs have more nice properties.

A KPN can easily be described by an immutable value – e.g. a collection of named processes and wires, with a list of messages pending on each wire. Granted, this may prove a fair bit more difficult to represent in statically typed PLs, due to the challenges surrounding heterogeneous collections. I expect clever people can get it done.

Evaluation of a KPN is pure and deterministic. This might be represented as a pure function, effectively `evalKPN : KPN → KPN`. Of course, squeezing the most parallelism and performance from the KPN may require special attention from our runtime or compiler.

Intriguingly, evaluation of a KPN can easily proceed in parallel with further actions upon it. This could be expressed by a variant on the evaluator. Something like: `stepKPN : StepKPN k a → KPN → k (a, KPN)`.

During stepKPN, our KPN is running in the background and evaluating, e.g. using conventional threads and queues. The final returned KPN is fully evaluated. But the `StepKPN` type would represent a program that may read and write channels, potentially modify the KPN (add or delete processes and wires, adjust wire capacity, etc..). Importantly, every action on the KPN is deterministic. For example, a read will return immediately if possible (if the channel has data), but if the channel is empty we must wait until input becomes available or until all upstream processes block. Besides actions on the KPN, StepKPN might support operations in some surrounding context `k`.

Effects can be supported either way.

With `evalKPN` we might use a simple tactic to repeatedly:

  • extract pending effect requests
  • inject all available responses
  • initialize evalKPN in parallel
  • process the extracted requests

This would give us frame-oriented effects, with big batches of effects being processed efficiently between frames. This is easy to reason about, and offers adequate parallelism on a single system. Unfortunately, it doesn’t scale very well to distributed systems. However, we could tweak this to model a partitioned KPN system, each partition running its own frames.

With `stepKPN` we can very precisely extract requests, perform effects, and inject responses in a fine-grained manner. This could work very well because we can tune our strategy for reading and writing of open channels based both on the problem and on meta-knowledge of the KPN’s structure.

A third option is viable: associate effects handlers, e.g. of type `request → IO response`, with each effectful channel. Then run the KPN in context of IO. This could maximize parallelism and scalability, albeit at a cost to deterministic ordering of concurrent effects and reproducible system behavior.

That last option would most generically enable KPNs to replace simple imperative IO as the effects model, since it would effectively have the full power of multi-threading but with advantages of inherent batching and reduced complexity.

Posted in Concurrency, Language Design, Modularity, Open Systems Programming, State

Program Rewriting

Rewriting is not an unusual basis for programming language formalization. Lambda calculus is frequently expressed in terms of rewriting, as are combinator logics. However, rewriting seems to be unusual at the implementation layer. Instead, we model environments, stacks, and the like, none of which are represented in the original program. We abort computation upon encountering free variables or stack underflow, neither of which are errors when the same program is placed in a different context. We worry over control flows and eager vs. lazy evaluation and parallel strategies, all the while struggling with constant propagation and partial evaluations.

Suppose we implement a language primarily by rewriting. We can feed our program to our evaluator, together with a space and effort quota. It chugs and churns and rewrites for a while, then outputs a different representation of the same program – something we can view and edit just like the input program. (To clarify, this is not the same as generalized term rewriting. The evaluator needs only one specific set of rewrite rules, those for our language semantics.) What have we gained?

  • Simplified resource management. Even upon hitting quota limits, we receive a useful, partially evaluated result. This is similar to a memory dump, but is wholly independent from the evaluator. We can continue evaluation, perhaps using a larger space quota or slicing a large computation into small, independent pieces.
  • Simplified debugging on error. If our language doesn’t know what to do with `("cat"+3)`, then we’ll simply carry that expression into the output, giving developers a concrete view of the error. Because there is no context in which that is okay, our evaluator may even reify it as `error("cat"+3)` so errors are easily counted, highlighted in red, discovered by text search, etc… Our program output on error isn’t a horrifying, illegible memory dump, but rather is the same language as our input and may be rendered and studied using the same tools and text editors.
  • Improved debugging HCI. All program ‘state’ is explicitly modeled as part of “the program”. Thus, there is never any need to hover over or query a variable named `x` to read its value. Instead, you’ll plainly have an `x = 1234` somewhere in the program. Similarly, if a program is evaluating multiple instances of a subprogram (e.g. due to recursive loops or parallelism), you’ll be able to see those instances, which effectively gives you a rendering of the full stack and every parallel computation. Of course, we’ll need to use progressive disclosure techniques for rendering of very large structures. But that can be achieved by simple heuristics or data type annotations, is entirely data oriented, and is a much easier problem to solve.
  • Simplified breakpoint based debugging. We can reify breakpoints within our language as `@foo(P)`, configured by name as an additional evaluator input. If inactive, @foo(P) returns P. If active, our evaluator rewrites to `@foo:123(P)` (with `123` unique in the output) and treats the breakpoint as a barrier to use of P, but may continue evaluation of independent subprograms. Our output – the partially evaluated program – can be rendered as plain text (or using our normal program editor), and active breakpoints are easily highlighted, found by search, and deleted by hand or by pattern.
  • Simplified log based debugging and profiling. We can trivially configure some breakpoints instead as ‘log points’. For example, @console(P) might both return P and print P to the console. A simple convention would be for an extensible set of names to default to logging (e.g. @log:foo). Of course, logging doesn’t provide much context. We can do better with history based debugging.
  • Simplified history based debugging and program ‘animation’. We can animate on arbitrary breakpoints like `@foo` by simply evaluating as far as we can go, taking a snapshot, dropping all the `@foo:*` breakpoints, then repeating. We can animate on multiple breakpoints choosing a strategy: parallel (@foo and @bar release together), hierarchical (as many @foo frames as we can manage, then one @bar frame, repeat), interleaved (@foo @bar @foo @bar), etc.. Because we’re halting on developer selected breakpoints (rather than time quotas) the intermediate programs will be predictably structured, easy to analyze and debug. Ultimately, we have many megabytes or gigabytes of highly compressible program-snapshots that could easily be rendered into a GIF image.
  • Simplified partial evaluation and staging. With rewriting, partial evaluation is the normal case, i.e. we can take any subprogram and evaluate it as far as it goes, independently of input. With a few annotations like `asap(P)`, we can tell an evaluator to prioritize evaluation deep within a program, giving us effective control of staged computing, no need to juggle anything.
  • Supports expressive spreadsheet HCI. In a typical spreadsheet, cells are inexpressive programs that reduce to an anemic data model. With program to program rewriting, cells can express anything our language can express – e.g. functions, objects, databases. Even when a cell contains a function, it’s the partially evaluated function, which allows for convenient inspection. Given expressive types, we can support flexible rendering options. Using the ‘animation’ idea, spreadsheet cells could additionally render their own evaluations in a loop for anyone who is interested.
  • Simplified integration of DSLs. In context of rewriting, it is easy to leverage partial applications into partial evaluations. Thus, DSLs need be no worse than the host language for performance, not even for external DSLs. This does leave some interesting challenges, such as tracing errors back to lines in the original DSL code. (I do have some fledgling ideas about annotations for tracing, e.g. configuring a @gate to add hidden, contagious markers to data, perhaps with a limited generation count. Such a technique would be useful for active debugging in general.)
  • Simplified linking. Programs are their own representations, so we only need to deal with the problem of linking values. Linking values is a much easier problem compared to linking of mutable things, e.g. leveraging secure hash URLs. With program rewriting, a runtime link failure isn’t a huge problem. The only consequence is evaluation stalls, much like failing on an effort quota. The output remains a valid program and may easily be continued later, e.g. when the network is up.
  • Simplified big data handling. Because runtime linking is lightweight, secure, and fail-safe, we can feel free to tuck things behind links at runtime. For example, if we don’t expect to need P any time soon, we could say `stow(P)` and we get back a link reference to P that may later be loaded. This technique can be directly applied in the construction of data structures – log structured merge trees, hitchhiker trees, finger-tree deques – that may be vastly larger than working memory.
  • Simplified compilation model. When linking subprograms by secure hash, we are not restricted to loading the plain text representation. We may instead load a cached representation that is heavily optimized for the runtime, where the binaries are compact and the functions run fast. We may use annotations like `jit(P)` or `optimize(P)` to encourage the runtime to perform compilation. Interestingly, linking subprograms by secure hash gives us shared objects by default.
  • Simplified IO. Assume we model a `type Process = Request → (Response, Process)` and hook it up to a network socket. Request, Response, and Process are programs, because all values in program rewriting are represented by programs. They come with plain-text representation, maybe even a type checker. This eliminates a lot of ad-hoc make work surrounding serialization and input validation. Depending on their types, Request and Response may be very expressive, e.g. containing processes themselves, linked references to other data, and procedurally generated structure. Or they might be plain old data types. Injection and extraction of data by other languages needn’t be any more difficult than working with JSON or SQL.
  • Simplified transparent persistence. In the aforementioned process model, the Process is easily checkpointed to plain text, and completed Requests are easily recorded to a log. So we can easily use a simple checkpoint persistence model, heuristically updating the checkpoint and clearing the log. This, again, follows from everything being an independently serializable program.
  • Simplified parallelism. Rewriting multiple parts of a program is the default case rather than a special case that requires a lot of effort to make happen. Because even the most lightweight of parallelism tends to have some overhead, we might still use explicit annotations like `par(P)` to indicate parallel evaluation. However, we don’t need to make special efforts to ‘touch’ that subprogram with a control-flow cursor. Even when running sequentially, a good evaluator will occasionally yield work on one part of the program to go work on another, both to avoid burning a limited effort quota on an infinite loop and to improve the debugging experience.
  • Simplified distributed computing. Evaluation of a very large program can be divided among multiple processors in a cloud or mesh network. If a node fails, the worst that happens is we need to recompute some subprograms, and we can limit rework by performing occasional checkpoints. Links to big data or parallel/lazy computations can be communicated cheaply, with the data being serialized only when necessary, which makes it easier to pipeline data through subprograms that are mostly doing data-plumbing work. With annotations like `remote(Node,P)` we can potentially support declarative distribution.
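
To make the partial-evaluation and error-reification points above concrete, here is a toy sketch in Python of a hypothetical expression language (not Awelon): an evaluator that rewrites programs to programs, carrying unreducible and erroneous subexpressions into the output rather than aborting.

```python
def rewrite(expr):
    """Rewriting evaluator for a toy expression language. Expressions are
    nested tuples ('add', a, b), plain ints or strings, or free variables
    ('var', name). Anything that cannot be reduced is carried into the
    output unchanged; ill-typed applications are reified as ('error', e)."""
    if isinstance(expr, tuple) and expr and expr[0] == 'add':
        a, b = rewrite(expr[1]), rewrite(expr[2])
        if isinstance(a, int) and isinstance(b, int):
            return a + b                      # reducible: evaluate in place
        if isinstance(a, str) or isinstance(b, str):
            return ('error', ('add', a, b))   # e.g. "cat"+3: keep it visible
        return ('add', a, b)                  # residual: partial evaluation
    return expr

# A free variable blocks full evaluation, but inner work still happens,
# and a type error becomes part of the output program, not a crash.
```

The output of `rewrite` is always another program in the same representation: `('add', ('add', 1, 2), ('var', 'x'))` becomes the residual `('add', 3, ('var', 'x'))`, and `('add', 'cat', 3)` becomes `('error', ('add', 'cat', 3))`, which a later tool can count, highlight, or search for.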

A lot of difficult or annoying problems are simplified upon fully embracing `Program → Program` rewriting as our primary evaluation model. Also, I do not believe performance will suffer, not if we dedicate even a moderate fraction as much time to taking program rewriting seriously as we have historically dedicated to imperative programming.

Has the mainstream computing industry hobbled itself by sneering at rewriting? I am starting to think so.

And I am now giving serious consideration to replacing my project’s existing bytecode with a variant more friendly to rewriting.

Posted in Language Design, User Interface

Profiling Pure Code

Side effects make profiling simple: at any time, you can just ask for the time now, ask for the time later, and compute the difference. For pure code, this option isn’t readily available. Fortunately, there are other methods for profiling. Performing periodic stack traces can offer a rough but representative view of where a program spends its time. However, I’m not aware of any profiling method that beats the ad-hoc flexibility of stopwatch style profiling with side effects.

Today I had an interesting, simple idea to support stopwatch style timers fully within pure code. The idea relies on opaque ‘timer’ values. Like a stopwatch, our timer can be active or paused. In either case, I can use a simple integer to represent the time values, perhaps with microsecond precision. Here’s a rough sketch of the code (in pseudo Haskell):

newtype ActiveTimer = Active Int64 -- microseconds since epoch
newtype PausedTimer = Paused Int64 -- active microseconds elapsed

newTimer :: PausedTimer
newTimer = Paused 0

startTimer :: PausedTimer → ActiveTimer
startTimer = unsafePerformIO . startTimerIO

stopTimer :: ActiveTimer → PausedTimer
stopTimer = unsafePerformIO . stopTimerIO

startTimerIO :: PausedTimer → IO ActiveTimer
startTimerIO (Paused n) = Active . subtract n <$> getTime

stopTimerIO :: ActiveTimer → IO PausedTimer
stopTimerIO (Active n) = Paused . subtract n <$> getTime

unsafeReadTimerIO :: PausedTimer → IO Int64
unsafeReadTimerIO (Paused n) = pure n

getTime :: IO Int64 -- microseconds since epoch

The timers are first-class plain old immutable values. But we’re restricted on how we may safely observe them. The main thing we can safely do with timers is print them as part of a Debug `trace` message.
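To make the sketch concrete, here is a hedged, runnable variant in plain Haskell. It substitutes a base-only `getTime` built on `System.CPUTime` (CPU-time microseconds rather than microseconds since epoch), and adds a `timed` wrapper that reports via a Debug `trace` message. The names beyond those above are illustrative, and GHC is free to reorder evaluation around `seq`, so treat the measurements as approximate:

```haskell
-- Hedged sketch: stopwatch profiling for pure code, using CPU time.
import Debug.Trace (trace)
import System.CPUTime (getCPUTime)
import System.IO.Unsafe (unsafePerformIO)

newtype ActiveTimer = Active Integer -- microseconds at start
newtype PausedTimer = Paused Integer -- active microseconds elapsed

getTime :: IO Integer
getTime = (`div` 1000000) <$> getCPUTime -- picoseconds to microseconds

startTimer :: PausedTimer -> ActiveTimer
startTimer (Paused n) = unsafePerformIO (Active . subtract n <$> getTime)
{-# NOINLINE startTimer #-}

stopTimer :: ActiveTimer -> PausedTimer
stopTimer (Active n) = unsafePerformIO (Paused . subtract n <$> getTime)
{-# NOINLINE stopTimer #-}

-- Wrap a pure value: force the timer, force the value, then report
-- the elapsed time as part of a Debug trace message.
timed :: String -> a -> a
timed label x = t0 `seq` x `seq` trace msg x
  where
    t0 = startTimer (Paused 0)
    Paused us = stopTimer t0
    msg = label ++ ": " ++ show us ++ " us"
```

For example, `timed "sum" (sum [1 .. 100000 :: Int])` evaluates to the same sum while emitting a trace line like `sum: 3 us` on stderr.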

The purely functional language I’m developing does not have unsafePerformIO. But I do have annotations to address debugging and performance issues, and support for explicit stopwatch style profiling seems a reasonable application. It is trivial for a runtime to provide a local equivalent to newTimer, startTimer, and stopTimer. Rendering the resulting timers would require routing through an IO model, or printing a debug ‘trace’ message of some sort.

Addendum: I am concerned that these timers may hinder caching and structure sharing. I could potentially handle this by simply forbidding timers from contexts where they cause trouble. But I’m reluctant to do so. I might end up favoring second-class named timers, manipulated by annotation.

Posted in Language Design | 2 Comments

Pure Process Parallelism

For purely functional programming, Haskell-style par/seq parallelism has a simple API and is flexible. But it assumes you have lightweight threads and scheduling. Unfortunately, par/seq does not work nicely if the overhead for forking a thread, or for communication between threads, rises.

Parallelism may be expensive for a number of reasons. If we want our software to scale to large distributed networks or clouds, for example, there will be some overhead for creating remote threads and any communications between them. My current motivation is lower level: simple memory management. I favor bump-pointer allocation and a compacting collector (a modified Cheney’s algorithm) for reasons of simplicity, sequential performance, robustness and predictability (no fragmentation issues). If I want compacting GC together with parallelism, the easiest solution is to give each thread its own ‘heap’ and to copy data as needed between threads.

Aside: The usual issues with a Cheney’s collector – the doubled address space and stop-the-world latencies – are a non-issue with 64-bit systems and no side-effects with which to observe latency. I also have a way for programmers to tell the runtime to cold-store data that isn’t getting much attention, so we can keep the heap small even when working with big data.

With expensive parallelism, the implicit goal becomes to control creation of and communication between ‘threads’. Existing models have this property – actors, CSP, flow-based programming, RPC, Erlang processes, etc. tend to involve a stable mass of code that receives data and either sends more messages or returns some other form of result. Of course, most existing models are also complected with notions of identity and concurrency.

For purely functional programming, I will extract just the essential feature: the partition between the computation and communication. This is what I came up with:

A process function (PF) might be modeled as an anonymous function more or less of affine type `PF = arg → (result, PF)`.

A process function is just a normal anonymous function, and its results won’t vary based on parallel evaluation. But we are free to annotate it for parallel evaluation. The behavior of this annotation is runtime dependent. For a Haskell system, this might just involve a lightweight par/seq wrapper. For the runtime I’m developing, it might involve lazily allocating a thread with its own mmap’d heap. For a distributed system, it might involve clouds and virtual machines.

When we call our process function with an argument, that call may involve communicating the argument to a separate process. We immediately get our handles to the result and the returned process function. Ownership of the allocated heap (if any) is shifted to the returned process function. Thus, the process function becomes a meaningful ‘process’ – able to hold onto data within its own private heap and control how much data is copied between threads, and with an implicit queue of incoming arguments.

The immediate result is a lightweight placeholder. This might be represented typefully (e.g. a `Future result`), but it could be implicit like lazy or parallel values in Haskell. Accessing the actual result will require synchronization. But if we delay synchronization, we’re free to pass that placeholder around and activate other processes and thus achieve parallelism.
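The above can be sketched in ordinary Haskell. In this hedged sketch (the names are illustrative), `PF` is the process-function type, `summer` is a process holding a running total as its private state (standing in for a process-local heap), and `runBatch` threads each returned process function through a list of arguments:

```haskell
-- A process function: one step of computation plus its successor.
newtype PF a b = PF { step :: a -> (b, PF a b) }

-- A process that sums its inputs, keeping the total as private state.
summer :: Int -> PF Int Int
summer total = PF $ \x ->
  let total' = total + x
  in (total', summer total')

-- Thread a process function through a list of arguments.
runBatch :: PF a b -> [a] -> [b]
runBatch _ [] = []
runBatch (PF f) (x : xs) =
  let (y, pf') = f x
  in y : runBatch pf' xs
```

Here `runBatch (summer 0) [1, 2, 3]` evaluates to `[1, 3, 6]`. A parallel runtime would annotate the process so each step runs within its own heap; note the affine discipline – calling each returned `PF` at most once – is by convention here, not enforced by Haskell’s types.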

These process functions are very flexible. Pipeline parallelism is natural if we orchestrate static networks between parallel processes. Arguments and results, and internal state of processes, can be flexibly incremental. Of course, we’re still limited to purely functional computations – side-effects or networking must be modeled explicitly – but we can achieve high-performance side-effects together with parallelism by batching techniques or separating parallel computation of a large message payload from its header. While we cannot directly alias processes – they have no identity – we can always model process identity and networking indirectly in terms of routing through lookup tables. The runtime implementation is also flexible. An M:N work-stealing model between processes and CPUs/heaps is an obvious choice. A distributed runtime could feasibly migrate processes based on various heuristics – where synchronization is occurring, cost of serializing messages and results, load balancing, etc..

Process functions generalize par/seq parallelism, in the sense that par/seq would essentially create a fresh process for each evaluation and then drop the returned function.

Expressing parallel programs with process functions instead of par/seq should improve scalability to alternative runtime implementations and distributed systems, and give more strategic control over parallelism in general.

Posted in Concurrency, Language Design | Leave a comment

Yield for Effects

The functional programming communities have largely settled on monad interpreters as a basis for modeling effects. However, monadic effects are synchronous by default, which is a disadvantage in many contexts. I would like to encourage exploration of an even simpler effects model in a purely functional context: shared memory.

There has been a recent movement towards “free” monads, which enable separation of the interpreter. The ability to interpret free monads in many different contexts results in more compositional, extensible programs. Also, in my subjective opinion, free monads just feel right – simple, clean, less entangled with their environment, a better fit for the functional programming experience than the opaque monads of the past. In any given step, a free monad either:

  • yields with a requested effect and continuation
  • returns with a final result

That is, free monads are variants of the following:

data Free eff a 
  = forall x . Yield (eff x) (x -> Free eff a) 
  | Return a

Aside: The above representation has horrible performance properties for left-associative monadic binding, fmap, etc.. Fortunately, this is a solved problem. We need only change our representation of continuations to a tree-structured queue or similar. See Oleg Kiselyov’s code for an example.
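Kiselyov’s tree-structured queue is more than a short sketch, but the simpler codensity transformation illustrates the same repair (a hypothetical sketch, not his code): binds on the wrapped type compose continuation-accepting functions, so even left-nested `(>>=)` chains cost O(1) per bind.

```haskell
{-# LANGUAGE ExistentialQuantification, GADTs, RankNTypes #-}

-- The naive free monad, as above.
data Free eff a
  = forall x. Yield (eff x) (x -> Free eff a)
  | Return a

-- Codensity wrapper: a program is represented by what it does with
-- its continuation, so (>>=) is mere function composition.
newtype CFree eff a = CFree
  { runCFree :: forall r. (a -> Free eff r) -> Free eff r }

instance Functor (CFree eff) where
  fmap f m = CFree $ \c -> runCFree m (c . f)

instance Applicative (CFree eff) where
  pure a    = CFree ($ a)
  mf <*> ma = CFree $ \c -> runCFree mf (\f -> runCFree ma (c . f))

instance Monad (CFree eff) where
  m >>= k = CFree $ \c -> runCFree m (\a -> runCFree (k a) c)

liftF :: eff x -> CFree eff x
liftF e = CFree (Yield e)

lower :: CFree eff a -> Free eff a
lower m = runCFree m Return

-- A toy effect and interpreter to exercise the construction.
data Op x where
  Emit :: String -> Op ()

run :: Free Op a -> ([String], a)
run (Return a)         = ([], a)
run (Yield (Emit s) k) = let (ss, a) = run (k ()) in (s : ss, a)
```

With the toy `Emit` effect, `run (lower (liftF (Emit "a") >> liftF (Emit "b") >> pure 2))` evaluates to `(["a","b"], 2)`.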

A natural consequence is that monadic effects are synchronous by default. Our interpreter must acquire a response of type `x` before continuing. Further, fine grained effects tend to require yielding frequently. This hinders efficient parallelism and batch processing. Of course, we can explicitly model asynchronous effects.

A simple alternative is to model a shared memory as a basis for effects. As a refinement of our free monad:

data Prog shm a
  = Yield shm (shm -> Prog shm a) 
  | Return shm a

Shared memory is a general, securable approach to effects. Our shared memory may consist of stream buffers, message queues, publish-subscribe topic lists, tuple spaces, databases, component entity systems, or something application specific. Upon yielding, our interpreter has opportunity to fill input buffers, flush output buffers, etc.. Buffering is common and explicit within the `shm` type.
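This refinement can be exercised with a toy `shm`. In the following hedged sketch (all names are illustrative, not from any existing library), shared memory is an inbox/outbox pair of queues; the interpreter flushes the outbox and delivers one batch of pending inputs at each yield, while between yields the program runs purely on its private view of memory:

```haskell
-- The refined free monad from above.
data Prog shm a
  = Yield shm (shm -> Prog shm a)
  | Return shm a

-- Toy shared memory: an input queue and an output queue.
data Shm = Shm { inbox :: [Int], outbox :: [Int] }

-- Interpreter: each yield is the opportunity for effects.
interp :: [[Int]] -> Prog Shm a -> ([Int], a)
interp _ (Return s a) = (outbox s, a)
interp inputs (Yield s k) =
  let (pending, rest) = case inputs of
        (i : is) -> (i, is)
        []       -> ([], [])
      sent     = outbox s                    -- flush output buffer
      s'       = Shm (inbox s ++ pending) [] -- fill input buffer
      (out, a) = interp rest (k s')
  in (sent ++ out, a)

-- Example program: for n yields, double every available input,
-- then return a count of any inputs left unprocessed.
doubler :: Int -> Shm -> Prog Shm Int
doubler 0 s = Return s (length (inbox s))
doubler n s =
  let s' = Shm [] (outbox s ++ map (2 *) (inbox s))
  in Yield s' (doubler (n - 1))
```

For instance, `interp [[1, 2], [3]] (doubler 2 (Shm [] []))` evaluates to `([2, 4], 1)`: the first batch is doubled and flushed at the second yield, and the final result counts the one input delivered too late to be processed.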

Subprograms are easily constrained to operate on fragments of the shared memory. Subprograms that operate on different `shm` types can frequently be integrated with simple lensing techniques. (Adapting monadic effects, which are more eventful in nature, is more challenging in general.) In any case, a lot of the nice properties of free monads are still accessible with shared memory effects.

Shared memory effects are asynchronous by nature, and amenable to efficient batch processing. Yielding introduces an opportunity for effects, not a requirement for them.

Of course, we can explicitly model synchronous effects, where our interpreter knows to wait for certain data to become available before continuing. We can also mix, match, and layer our effects models, e.g. representing Free within Prog or vice versa. So, this isn’t a committed choice.

Some readers might assume shared memory effects are similar to `World -> World` functions, e.g. as used in Racket’s How to Design Programs. However, `World -> World` functions are driven externally, whereas yielding for effects on shared memory gives much greater control to the program.

I have enjoyed use of publish subscribe models, tuple spaces, blackboard metaphor, content addressable networking, explicit message queues, and otherwise more REST-ful approaches to effects common in procedural programming. I haven’t thoroughly explored this idea in context of purely functional programming, where we have constraints against aliasing and greater control over the sharing of memory… but I believe those properties will only improve the experience. I will continue to explore this idea in Awelon project.

Posted in Language Design | 2 Comments

Out of the Tarpit

I recently re-read Out of the Tarpit by Ben Moseley and Peter Marks. This paper describes a strategy for eliminating accidental complexity from software. Some of their conclusions:

  1. input is modeled by outside agents updating relvars
  2. output is modeled by outside agents observing relations
  3. logic is expressed by deriving relations and data
  4. most state is cached computations for performance
  5. no need for control flow, but performance hints OK
  6. avoid structured data abstraction and data hiding

IO is agent driven. Human users are included among these agents. But we may also have software agents that, for example, inject a twitter feed into a relvar, or observe a relation to determine feedback to twitter. Similarly we can represent posting to a forum, uploading a binary, moving a joystick, etc.. This is a very REST-ful approach to software architecture. Latency for output can be mitigated via Comet-style long polling or subscription techniques for the observers.

Modulo caching for performance, there is no hidden state. All inputs to the system are represented in non-derived relvars. Those relvars aren’t necessarily monotonic: we could represent spreadsheet-like behaviors by modifying relations in ad-hoc ways. But many use cases involve time-series data, which allows us to represent more conventional streams and logs and is especially cache friendly.

I find dubious the conclusions regarding structured data abstractions. The authors argue that when unneeded, irrelevant data is provided to a function, it becomes difficult to reason about what changes affect the output. I grant this. However the same argument applies equally to logic that, for example, drops columns or filters rows (aka project or select) from relation arguments.

Interestingly, I feel the authors only accidentally struck upon an essential point regarding the nature of complexity and reasoning: that algebraic composition is critical for reasoning at very large scales. By algebraic composition I mean a system that includes:

  • a set of composable elements, components
  • a uniform set of composition operators
  • algebraic closure when applying operators
  • compositional properties: invariant or inductive

Compositional properties allow us to reason (both intuitively and formally) about how systems will behave as we grow them, without requiring deep knowledge of the internal structure of the components. Functions and relations both exhibit algebraic composition.

Of course there are many more models out there with useful compositional properties: grammars, object capabilities, algebraic data types, CRDTs, free monads, etc.. By restricting structured data abstraction, it becomes difficult to represent what might be a more appropriate abstraction for a particular use case. 

Contrasting with Awelon project and Wikilon, my dictionary applications rely in a very similar manner on external agents for IO and cached computations as a basis for most application state. But I don’t provide any equivalent to the proposed separation between relvars and derived content. One motive for mixing the two is transparent procedural generation and semantic compression: the ability to transparently template data or represent macros, elevating the level of user input without tweaking downstream data processing to receive a new type or source of input.

Also, one of my hypotheses is that computation is fundamentally mechanical in nature, that software is a physical medium like shaping clay or stone. Awelon project takes structure and flow of data to be essential complexity, important to make explicit, though we’re free to abstract it via yet another computation. I’m interested in relational programming within a functional system, but I gravitate towards solutions like µKanren as an EDSL for this purpose. Pure `state→state` functions, especially in context of substructural types and tacit actions, seem a good fit for modeling human inputs, which are both mechanical and meaningful.

It will be interesting to see whether the ad-hoc purely functional structure and data abstractions of Awelon project and Wikilon fall afoul of the complexity predicted in this paper, or how much we use relational EDSLs in practice.


Posted in Language Design | Leave a comment

Wikilon Performance Strategy

Performance is a major concern I’ve struggled with over the last year. Performance is important for a lot of applications, and to a lot of people. Without it, Wikilon would never receive the users, attention, and funding it needs to mature into a productive tool.

Without realizing it, I had made an assumption: that I would implement my entire runtime in Haskell. And so I’ve been fighting with the Haskell language and garbage collector, wondering how to implement an efficient JIT when all my data structures are inaccessible in Haskell or VCache.

I’ve recently broken free of this assumption, and now I have a much better strategy for achieving performance. Distilled to a list of bullet points:

  1. Reimplement the core of my runtime in C.
  2. Implement a high performance single-threaded interpreter.
  3. Shared-nothing memory-regions per thread, and thread-local GC.
  4. Efficient copies: coarse-grained reference counting.
  5. Reimplement VCache for larger-than-memory values, specialized for runtime.
  6. Represent dictionaries as ABC values held within this new VCache.
  7. Support for Haskell-style par/seq parallelism.
  8. Leverage LLVM as a basis for compiling code (annotation driven or AOT).
  9. Develop accelerators: recognize subprograms and data structures, optimize.
  10. Focus on accelerators for collections-oriented programming and linear updates.
  11. Leverage structure sharing from VCache to support cached computations.
  12. Support gateway analysis of arguments to statically typed interpreter and compiler.

The potential exists to achieve world class performance while preserving the extremely simple semantics of Awelon Bytecode. Relevant ideas:

  • I can annotate that a list should be represented internally as a vector or a chunked list. I could do this without accelerators, but it wouldn’t offer much benefit. (Maybe it would improve memory locality.) With accelerators, many functions on lists – indexing a list, reversing a list, list length, appending two lists, splitting a list – can be reduced to a single opcode. Alternative list representations, then, would allow far more efficient implementations of these generic list functions: logical reversals, O(1) length, cheap splitting. This is incredibly useful for collections-oriented programming. (Specialized vectors for binaries would be an obvious extension.)
  • With reference counted vectors, or ‘affine’ vectors (from lists terminating with `[]f`), I can recognize vectors having just one reference. I can easily accelerate a “swap at list index” function. When performed on a vector with just one reference, I can transparently implement this as a O(1) update instead of copy-on-write. Similarly, I can optimize “split” and “append” operations such that when I ‘append’ on the same lines that I previously ‘split’ on, I can seamlessly rejoin the result into one larger vector. O(1) vector update is convenient for a lot of algorithms like union-find. O(1) split and rejoin is useful for many parallelism strategies. Developers may use ‘affine’ properties to enforce this discipline, but they aren’t required to do so.
  • If a subprogram is mostly written in terms of list processing and variants (e.g. matrix processing), then it may potentially be implemented on a GPGPU. We could annotate these subprograms, and automatically compile them for use on the GPGPU (e.g. via OpenCL or CUDA), or raise an error if the program isn’t understood. This technique would allow us to semi-transparently switch from CPU vector processing to GPU vector processing. Besides GPGPUs, we might get weaker benefits leveraging SIMD instructions on a CPU.
  • Coarse-grained reference counting means I don’t have a reference-per-value. Instead, I inject reference counts as distinct nodes in a tree-structured value. Whenever I try to read through such a node, I copy the values down to the next reference count nodes, which I increment. I can also heuristically inject more reference count nodes. Anyhow, this means the interpreter can treat most values as though it has exclusive access, modify things in place, and immediately mark garbage.
  • The specialized re-implementation of VCache promises to be both simpler and more efficient than my original. I implicitly batch my writes by having multiple latent requests in a memory region. Actual storage can wait until memory is tight, or a computation finishes. Transient values – e.g. those introduced while applying a stream of updates to a database-value – will not be written. Because storage is entirely transparent (no `VRef` type), I can reject storage of small values and obtain coarse-grained values that are good compression targets. (I plan to use Zstandard compression for large nodes.) If memory is full and there are no values indicated for storage, I can heuristically pick a few rather than resize memory.
  • Memory limits have been a major hindrance on performance scalability of purely functional systems. The ability to transparently model larger-than-memory values means we don’t need to escape pure functions to interact with an external database. We don’t need separate parallelism strategies for the IO. We can perform partial evaluations and constant propagation on a much grander scale. Databases become plain old values, no separate ‘management system’ required. To fully leverage these large values – e.g. so we can incrementally update and query a database as part of a command pattern dictionary application – we will need to leverage caching. We can cache partially evaluated code or common computations. Structure sharing aides in caching or memoizing computations on very large values: they’ll produce the same identifier when recomputed.

I am excited about the possibilities here. With access to GPGPUs and efficient collections processing, it is feasible to model physics simulations, machine learning, and more. The ability to do this within the framework of Wikilon gives us convenient spreadsheet-like characteristics, the ability to tweak parameters and recompute, the ability to fork a simulation and observe multiple alternative progressions side-by-side. With this performance, we can use Wikilon for real-time tasks by pumping time-series data into the dictionary and computing a proper set of actions.

This strategy does face some challenges.

For effective GPU accelerated computations, we’ll need to recognize functions not just for list processing, but also for a model of IEEE floating point numbers and fixed-width integers (arithmetic modulo 2^32). Floating point numbers are an awful model, so this will be painful. Developing accelerators isn’t too difficult in a technical sense (just recognize substrings of bytecode, substitute for an internal opcode and hand-written function) but it would be very nice to *prove* correctness of the substitution, and that’s probably a lot more difficult. And recognizing subprograms that can be compiled to the GPU can be done in a simplistic way at first (failing except for programs that obviously fit), but there will be pressure to accept the widest possible set of programs.

Further, it’s important to limit how much complexity we accept into our interpreters. Representing lists as vectors or even chunked lists thereof is reasonable and simple enough to implement. Representing lists as finger-tree ropes, OTOH, is far more dubious: it requires a significant amount of logic. Also, it wouldn’t offer significant benefits compared to modeling finger-trees above lists-as-vectors. An ABC runtime written in C has the advantage that we can easily reuse it outside of Wikilon, but we don’t want it to become bloated or difficult to re-implement.

But these challenges are surmountable: a lot of work, details to hammer out, but nothing new. I will probably study Haskell’s ‘accelerate’ for GPGPU computing, for example.

Posted in Language Design | 3 Comments