Performance is a major concern I’ve struggled with over the last year. Peformance is important for a lot of applications, and to a lot of people. Without performance, Wikilon would never mature, never receive the users, attention, and funding it needs to mature into a productive tool.
Without realizing it, I had made an assumption: that I would implement my entire runtime in Haskell. And so I’ve been fighting with the Haskell language and garbage collector, wondering how to implement an efficient JIT when all my data structures are inaccessible in Haskell or VCache.
I’ve recently broken free of this assumption, and now I have a much better strategy for achieving performance. Distilled to a list of bullet points:
- Reimplement the core of my runtime in C.
- Implement a high performance single-threaded interpreter.
- Shared-nothing memory-regions per thread, and thread-local GC.
- Efficient copies: coarse-grained reference counting.
- Reimplement VCache for larger-than-memory values, specialized for runtime.
- Represent dictionaries as ABC values held within this new VCache.
- Support for Haskell-style par/seq parallelism.
- Leverage LLVM as a basis for compiling code (annotation driven or AOT).
- Develop accelerators: recognize subprograms and data structures, optimize.
- Focus on accelerators for collections-oriented programming and linear updates.
- Leverage structure sharing from VCache to support cached computations.
- Support gateway analysis of arguments to statically typed interpreter and compiler.
The potential exists to achieve world class performance while preserving the extremely simple semantics of Awelon Bytecode. Relevant ideas:
- I can annotate that a list should be represented internally as a vector or a chunked list. I could do this without accelerators, but it wouldn’t offer much benefit. (Maybe it would improve memory locality.) With accelerators, many functions on lists – indexing a list, reversing a list, list length, appending two lists, splitting a list – can be reduced to a single opcode. Alternative list representations, then, would allow far more efficient implementations of these generic list functions: logical reversals, O(1) length, cheap splitting. This is incredibly useful for collections-oriented programming. (Specialized vectors for binaries would be an obvious extension.)
- With reference counted vectors, or ‘affine’ vectors (from lists terminating with `f`), I can recognize vectors having just one reference. I can easily accelerate a “swap at list index” function. When performed on a vector with just one reference, I can transparently implement this as a O(1) update instead of copy-on-write. Similarly, I can optimize “split” and “append” operations such that when I ‘append’ on the same lines that I previously ‘split’ on, I can seamlessly rejoin the result into one larger vector. O(1) vector update is convenient for a lot of algorithms like union-find. O(1) split and rejoin is useful for many parallelism strategies. Developers may use ‘affine’ properties to enforce this discipline, but they aren’t required to do so.
- If a subprogram is mostly written in terms of list processing and variants (e.g. matrix processing), then it may potentially be implemented on a GPGPU. We could annotate these subprograms, and automatically compile them for use on the GPGPU (e.g. via OpenCL or CUDA), or raise an error if the program isn’t understood. This technique would allow us to semi-transparently switch from CPU vector processing to GPU vector processing. Besides GPGPUs, we might get weaker benefits leveraging SIMD instructions on a CPU.
- Coarse-grained reference counting means I don’t have a reference-per-value. Instead, I inject reference counts as distinct nodes in a tree-structured value. Whenever I try to read through such a node, I copy the values down to the next reference count nodes, which I increment. I can also heuristically inject more reference count nodes. Anyhow, this means the interpreter can treat most values as though it has exclusive access, modify things in place, and immediately mark garbage.
- The specialized re-implementation of VCache promises to be both simpler and more efficient than my original. I implicitly batch my writes by having multiple latent requests in a memory region. Actual storage can wait until memory is tight, or a computation finishes. Transient values – e.g. those introduced while applying a stream of updates to a database-value – will not be written. Because storage is entirely transparent (no `VRef` type), I can reject storage of small values and obtain coarse-grained values that are good compression targets. (I plan to use Zstandard compression for large nodes.) If memory is full and there are no values indicated for storage, I can heuristically pick a few rather than resize memory.
- Memory limits have been a major hindrance on performance scalability of purely functional systems. The ability to transparently model larger-than-memory values means we don’t need to escape pure functions to interact with an external database. We don’t need separate parallelism strategies for the IO. We can perform partial evaluations and constant propagation on a much grander scale. Databases become plain old values, no separate ‘management system’ required. To fully leverage these large values – e.g. so we can incrementally update and query a database as part of a command pattern dictionary application – we will need to leverage caching. We can cache partially evaluated code or common computations. Structure sharing aides in caching or memoizing computations on very large values: they’ll produce the same identifier when recomputed.
I am excited about the possibilities here. With access to GPGPUs and efficient collections processing, it is feasible to model physics simulations, machine learning, and more. The ability to do this within the framework of Wikilon gives us convenient spreadsheet-like characteristics, the ability to tweak parameters and recompute, the ability to fork a simulation and observe multiple alternative progressions side-by-side. With this performance, we can use Wikilon for real-time tasks by pumping time-series data into the dictionary and computing a proper set of actions.
This strategy does face some challenges.
For effective GPU accelerated computations, we’ll need to recognize functions not just for list processing, but also for a model of IEEE floating point numbers and fixed-width integers (arithmetic modulo 2^32). Floating point numbers are an awful model, so this will be painful. Developing accelerators isn’t too difficult in a technical sense (just recognize substrings of bytecode, substitute for an internal opcode and hand-written function) but it would be very nice to *prove* correctness of the substitution, and that’s probably a lot more difficult. And recognizing subprograms that can be compiled to the GPU can be done in a simplistic way at first (failing except for programs that obviously fit), but there will be pressure to accept the widest possible set of programs.
Further, it’s important to limit how much complexity we accept into our interpreters. Representing lists as vectors or even chunked lists thereof is reasonable and simple enough to implement. Representing lists as finger-tree ropes, OTOH, is far more dubious: it requires a significant amount of logic. Also, it wouldn’t offer significant benefits compared to modeling finger-trees above lists-as-vectors. An ABC runtime written in C has the advantage that we can easily reuse it outside of Wikilon, but we don’t want it to become bloated or difficult to re-implement.
But these challenges are surmountable: a lot of work, details to hammer out, but nothing new. I will probably study Haskell’s ‘accelerate’ for GPGPU computing, for example.