Pure Process Parallelism

For purely functional programming, Haskell-style par/seq parallelism has a simple API and is flexible. But it assumes lightweight threads and cheap scheduling. Unfortunately, par/seq stops working nicely when the overhead of forking a thread or communicating between threads grows.

Parallelism may be expensive for a number of reasons. If we want our software to scale to large distributed networks or clouds, for example, there will be some overhead for creating remote threads and for any communication between them. My current motivation is lower level: simple memory management. I favor bump-pointer allocation and a compacting collector (a modified Cheney’s algorithm) for reasons of simplicity, sequential performance, robustness, and predictability (no fragmentation issues). If I want compacting GC together with parallelism, the easiest solution is to give each thread its own ‘heap’ and to copy data as needed between threads.

Aside: The usual issues with a Cheney collector – the doubled address space and stop-the-world latencies – are non-issues here: address space is cheap on 64-bit systems, and with no side-effects there is no way to observe the latency. I also have a way for programmers to tell the runtime to cold-store data that isn’t getting much attention, so we can keep the heap small even when working with big data.

With expensive parallelism, the implicit goal becomes to control creation of and communication between ‘threads’. Existing models have this property – actors, CSP, flow-based programming, RPC, Erlang processes, etc. tend to involve a stable mass of code that receives data and either sends more messages or returns some other form of result. Of course, most existing models are also complected with notions of identity and concurrency.

For purely functional programming, I will extract just the essential feature: the partition between computation and communication. This is what I came up with:

A process function (PF) might be modeled as an anonymous function more or less of affine type `PF = arg → (result, PF)`.
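In Haskell notation this might be sketched as a recursive newtype (the names `PF`, `runPF`, and `counter` here are illustrative, not part of any existing library):

```haskell
-- A process function: consume one argument, return a result
-- together with the next incarnation of the process function.
newtype PF a b = PF { runPF :: a -> (b, PF a b) }

-- Example: a running total, i.e. purely functional 'state'
-- threaded through successive incarnations.
counter :: Int -> PF Int Int
counter total = PF $ \x ->
  let total' = total + x
  in (total', counter total')
```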

A process function is just a normal anonymous function, and its results won’t vary based on parallel evaluation. But we are free to annotate it for parallel evaluation. The behavior of this annotation is runtime dependent. For a Haskell system, this might just involve a lightweight par/seq wrapper. For the runtime I’m developing, it might involve lazily allocating a thread with its own mmap’d heap. For a distributed system, it might involve clouds and virtual machines.
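For the Haskell case, such an annotation might be little more than a `par` wrapper around the call. A minimal sketch, assuming the `PF` type above and the `parallel` package:

```haskell
import Control.Parallel (par)

-- Spark evaluation of the result in parallel with whatever the
-- caller does next; pure, so the result itself is unchanged.
callPar :: PF a b -> a -> (b, PF a b)
callPar (PF f) x =
  let (r, pf') = f x
  in r `par` (r, pf')
```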

When we call our process function with an argument, that may involve communicating the argument to a separate process. We immediately get handles to the result and the returned process function. Ownership of the allocated heap (if any) is shifted to the returned process function. Thus, the process function becomes a meaningful ‘process’ – able to hold onto data within its own private heap and control how much data is copied between threads, and with an implicit queue of incoming arguments.
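One way to picture this behavior, using ordinary Haskell concurrency primitives rather than a heap-per-process runtime (a sketch; `spawn` is a hypothetical helper, not the actual implementation):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar)

-- The spawned thread owns the process function and serves an
-- implicit queue of incoming arguments. Each call returns at
-- once with an MVar standing in for the eventual result.
spawn :: PF a b -> IO (a -> IO (MVar b))
spawn pf0 = do
  inbox <- newChan
  let loop (PF f) = do
        (x, out) <- readChan inbox
        let (r, pf') = f x
        r `seq` putMVar out r  -- force the result (to WHNF) in this process's thread
        loop pf'
  _ <- forkIO (loop pf0)
  return $ \x -> do
    out <- newEmptyMVar
    writeChan inbox (x, out)
    return out
```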

The immediate result is a lightweight placeholder. This might be represented typefully (e.g. a `Future result`), but it could be implicit like lazy or parallel values in Haskell. Accessing the actual result will require synchronization. But if we delay synchronization, we’re free to pass that placeholder around and activate other processes and thus achieve parallelism.
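Continuing the sketch above: so long as we hold off on `takeMVar`, we can activate several processes before synchronizing on any of them.

```haskell
import Control.Concurrent.MVar (takeMVar)

-- Both calls are issued before either result is demanded,
-- so the two processes run in parallel.
twoInParallel :: PF Int Int -> PF Int Int -> IO (Int, Int)
twoInParallel p q = do
  callP <- spawn p
  callQ <- spawn q
  futR  <- callP 1        -- placeholder; returns immediately
  futS  <- callQ 2        -- second process activated before any sync
  r     <- takeMVar futR  -- synchronize only when the value is needed
  s     <- takeMVar futS
  return (r, s)
```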

These process functions are very flexible. Pipeline parallelism is natural if we orchestrate static networks between parallel processes. Arguments, results, and the internal state of processes can be flexibly incremental. Of course, we’re still limited to purely functional computations – side-effects or networking must be modeled explicitly – but we can achieve high-performance side-effects together with parallelism by batching techniques or by separating parallel computation of a large message payload from its header. While we cannot directly alias processes – they have no identity – we can always model process identity and networking indirectly in terms of routing through lookup tables. The runtime implementation is also flexible. An M:N work-stealing model between processes and CPUs/heaps is an obvious choice. A distributed runtime could feasibly migrate processes based on various heuristics – where synchronization is occurring, cost of serializing messages and results, load balancing, etc.
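Pipeline composition of two process functions, for instance, is a small pure combinator; when the two stages are annotated as separate parallel processes, successive calls overlap (a sketch, reusing the `PF` type above):

```haskell
-- Feed each result of the first process function into the
-- second; both evolve independently from call to call.
pipe :: PF a b -> PF b c -> PF a c
pipe (PF f) (PF g) = PF $ \x ->
  let (y, pf') = f x
      (z, pg') = g y
  in (z, pipe pf' pg')
```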

Process functions generalize par/seq parallelism, in the sense that par/seq would essentially create a fresh process for each evaluation and then drop the returned function.
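In terms of the earlier sketches (hypothetical names), that degenerate case is simply:

```haskell
-- Wrap a plain function as a process, call it once, and
-- discard the returned process function.
oneShot :: (a -> b) -> PF a b
oneShot f = PF $ \x -> (f x, oneShot f)

parApply :: (a -> b) -> a -> b
parApply f x = fst (callPar (oneShot f) x)
```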

Expressing parallel programs with process functions instead of par/seq should improve scalability to alternative runtime implementations and distributed systems, and give more strategic control over parallelism in general.
