Large Scale Reactive Collections

A common challenge for reactive programming models is large collections with fine-grained mutations – e.g. arrays, lists, stacks, maps. Subscriptions tend to be ad-hoc, e.g. “the third element of the array” or “the top element of a stack”, or “the average of elements in the list”.

The common, simple approach to this challenge is to use coarse-grained persistent data structures to represent the collections, then filter them per each observer. The persistent structure can support consistency, insofar as each observer sees the same intermediate states even if another update is involved. This works very well for small to medium sized collections, and also works extremely well in cases where most of the collection is both updated and relevant to every observer (such as: pixels in a video frame).

But, as we scale collection size, this simple approach tends to have scaling problems:

  • In many common use cases (such as databases), the update rate for individual elements of a collection is based on external factors and more or less independent of absolute collection size. Therefore, a larger collection implies proportionately more updates.
  • Meanwhile, individual observers tend to be interested in a logarithmic slice or view of the collection, and thus a proportionately smaller fraction of the updates are relevant to a given observer.
  • Further, collections soon grow beyond the CPU cache, and eventually beyond the memory arena (at which point it starts spreading to hard disk or network), and thus observing fragments of the collection that did not change can easily thrash resources.

At scale, it becomes important to isolate changes of interest to the observer. Developers typically accomplish this by breaking the collections into multiple smaller chunks, then observing just those chunks that contain the relevant data. However, this often trades one set of problems for another:

  • The number of subscriptions grows as chunk size decreases. Ideally, the number of subscriptions is something like O(E*lg(N)) where ‘E’ is the number of eyes (aka observers) on the data, and ‘N’ is the number of chunks. Unfortunately, in practice (in my experience), the ad-hoc nature of subscriptions interferes: the chunking needed by some observers is orthogonal to the chunking provided by the developers, and the result is often O(E*N).
  • With many subscriptions, there are equivalently more update events, which leads to subtle consistency issues: how does the system guarantee that every observer sees updates in the same logical order? sees the same intermediate states?
  • Mutators must often update multiple chunks simultaneously in order to ensure consistency, which leaves an open question of how to ensure observers see these changes as simultaneous (and thus glitch-free).
  • There is a new problem of properly ‘dividing’ ad-hoc query and mutator expressions so they touch the right chunks.

The advantage of this trade is that this new set of problems is solveable, assuming developers can be convinced to part ways with their imperative programming models. A language designer can tackle the issues of dividing queries and mutations between chunks. Logical simultaneity and consistency is possible by use of a temporal model. (Transactions aren’t good enough because they don’t ensure different observers see consistent intermediate states.)

The ‘wrong chunking’ problem is tackled by use of intermediate data models and shared views. I.e. if we assume that most (e.g. 60%) observers are well-served by the default data model, then it stands to reason that most (e.g. another 60%) observers would be well served by an alternative representation of the same data, and so on. If this pattern continues, then maintaining just four data models would cover 97% of use cases. (This is not unrealistic; if anything, 60% is a bit conservative.) Of course `0.97*E*lg(N) + 0.03*E*N` is still O(E*N), but we can add another data model, and another, and another… as many as we need.

The cost structure should ultimately reduce to O(E*lg(N) + N*lg(E)), which is a globally scalable cost model. Of course, there will certainly be redundant efforts between observers in building these intermediate data models, but assuming we can also tame that redundancy to a logarithmic quantity then our total costs should be O(E*lg(N) + N*lg2(E)), which is still acceptable for global scale data processing.

Controlling redundancy of these intermediate data models – on a global scale – raises a whole slew of issues such as open systems, composition, modularity, service and resource discovery, disruption tolerance, security and webs of trust. I discuss in an earlier article a few of the various issues of surrounding independent data models (for inter-dependent data). Cross-cutting concerns tend to be deeply intertwined. But I didn’t write this article to talk about modularity.

Reactive Memory Models

Memory – and I mean in the sense of RAM and virtual memory, addressed rather than associative – is, essentially, one big mutable collection, albeit with tightly constrained data type (e.g. fixed-width integers) and structure.

Interactions with memory tend to be rich and ad-hoc: allocation, initialization, shared memory mutation, eventual recycling to conserve limited memory resources. An essential feature of addressed is that the content type can also serves as an addressing type. This becomes the basis for modeling of ad-hoc data, the notion of indirection and indices and cost modeling, and so on.

True memory access, even true mutation, is completely unnecessary to benefit from a modeling memory. For example, Oleg Kiselyov uses an IntMap as a simple memory model to represent double-linked lists in Haskell, i.e. modeling mutation in a pure ‘heap-passing’ style.

After writing my last article, about reactive state transition (RST), I got to thinking about reactive memory models – which ultimately what prompted this article.

The idea is roughly:

  1. RST is used to model the state of each ‘cell’ in memory.
  2. Every RST cell contains an integer, possibly of fixed width.
  3. RST cells are identified within a conceptually eternal directory structure.
  4. Therefore, to model memory, the directory structure – its path type – is also an integer. Developers are able to interpret some integers as values and others as pointers, based on context or possibly by use of ‘tag’ bits (e.g. even is integer, odd is pointer). All the normal idioms apply.

I don’t inflict a ‘flat’ memory model on developers, since that’s mostly inconvenient. Developers can take advantage of sub-directories to represent objects or arrays (which would then be ‘growable’ without conflicts in the parent’s address space), or use them as intended – to securely partition an infinite space to prevent name conflicts. (In a sense, that would be like process address spaces.)

Developers thus have options when representing things like ‘lisp’ cells. They could either use adjacent addresses, e.g. ‘4’ represents cells 4 and 5, or they might use ‘4’ to represent the first element and subdirectory 4/0 to represent the second.

Between Reactive State Transition and Reactive Demand Programming, many important issues are resolved: ad-hoc queries, consistent views, consistent multi-cell updates, conflict and collaboration between writers, even ‘continuous’ writes and maintenance, and ‘anticipation’ of future state.

Efficiency ultimately relies upon developers modeling (and observing) relatively stable indexes, such as B-trees or BSP trees, to support multiple views of data. State of the root nodes would change more slowly than the leaves, and thus cause fewer updates to propagate back to the observer.

It would be inconvenient (and probably inefficient) to shoehorn everything into an integer-based memory model. I imagine that, much like the IntMap-based double-linked list, developers would use the memory model where it’s convenient and use integers to identify some other resource (maybe a tuple-space or demand monitor) to cover feature gaps.

This entry was posted in Concurrency, Language Design, Reactive Demand Programming, State and tagged , , , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s