Data Model Independence

One of the most pervasive and stubborn challenges for open systems development is composing, fusing, adapting, and integrating subsystems that use heterogeneous data models for similar tasks. By data model I mean how we group, share, describe, and update data in a system – similar to a SQL or XML schema, or a domain-specific language. This includes modeling users and domain objects. A data model is usually associated with a data set describing a subsystem.

Why is this difficult? Let me count the ways:

  1. scatter-gather: A data model tends to group related blocks of data (e.g. into tables or records). However, a different data model will often group data differently. Thus, a lot of scatter/gather work is necessary to translate from one model to another.
  2. incremental: The data models we’re most interested in composing tend also to be in the process of changing. For ‘small’ data sets, we can translate the entire model after every update (i.e. batch processing). However, as the system and data sets scale (which happens to be a natural consequence of composing them!) the cost of translating the whole model grows and the rate of updates increases. Thus, for systems of even moderate size, it becomes essential to support incremental updates. Unfortunately, incremental computation is not generally composable – it is impossible, in general, to define a function of the form ((s -> s) -> (s -> t) -> (t -> t)) that lifts an update on the source model through a translation into the corresponding update on the target model. Thus, achieving the incremental property across compositions will require that we constrain how models and updates are expressed.
  3. bi-directional: most interesting data models and views are bi-directional, i.e. the client expects to influence the model (e.g. via updates, commands, constraints, and demands) and the world. Translating and fusing data-models will inherently require translating and fusing their control paths. Updates are subject to the same scatter-gather efforts as the data being updated, and must similarly maintain the incremental composition properties.
  4. queries and views: it is infeasible to load and process a large data set every time we have a question. Doing so would often be inefficient even for small data sets. So our data models must structure the data to support common queries; the goal is to load and process only the subset of the model that might contribute to the answer. When we compose or translate data-models, we must maintain this property, loading no more of each source model than necessary. In a sense, a model corresponds to an effectful function of query->view. An interesting possibility is for queries and views to be the same sort of thing (or easily translated), thus allowing for sequential composition of models (where the view from one data model becomes a query or control operation on the next).
  5. lossy: A number in one data model might correspond to a combination of numbers from another. A word might have a subtly different meaning. This lossy nature affects translations in one direction and updates in the other, and requires us to embed heuristics and best-effort strategies. These informal aspects hinder automating the development of translation code. This lossiness also discourages code reuse – i.e. if we have a lossy A->B adaptation, and a lossy B->C adaptation, we might want a dedicated lossy A->C adaptation to avoid the extra loss introduced by the middle-man. I don’t know how to avoid lossy translations, though I believe the problem can at least be mitigated with extensible data types and systematic use of ‘side-channels’ (e.g. to propagate meta-data or context, similar to footnotes).
  6. concurrency and consistency: data models in open systems will be subject to concurrent updates – i.e. multiple users, sensors, and dependencies on other concurrent data models. This results in a lot of concurrent, incremental updates propagating through the system. These updates will tend to propagate at different rates to different observers (due to thread scheduling, variable network latency, and different intermediate data models). Unfortunately, incremental updates are not generally commutative; naive use of update arrival-order would result in as many ‘models’ as there are observers. So the challenge here is to control this concurrency, in order to ensure incremental updates are processed consistently and predictably, and in both directions. Whatever concurrency-control mechanism we choose must compose and scale as we integrate more data models and subsystems together. (And we really cannot tolerate any concurrency model that would subject us to denial-of-service or priority inversion.)
  7. disruption and resilience: incremental updates won’t make much sense to a new observer, nor to one who has missed an update due to network failure. Under such circumstances, we want our system to be resilient, enabling a disrupted or new observer to ‘catch up’ to the same consistent views observed by everyone else. This might be achieved by keeping a history of updates, or by favoring a more RESTful approach where we share the ‘current state’ of a model. For long-lived systems it will be important to bound the amount of history we process, though a hybrid of RESTful snapshots and incremental updates is certainly acceptable. If we bound the amount of history, then intermediate views and models cannot be allowed to accumulate state because, if they did, the system would tend towards inconsistency due to each view’s particular startup time and history of disruptions. In addition to needing a composable mechanism for resilience, we must be able to easily recognize disruption or communication failure.
  8. security model and user privileges: in an open system, we usually don’t expose a data model (especially one that controls things) to any arbitrary user. We are constrained in our ability to compose data models by our security model. I favor the object capability model because it’s the only one I know of that allows ad-hoc composition without requiring a lot of special privileges. I might need a separate article to fully explain this.
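
To make a few of these points concrete, here is a minimal Python sketch. It is my own illustration under stated assumptions, not RDP and not a general solution. The two models (`Source`, `Target`), the `Move` update, and every function name are hypothetical. It shows a batch translation that regroups data (point 1), an update expressed as plain data rather than as an opaque function so that it can be translated by hand to the target model (one way of constraining how updates are expressed, per point 2), and two updates that do not commute (point 6).

```python
from dataclasses import dataclass

# Hypothetical source model: one entry per person (name -> city).
Source = dict[str, str]
# Hypothetical target model: the same facts grouped by city (city -> names).
Target = dict[str, set[str]]


def translate(src: Source) -> Target:
    """Batch translation: regroup the whole source model (scatter/gather)."""
    out: Target = {}
    for name, city in src.items():
        out.setdefault(city, set()).add(name)
    return out


@dataclass(frozen=True)
class Move:
    """An update expressed as data, not as an opaque function s -> s."""
    name: str
    city: str


def apply_source(src: Source, u: Move) -> Source:
    """Apply the update to a copy of the source model."""
    new = dict(src)
    new[u.name] = u.city
    return new


def apply_target(tgt: Target, u: Move) -> Target:
    """Hand-written translation of Move for the target model.

    Only the affected groups are rebuilt. Finding the old group still scans
    the cities, because this target model keeps no index by name.
    """
    new = dict(tgt)  # shallow copy; affected sets are replaced, never mutated
    for city, names in tgt.items():
        if u.name in names:
            remaining = names - {u.name}
            if remaining:
                new[city] = remaining
            else:
                del new[city]
            break
    new[u.city] = new.get(u.city, set()) | {u.name}
    return new


src = {"ann": "Oslo", "bob": "Rome"}
tgt = translate(src)
u1, u2 = Move("ann", "Rome"), Move("ann", "Lima")

# For this update type, the incremental path agrees with the batch path.
assert apply_target(tgt, u1) == translate(apply_source(src, u1))

# Arrival order matters: observers that see u1 and u2 in different orders
# end up with different models.
a = apply_source(apply_source(src, u1), u2)
b = apply_source(apply_source(src, u2), u1)
assert a != b
```

Note that `apply_target` had to be written by hand for this particular pair of models and this particular kind of update. Nothing in the sketch derives it from `translate` and `apply_source`, which is the difficulty described in point 2.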

Despite being a monumental problem, data model independence is essential. Heterogeneous data models will always exist. First, there is the essential sort of heterogeneity we get from describing and controlling different domains (e.g. music and rhythm analysis vs. robot kinematics) and composing them in unanticipated ad-hoc fashions (e.g. make a robot play the drums or dance). Second, there is the essential heterogeneity we get from innovation within domains, such as supporting new classes of sensors, actuators, payloads, protocols, and platforms; any ‘standard’ we create will accommodate only a subset of what we need in the future. Third, the world is simply too big for an architect to grok – i.e. I will guarantee that any proposed standard for robot control will have failed to effectively accommodate at least one existing payload or device.

When language designers ignore this issue, they’re just leaving a difficult challenge to regular developers to solve incompletely and repeatedly.

Effective, secure, and scalable support for composing heterogeneous data models has been a desideratum of my language design efforts for many years, and has heavily influenced the development of reactive demand programming. I won’t claim a panacea, but RDP will significantly ease the burden on developers when composing heterogeneous data models.

This entry was posted in Concurrency, Distributed Programming, Language Design, Live Programming, Modularity, Open Systems Programming, Reactive Demand Programming.
