Exponential Decay of History, Improved

In an earlier article, I described exponential decay of history as an alternative to ring buffers. In this article, I’ll provide a full recap and an improved design.

Exponential decay of history is a pattern that competes with ring-buffers, least-recently-used heuristics, and other techniques that represent historical information in a limited space. Like ring buffers, exponential decay can keep useful historical information in a bounded space. However, ring buffers provide a flat history, e.g. the last ten elements (perhaps corresponding to a few seconds), which means they can quickly lose historical context (such as what was happening minutes ago, hours ago, days ago) that may be very valuable for making useful decisions. Exponential decay provides a deep history – potentially keeping years or centuries of context in a bounded volume. The tradeoff is losing much of the intermediate information.

Exponential decay essentially models information with a half-life, where the half-life can be measured in terms of a frame count (or an event count, though I disfavor events). For example, if the half-life is 50 frames, then a buffer of 1050 frames will keep historical context for twice as long as a buffer of 1000 frames. At 30 Hz, half-life 50, buffer 1000, the system will keep information for roughly 2^(1000/50)/30 seconds = 10 hours. If we bump the buffer to 2000 frames, it could keep one thousand years. (Aside: Users don’t need to cap the buffer size. If they don’t, then they lose bounded-space and real-time guarantees, and performance will predictably decay logarithmically with the system’s lifetime.)
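
For a quick sanity check, the retention arithmetic above can be written down directly. The following is a small sketch (the function name is illustrative, not part of the model developed below):

-- oldest retained information, in seconds, given a half-life (in frames),
-- a buffer length (in frames), and a frame rate (in Hz)
historyDepthSeconds :: Double -> Double -> Double -> Double
historyDepthSeconds halfLife bufLen hz = (2 ** (bufLen / halfLife)) / hz

-- historyDepthSeconds 50 1000 30  ~ 35000 seconds, about 10 hours
-- historyDepthSeconds 50 2000 30  ~ 3.7e10 seconds, roughly 1160 years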

When we model information with a half-life, it is essential to decide which half to keep. Ultimately, we need a Thunderdome-style loss function: two frames will enter, one will leave. The loss function can be modeled in many ways:

  1. We can blindly keep one of the two frames, e.g. the second one (`flip const`).
  2. We can select one frame by some preference properties, e.g. the longer-lived frame, or the more interesting one.
  3. We can merge two frames, e.g. perhaps keeping a summary of counts, minima, and maxima recorded in frames. It’s important that the merged frame is about the same size as the frames being merged, i.e. bounded space (or an approximation of it). (If we concatenated frames, we would effectively keep a complete history.) Options 1-3 are sketched just after this list.
  4. We can perform possibilities 2 or 3 within some context – e.g. a record of younger and older frames, perhaps some state. This context can help improve decisions about which information is redundant and which is interesting.
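
To make these options concrete, here is a minimal sketch in the pseudo-Haskell style used later in this article. It ignores the surrounding context for now (the full, context-aware loss-function type appears in the implementation section below), and the names are purely illustrative:

type SimpleLoss a = a -> a -> a   -- newer frame -> older frame -> merged frame

-- option 1: blindly keep one of the two frames, here the older (second) one
keepOlder :: SimpleLoss a
keepOlder _newer older = older

-- option 2: select a frame by some preference, e.g. a user-supplied score
keepPreferred :: (a -> Double) -> SimpleLoss a
keepPreferred score newer older =
    if score newer >= score older then newer else older

-- option 3: merge the frames into a bounded-size summary, e.g. keep extrema
mergeRange :: SimpleLoss (Double, Double)   -- each frame is a (min, max) pair
mergeRange (lo1, hi1) (lo2, hi2) = (min lo1 lo2, max hi1 hi2)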

Which loss function we need depends on the application. By tuning the loss function, half-life, and buffer size (and data type, of course), users of exponential decay of history can accommodate the needs of almost any application. The technique has an enormous range of potential uses. Consider:

  • system logs – logs (for services, operating systems, etc.) are often modeled in ring-buffers, e.g. a ring of two or three files that each keep up to 10MB of log events. The ring-buffer model is effective for bugs that occur at a higher frequency than the buffer is replaced, but exponential decay could efficiently enable diagnosis of very low-frequency or rare events.
  • video game save files – Today, people keep a few saves and tend to round-robin between them. Unfortunately, this can lead to a frustrating situation where players get ‘stuck’ – i.e. where all their saves are in a bad spot. Exponential decay could easily address this: each save slot is a ‘frame’, and (with a small half-life) the history could easily run the whole length of a game (even hundreds of hours) with just 10-20 saves. Players could then load from earlier points without difficulty. Usefully, this is also compatible with automatic saves.
  • memoization and caching – a simple technique to improve performance of many systems is to remember previous calculations (memoization) or to remember recent communications (caching). But memoization and caching both introduce a problem: how long do we keep the representation in memory? Keeping all representations is generally a waste of resources because there is no guarantee we’ll reuse a value. Common bounded-space techniques, such as round-robin or least-recently-used, enable easy (even accidental) construction of cases where they behave poorly. Exponential decay is a very nice alternative. Each frame represents a (lookup, representation) pair, and we add a new frame only if it does not already exist in the history. Frequent frames will naturally have more opportunities, and thus greater probability, of reaching the higher-numbered frame positions where they will be remembered for longer. Lower-numbered frames would generally correspond to the most recent lookup operations. Thus, over time, the cache achieves a valuable balance of the old-but-common tasks and the recent tasks. (A small sketch follows this list.)
  • document history and undo – Documents tend to undergo lots of small edits in short bursts. Providing those minute edits in the short term is useful for robust ‘undo’, whereas a coarse-grained history is valuable for long-term records. This aligns well with exponential decay of history, with each frame recording document state. Every document could be kept with a deep history in a bounded space that’s no more than a constant factor larger than the largest size reached by the document.
  • debugging with history – Debuggers for traditional procedural languages typically provide the current stack of operations, which offers some hints about what happened but isn’t so useful for recording callbacks. The stack really is not about history, but about the future – what is the program going to do next? This becomes obvious when languages are implemented in a continuation passing style (CPS) or with tail call optimization (TCO). These styles can hide even the small amount of historical context offered by a stack. I have even seen flame wars involving people objecting to CPS and TCO because of how much they hinder debugging. The history is important. One approach to address the issue is to keep a history more explicitly – e.g. recording activations of function calls with their arguments. But keeping history is expensive, so we limit the space. People solving this in the past have used a ring buffer. Unfortunately, an inner loop deep in some algorithm can easily overrun the whole ring buffer, so we still lose valuable history. If we instead used exponential decay to record activations, we could achieve much greater access to historical context. But we can do better! Instead of function-call events, each frame could be a snapshot of the program state. This would enable time-traveling debuggers to keep long histories in bounded space, making them usable even in deployed systems.
  • long-running statistics and attributes – There are many reasons we might wish to keep running statistics in a system, whether for performance profiling, world modeling, or diagnosis. Useful stats often include minima, maxima, counts, running totals, sums of squares, and so on. In an exponential decay of history model, each frame is a record of statistics, and the loss function is a trivial composition: the min of minima, the max of maxima, the sums of counts or totals. In addition to the normal statistics, each frame can be indexed with the period it covers, i.e. a min-time and a max-time. Basically, the loss function simply loses information about differences on the timeline. With a sufficiently large half-life and buffer, we can plot beautiful graphs.
  • live dashboards – a lot of application software is all about keeping humans aware of useful information in real-time. But humans are interrupted, take coffee breaks, and eventually need to sleep. In some cases (e.g. for security systems, military intelligence, or operator control of unmanned systems) someone else will need to take over. After returning from a break, or when taking over an intelligence station, there is one very common question: Catch me up. What did I miss? Naturally, nobody has the time to answer such a question in full detail. So it is naturally addressed with exponential decay. Potentially, we could add a ‘rewind’ option (perhaps presented as a scrollbar) to every user interface, enabling users to quickly observe how a system’s state has evolved in their absence, even to get fast-motion replays.
  • robust extension of stateful systems – there are many models for stateful systems: mutable imperative variables, state machines and their hundred variations, folds and integrals, term rewriting, process calculi, the actors model, relational databases, object databases, and so on. Unfortunately, most state models indirectly lead to software systems that are fragile under common conditions: maintenance, upgrade, extension, disruption. Software is specialized to maintain only the state that it needs. But those needs change. Software is developed with an expectation of reliable messaging, of observing every event. But then it is hibernated, frozen for debugging, disconnected. Ensuring state is non-local and observations are eventless will help a great deal, but we also want stateful extensions to (metaphorically) hit the ground running. A useful technique is to model an extension’s state as a function of the input history of the parent system, reducing the burden for specialized state per extension. Of course, maintaining and processing the entire history of inputs would be an impractical burden. But if we also address disruption tolerance, the extensions should have little trouble handling the approximate history reported through exponential decay.
  • machine learning – An interesting goal for a system that maintains history with exponential decay is to recover the complete history, or at least a good re-enactment of it. In this case, our loss function would have the goal of compressing the interesting and useful bits from the two frames into one frame, i.e. such that we can recover both frames. By ‘interesting’, I mean that you would not have predicted the value (and thus cannot recover it without recording it). By ‘useful’, I mean that it has predictive value. Information that is interesting but not useful is noise, i.e. difficult to compress yet of low utility. Access to context (e.g. the frames before and after) will be essential for computing which information is predictable in context. I believe that developing a good loss function by hand is infeasible, that we humans won’t be able to explain to a computer which details are useful, and which are predictable. But this is a suitable task for machine learning. In particular, it’s good for unsupervised machine learning (because we can easily test our ability to recover frames), and it’s good for deep machine learning (because our loss function will often be combining frames that have been combined before, possibly at different knowledge depths, thus necessitating multiple ‘layers’ of semantic knowledge and implying cross-layer training). If we develop it carefully, the resulting system will also support both classification (of objects AND actions) and ‘what if?’ questions (via fictionalizing the context). (By my estimate, machine learning and exponential decay should be very symbiotic, with the history providing the context necessary for intelligent deductions.)
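
As a small illustration of the memoization and caching item above, the following sketch treats each frame as a (lookup, representation) pair and only adds a frame when the key is absent. It assumes the History type and historyUpdate function defined in the implementation section below; the cache function names are hypothetical.

cacheLookup :: Eq k => k -> History s (k, v) -> Maybe v
cacheLookup k = lookup k . h_hist

cacheInsert :: Eq k => k -> v -> History s (k, v) -> History s (k, v)
cacheInsert k v h = case cacheLookup k h of
    Just _  -> h                      -- already cached; add no new frame
    Nothing -> historyUpdate (k, v) h -- new frame at the front of the history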

Exponential decay is well aligned with real world evidence models. That is, the real world records a lot of evidence for recent events, and progressively less evidence for past events. Since the amount of ‘state’ in the world is finite and does not grow over time (modulo the occasional meteoric impact), the evidence must decay exponentially. (It just doesn’t decay at a uniform exponential rate.) I believe it is also well aligned with intelligence in animals and humans, since the state and computational resources of a human brain are finite and bounded.

Implementing Exponential Decay of History

Exponential decay of history is one of those rare, great, simple ideas that could revolutionize software development.

Unfortunately, my prior article made it more complicated than it needed to be. In this article, I describe a simpler implementation with the following improved qualities:

  • The state of the history can be modeled as a simple list (instead of a stack of ring-buffers). Manipulations and configuration are simple and uniform.
  • The half-life can be tuned arbitrarily. Users provide it as a natural number, from which a pseudo-random index is drawn from the exponential distribution, indicating which frame (if any) should be collapsed when adding a new frame.
  • Because frames are collapsed in a probabilistic manner, the resulting model is robust against temporal aliasing. (The prior design could lose valuable information if a condition naturally aligns with periodic loss events. The new design will suffer aliasing only if it exists in the initial input, i.e. based on sensor framerates.)

Warning: The following model is non-type-checked pseudo-Haskell. It’s also a pure model for history, which might not be the best option if dealing with very large values or supporting persistence.

 -- a little context...
import System.Random (StdGen)
import Control.Exception (assert)

type RNG = StdGen
exponentialDist :: Integer -> RNG -> (Integer,RNG)
exponentialDist halfLife rg = (index, rg') where
    -- implementation assumed
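
-- One possible body for exponentialDist (a sketch, not from the original
-- design): draw u uniformly from (0,1] and take floor(-halfLife * log2 u),
-- so the probability that the index is at least k halves for every
-- additional halfLife added to k. Assumes randomR from System.Random.
exponentialDistSketch :: Integer -> RNG -> (Integer,RNG)
exponentialDistSketch halfLife rg = (index, rg') where
    (u, rg') = randomR (2 ** (-52), 1.0) rg :: (Double, RNG)
    index    = floor (negate (fromInteger halfLife) * logBase 2 u)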

-- pure history with support for stateful context 
data History s a = History
    { h_rand :: !RNG -- pure random numbers
    , h_loss :: !(HLossFn s a) -- see below!
    , h_hist :: ![a] -- the frame buffer 
    , h_state :: !s -- extra state and context
    , h_halflife :: !Integer -- approximate frame count to lose 1/2 info
    , h_maxlen :: !Integer -- how much history to keep?
    }

-- The Loss-function for a history:
--
--    state -> lowerFrames -> upperFrames 
--          -> aNewer -> aOlder -> (state', aMerged)
-- 
-- This loss function gets access to the historical context 
-- (state, lower and upper frames) along with the two frames it
-- is to smash together. It outputs an updated state and a new
-- combined frame. (I.e. the loss function cannot affect the
-- surrounding frames, but may influence state.)
--
--   state: arbitrary, user-defined, but really should have bounded
--      space. Might contain ML models, or recursively more history.
--   lowerFrames: frames younger than aNewer, ordered from the frame
--      nearest aNewer out to the newest (i.e. list zipper style)
--   upperFrames: frames older than aOlder, ordered from the frame
--      nearest aOlder out to the oldest (i.e. history order)
--   aNewer: the younger of the two frames being merged
--   aOlder: the older of the two frames being merged
--   state': the updated state.
--   aMerged: the result of combining. 
--
type HLossFn s a = s -> [a] -> [a] -> a -> a -> (s,a)
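
-- An illustrative loss function of the above type (a sketch, not part of
-- the model): it mirrors the long-running statistics application described
-- earlier. The StatFrame record is hypothetical. The merge ignores the
-- surrounding context and just combines summaries in bounded space.
data StatFrame = StatFrame
    { sfMin, sfMax, sfTotal :: !Double
    , sfCount               :: !Integer
    , sfTMin, sfTMax        :: !Double -- the period this frame covers
    }

mergeStats :: HLossFn () StatFrame
mergeStats s _lower _upper newer older = (s, merged) where
    merged = StatFrame
        { sfMin   = min (sfMin newer) (sfMin older)
        , sfMax   = max (sfMax newer) (sfMax older)
        , sfTotal = sfTotal newer + sfTotal older
        , sfCount = sfCount newer + sfCount older
        , sfTMin  = min (sfTMin newer) (sfTMin older)
        , sfTMax  = max (sfTMax newer) (sfTMax older)
        }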

-- A history-update will always add the given 'a' to the front
-- of the history. Then, it will typically merge two frames 
-- that are later in the history. Though, if the history is
-- smaller than its max length, it may increase its size.
historyUpdate :: a -> History s a -> History s a
historyUpdate aUpd h = 
    -- full implementation elided, but basically:
    --   use exponentialDist, halflife, and rng to obtain an index
    --   reduce selected update index if greater than maxlen 
    --   add the aUpd element to the front of the list. (aUpd:h_hist h)
    --   unzip and apply loss function to the elements at index and index+1...
    --     (unless there is no index+1 element)
    --   build and return the updated history 
    --     new hist (~ reverse lower ++ [aMerged] ++ upper ... but efficiently!)
    --     new state
    --     new rng
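
-- A possible sketch following the outline above (not the original
-- implementation, and not tuned for efficiency). The merge targets the
-- frames at positions ix and ix+1, counting from the front after the
-- new frame has been added; if no such pair exists, the history grows.
historyUpdateSketch :: a -> History s a -> History s a
historyUpdateSketch aUpd h = case rest of
    (aNewer : aOlder : upper) ->
        let (s', aMerged) = h_loss h (h_state h) (reverse lower) upper aNewer aOlder
        in h { h_rand = rng', h_state = s'
             , h_hist = lower ++ (aMerged : upper) }
    _ -> h { h_rand = rng', h_hist = hist' } -- nothing to merge; history grows
  where
    (ixRaw, rng') = exponentialDist (h_halflife h) (h_rand h)
    hist'         = aUpd : h_hist h
    overfull      = toInteger (length hist') > h_maxlen h
    ix            = if overfull then min ixRaw (h_maxlen h - 1) else ixRaw
    (lower, rest) = splitAt (fromInteger ix) hist'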

This is a very simple model. I would not be surprised if other people have developed something similar on their own, though I have not found references and exponential decay was novel to me when I developed it. I have already found applications for exponential decay of history in my day job, and I expect it will become one of a few cornerstone state models for RDP.

An interesting benefit of this probabilistic approach is that it can also be applied to temporal information. And we can help ‘smooth’ it out by simply computing multiple candidate locations for *where* to decay (e.g. picking three or seven possible decay targets) and then choosing the one that *loses the least information*.
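
For instance, one could sample a handful of candidate indices and keep the cheapest. The sketch below assumes the exponentialDist function from the model above; infoLoss is a hypothetical user-supplied measure of how much information a merge at a given index would destroy, and minimumBy/comparing come from Data.List and Data.Ord.

-- needs: import Data.List (minimumBy); import Data.Ord (comparing)
pickDecayIndex :: Int -> (Integer -> Double) -> Integer -> RNG -> (Integer, RNG)
pickDecayIndex nCandidates infoLoss halfLife rg0 = (best, rgN) where
    step (ixs, rg) _ = let (ix, rg') = exponentialDist halfLife rg
                       in (ix : ixs, rg')
    (candidates, rgN) = foldl step ([], rg0) [1 .. nCandidates]
    best = minimumBy (comparing infoLoss) candidates  -- nCandidates must be >= 1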

I encourage you to find applications for this state model, to make use of it in practice, to share and blog your own experiences.


15 Responses to Exponential Decay of History, Improved

  1. Pingback: Clojure and me » Decaying lists: log scale for lists

  2. johnicholas says:

    There might be a relevant analogy with maps. There is a famous new yorker cover, called the ‘view from 9th avenue’, which characterizes a few blocks of Manhattan as roughly the same size as China. http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue
    Of course, every location has a different (egocentric) map:

    It doesn’t immediately seem like a map, but a signpost with the direction to take for various small local regions as well as big remote regions, something like: “Northampton, Amherst, Points North go thisaway=>, Springfield, Holyoke, Points South go thataway=>” also is a map with this sort of exponential distortion. Routing tables are another example of this sort of signpost map.

  3. LE says:

    Very interesting stuff, and not by any means a novel idea sadly!

    I think one major application of this concept is in backups: deciding which snapshots to delete. I’m not sure if Mac OS X’s Time Machine technology employs a scheme like this. I wouldn’t be surprised if they did.

    I would love to see this implemented generically in Python, though I think it would be rather hard.

  4. Corey Smith says:

    This idea is very similar to the technique used to store time series data in rrd files:

    http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html

    • dmbarbour says:

      I think it fills the same niche, but with different characteristics. As an aside, it would be easy to tune between ’round robin’ and ‘exponential decay’ just by adding a constant to the exponential-decay index (or, equivalently, push the dropped frames from a round-robin database into the decaying history).

  5. Pingback: fogus: The best things and stuff of 2013

  6. Thanks for your article; here is a link to a similar idea: http://word.bitly.com/post/41284219720/forget-table

  7. pepijndevos says:

    So this is random right? I’m not good at math, but if you have a list of transactions with incrementing IDs, those IDs should be the derivative of the exponential distribution right? So I think you could just merge all the transactions with an ID higher than the derivative for that index in the list.

    • dmbarbour says:

      What context are you imagining?

      • pepijndevos says:

        A database like Redis+Datomic. In-memory persistent datastructures keeping history.
        I’m trying to make something like your random solution that is uniform instead. Think of it as a perfect exponential random number generator with the cardinality of the DB as the seed. I’m better with bits than with math, so I came up with this:

        (defn exponential-distribution
          "Between N and N+1, find the bit that became one.
           This exponent of 2 has 1 bit set.
           Decrement by one to set all bits below.
           Count the bits."
          [n]
          (Long/bitCount
            (dec
              (bit-and (inc n) (bit-not n)))))

        http://www.wolframalpha.com/share/clip?f=d41d8cd98f00b204e9800998ecf8427e7lbpl5m7qh

      • dmbarbour says:

        Look into “Tower of Hanoi backup”.

      • pepijndevos says:

        That looks a lot like what I’m doing. I haven’t found any algorithms though. Just the scheme itself.
        The limitation of what I’m doing is that it is limited to 64 items and exponents of 2.

      • dmbarbour says:

        It occurs to me that you might be better served by my original version of exponential decay – deterministic, non-random, based on a series of ring-buffers with some fixed fraction of each ring being passed to the next.

  8. Pingback: Logarithmic History v3 | Awelon Blue
