Logarithmic History v3

Exponential decay, resulting in log-scale lists, is an excellent way to work with an unbounded amount of history in a bounded amount of space. I’ve described this in two previous articles [1][2]. In my first article, I modeled logarithmic decay via a sequence of ring buffers: each ring would lose (or merge) one of its items rather than passing it forward. In my second article, I expressed a concern that ring buffers might lose information that occurs at a regular frequency, and modeled the decay probabilistically instead. The latter ‘improved’ form of exponential decay is certainly easier to implement, and could be generalized in interesting ways (such as time-based collapse). But the original design had a nice property too: it was extremely predictable.

In this article, I describe an even simpler model:

  1. Choose a maximum number of entries for your history buffer.
  2. Whenever you reach the maximum, decimate your history.
  3. To decimate: Take groups of ten entries. From each group, kill one.

Trivial, eh?

The word ‘decimate’ means to kill one man in every ten to punish or subdue a group. This results in a 90% survival rate. Repeatedly decimating our history will give us exponential decay. Operating on groups of ten keeps the operation simple, and ensures the decay process is relatively uniform and predictable. Of course, the particular choice of tenths is arbitrary and could be adjusted without loss of generality.
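
To make this concrete, here is a minimal sketch in Haskell (the language I’m using elsewhere in this project); `pickVictim` is a placeholder for whichever selection policy we settle on below:

    -- Minimal sketch of decimation: split the history into groups of ten and
    -- drop one entry from each full group, for a 90% survival rate per pass.
    -- `pickVictim` stands in for the selection policy (random, hashed, or
    -- heuristic); it returns an index into the group.
    decimate :: ([a] -> Int) -> [a] -> [a]
    decimate pickVictim = concatMap cull . groupsOf 10
      where
        groupsOf _ [] = []
        groupsOf n xs = let (g, rest) = splitAt n xs in g : groupsOf n rest
        cull g
          | length g < 10 = g                         -- partial trailing group survives
          | otherwise     = let i = pickVictim g `mod` 10
                            in take i g ++ drop (i + 1) g

    -- Keep a bounded history: decimate whenever we hit the chosen maximum.
    record :: Int -> ([a] -> Int) -> a -> [a] -> [a]
    record maxLen pickVictim x history
      | length history' >= maxLen = decimate pickVictim history'
      | otherwise                 = history'
      where history' = x : history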

The decision, then, is how to choose one element from each group to destroy.

The probabilistic approach is to roll a ‘die’!

The probabilistic approach can be deterministic in a simple way: use a hash function on each group of entries to roll a die for that group. This sort of technique would be convenient when working with replicated histories, distributed databases, and so on.
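
As a hedged sketch of that deterministic ‘die’, here is an FNV-1a-style hash over a group’s printed entries; any stable hash that all replicas agree on would serve equally well:

    import Data.Bits (xor, (.&.))
    import Data.Char (ord)

    -- Deterministic 'die roll': hash the group's rendered entries and reduce
    -- modulo the group size. Replicas that share this function will always
    -- delete the same entry from the same group.
    hashVictim :: Show a => [a] -> Int
    hashVictim g = fromIntegral (h .&. 0x7fffffff) `mod` max 1 (length g)
      where
        h = foldl step (2166136261 :: Word) (concatMap show g)
        step acc c = (acc `xor` fromIntegral (ord c)) * 16777619

Plugging `hashVictim` into the `decimate` sketch above gives the same survivors on every replica.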

A non-probabilistic model might heuristically select one entry to erase based on estimated loss of information. For example, if we’re erasing camera frames, we might favor deleting a frame that looks most similar to another frame in the group. Or if our history consists of a robot’s path, we might favor deleting entries where the robot was basically sitting still or moving in a straight line.

Of course, merging is also a fine idea. We could choose two items to merge, ideally in an associative manner such that merge(a, merge(b, c)) = merge(merge(a, b), c); this can be done with weighted averages, by keeping counts, sums, and sums-of-squares, and more generally by developing a monoidal type for history entries.
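
A minimal sketch of such a monoidal entry type (the field names are mine, purely illustrative):

    -- A mergeable history entry: counts, sums, and sums-of-squares form a
    -- monoid, so merge(a, merge(b, c)) = merge(merge(a, b), c) regardless of
    -- how the decay process groups its merges.
    data Summary = Summary
      { sCount :: !Int
      , sSum   :: !Double
      , sSumSq :: !Double
      } deriving (Show)

    instance Semigroup Summary where
      Summary c1 s1 q1 <> Summary c2 s2 q2 = Summary (c1 + c2) (s1 + s2) (q1 + q2)

    instance Monoid Summary where
      mempty = Summary 0 0 0

    -- Lift one observation; statistics can be recovered from any merged summary.
    observe :: Double -> Summary
    observe x = Summary 1 x (x * x)

    mean :: Summary -> Double
    mean (Summary c s _) = s / fromIntegral c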

I developed this particular approach to logarithmic history to support Wikilon. Logarithmic history offers a happy medium between keeping a complete history and keeping only the most recent version of each page/word. It does have a few disadvantages. For example, you can’t get a “permanent link” to a historical version of a word. But you can at least browse historical versions of the wiki/dictionary, get a story for how a system evolved, and access old versions of ideas. And the long-term space requirements are very predictable.

Posted in Language Design | Leave a comment

Introducing Wikilon

I have decided to move forward with a wiki-based development environment for Awelon project, under the name Wikilon. Wikilon will be implemented as a Haskell web service built on Warp, using acid-state for persistence and WebSockets for liveness and reactivity.

Wikilon can overcome several limitations that chafe in the current text-editor + console REPL environment. Because Wikilon will be a persistent web service, it can take much more advantage of incremental and persistent computation – e.g. recomputing test results only when the definition of a word used by that test has changed. Wikilon can also serve as a much better platform for long-running behaviors, thus allowing me to move forward with reactive demand programming. Finally, a web browser provides a richer canvas than a console REPL, enabling me to pursue graphical outputs, animation of code execution, interactive programs, and embedded literal objects.
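
As a rough sketch of the incremental idea (names are hypothetical, not Wikilon’s actual API): keep a reverse-dependency index from each word to the words and tests that use it, and invalidate only the transitively affected cache entries when a definition changes.

    import qualified Data.Map.Strict as M
    import qualified Data.Set as S

    type AOWord = String

    -- Reverse dependencies: for each word, the words (including `test.` words)
    -- whose definitions use it directly.
    type Dependents = M.Map AOWord (S.Set AOWord)

    -- Everything transitively affected by editing `w`; only these cached
    -- results (test outcomes, compiles) need to be recomputed.
    affectedBy :: Dependents -> AOWord -> S.Set AOWord
    affectedBy deps w = go (S.singleton w) (S.toList (direct w))
      where
        direct x = M.findWithDefault S.empty x deps
        go seen []     = S.delete w seen
        go seen (x:xs)
          | x `S.member` seen = go seen xs
          | otherwise         = go (S.insert x seen) (S.toList (direct x) ++ xs)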

Before now, I’ve held off on developing a wiki-based development environment.

My hope was to bootstrap the language, and implement the web services in my own language. But it seems the further I get along with developing the AO dictionary, the further I have to go. There is an enormous amount of logic involved in warp, acid-state, filesystem IO, and the like. Rather than slogging through all that, I now feel Awelon project and my mental health would be better off were I to jump straight to visually entertaining applications, perhaps in the style of tangible functions or Logo writer.

Bootstrapping can happen later.

Wikilon will be user-extensible, in the sense that developers can add new web services or web applications simply by defining new words in the AO dictionary. AO leverages ad-hoc naming conventions, e.g. words with prefix `test.` indicate tests that we should run. In this case, new web services might be installed by defining a word like `wikilon/foo` to process HTTP requests on the /foo path. It’s essentially an in-language variation of CGI. However, AO is much easier to secure than CGI, and also subject to partial evaluation (e.g. for a given set of query parameters). Eventually, these web services will provide upgraded interfaces for editing the wiki… and cross-compilation to stand-alone services or applications.

Wikilon shall still provide a REPL-like service, perhaps via websockets.

But I’d like to shift towards a more persistent, reactive testing environment, in a style akin to the IPython notebook. As a developer edits definitions for words, they should be able to immediately observe the effects on whichever notes they’re viewing. A given note shall simply be another word in the dictionary, hanging around for anyone who wishes to view it. Similarly, all those `test.` words should run automatically without pressing any buttons, and ideally we should be able to visualize how test and note inputs interact with whichever words we’re editing.

(It could be useful to automatically generate and propose tests, e.g. based on observed failure cases.)

Wikilon’s persistence will make it a lot more scalable than my current approach of parsing the entire dictionary every time I start up the REPL. Granted, this isn’t difficult. But while my current approach might scale to perhaps 10k words, even a naive implementation of Wikilon should easily scale to a million. A single Wikilon instance might host thousands of projects and users. But rather than a ‘global’ wiki, I expect we’ll want private forks and DVCS-like push/pull relationships, with public wikis serving a role similar to Hackage.

The idea of user-programmable wikis isn’t new. But I believe AO is better suited for this role, e.g. with respect to security, portability, concurrency, and composition. AO was designed with wiki-based development in mind.

Posted in Language Design, Live Programming, Open Systems Programming, User Interface | 1 Comment

Awelon Progress Report VIII

In the seven weeks since my last report, I’ve primarily focused on growing my AO dictionaries. This was my intention.

Seven weeks ago, I had 1298 words. I’m now up to 2395 words (an 85% increase), including 605 documentation words, 134 test words, and 310 Lisp-based `car/cdr/cdaddr` words (in five variations, including a functional `setcadadr!`). Many of these are undocumented intra-functional words, e.g. for the word `foo` that processes a tree I might implement multiple words of the form `tree.foo`, `leaf.foo`, and `node.foo`. The word `foo` then arranges inputs on the stack and applies the fixpoint.

In addition to the trivial Lisp-based words, development was in four main areas: streams, processes, pseudo-random number generation, and binary search trees.

I’m not entirely satisfied. I had hoped at this point to be finished with finger tree sequences and maps based on self-balancing binary search trees. But I’ve only gotten so far as prototyping maps with a plain old unbalanced tree.

In addition to my primary efforts, my secondary efforts include:

  • Shift my ad-hoc document-based ‘TODO’ lists into the GitHub issue tracking system. (incrementally ongoing!)
  • Efficient binary representations in ABC, at least for storage and transmission. This issue kept distracting me when I was working on words for bit-stream processing.
  • Candidate algorithms for compression for ABC resources. I’m leaning towards a variation on LZSS at this point that guarantees 1:64 worst-case expansion (in addition to the 1:128 overhead for large binaries). Most of ABC should compress very well with LZSS.
  • Developed a fixpoint combinator with static transport overhead. Before, fixpoint required 2N+8 bytes to fixpoint an N-byte block. It now requires N+30 bytes and is more amenable to optimization and compression. This avoids exponential expansion for layers of recursive processes, and should result in more efficient storage and transport of ABC.
  • Attempting to more clearly express my ideas, especially why I’m developing a new bytecode to start with. I’ve expressed some of them in the ‘code is material’ post, though that only hits one aspect and doesn’t really touch on the importance of streamable code to my vision for Awelon project. I’ve also tried to tweak the AboutABC opening a bit.
  • Developed a model for cryptographic value sealing that is simple in both concept and implementation, and that I feel is promising enough to prototype. Progress here was triggered largely by the development of efficient binary representations sufficient for cryptographic values. Value sealing is a much more accessible model of cryptography than experienced in most conventional languages.

While I might wish for faster progress, I’m certainly feeling the difference these last couple months have made.

What’s Next?

I shall continue developing the dictionary as my primary effort. Despite my progress, I’m still a few layers of abstraction below implementation of type systems, grammars, constraint models, optimizing compilers, web services, abstract runtimes and VMs, and so on.

Eventually, around 5k-10k words, I’ll start to feel the limits of my current implementation. Every 1k words takes about 0.1s to load into memory, more if using the rather ad-hoc prototype of separate compilation.

When it becomes intolerable, or (more likely) if I feel enough progress has been achieved to develop web-apps and an ABC-to-JavaScript compiler in AO, I’ll shift my efforts towards re-implementing AO and ABC development environments as web services, with a more persistent database that can be incrementally updated, tested, etc. and never needs to be reloaded. I have many ideas for a more visual ‘live programming’ development environment that could be implemented in a browser, in addition to a wiki-like experience.

Compilation to native code, C, OpenCL, LLVM, or similar will still be an important target. Kyle Blake is developing abcc, an ABC compiler (currently targeting C), as part of a graduate project.

Posted in Language Design | Leave a comment

Binaries in ABC (or UTF-8)

Awelon Bytecode (ABC) doesn’t have good direct support for binary data. Binaries are not uncommon: cipher texts, secure hash values, compressed visual or audio data, raw sensory data, and so on. Efficient storage and communication of binaries is a desirable feature.

This morning, I had a spark of an idea: a compression algorithm can be specialized to recognize a large sequence of base64 data and simply re-encode it as a run length of binary. This simple technique cleanly separates the concerns of simple representation at the language layer vs. efficient storage and transmission of binary data.

But base64 is hardly relevant in this case, is it? We could use the simpler base16 and reap similar benefits.

Awelon Bytecode is encoded in UTF-8, and may use the full UTF-8 character set via embedded text. A useful property of UTF-8 is that it has thirteen unused byte values: 0xC0, 0xC1, 0xF5..0xFF. It is feasible to simply usurp one of these bytes for a compression pass specialized for UTF-8 that is expected to embed base16. Proposed format:

  • header byte: 0xF8
  • single length byte L, encoding 3..256 bytes (0xFE,0xFF not used)
  • thus encoding 6..512 base16 characters (always an even number)

Thus, we have UTF-8 with efficiently embedded base16 binaries. The overhead for large binaries is less than 1%. The break-even point with base64 is 6 bytes. The encoder requires 512 bytes of lookahead, and the decoder requires no buffer. I can easily imagine applying this idea to HTML+JavaScript. Further compression remains quite viable.

For ABC in particular, I’m considering use of a specialized base16 alphabet `bdfghjkmnpqstxyz`. This is mostly to avoid spelling offensive words (no vowels) and avoid interference with numbers (0-9) or ABC data plumbing (mostly `vrwlc`).
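
To make the proposal concrete, here is a small Haskell sketch of the encoder’s inner step: packing one recognized run of this alphabet into the 0xF8 escape form. Run detection and passthrough of ordinary UTF-8 are omitted, and the names are mine.

    import qualified Data.ByteString as BS
    import           Data.ByteString (ByteString)
    import           Data.List (elemIndex)
    import           Data.Maybe (fromJust)

    -- The proposed base16 alphabet: no vowels, no digits, none of `vrwlc`.
    abcBase16 :: String
    abcBase16 = "bdfghjkmnpqstxyz"

    -- Pack one run of 6..512 alphabet characters (always an even count) as:
    -- header byte 0xF8, a length byte encoding 3..256 bytes, then raw bytes.
    packRun :: String -> ByteString
    packRun run
      | odd n || n < 6 || n > 512 = error "expected an even run of 6..512 characters"
      | otherwise = BS.pack (0xF8 : fromIntegral (numBytes - 3) : packNibbles nibbles)
      where
        n        = length run
        numBytes = n `div` 2
        nibbles  = map (fromIntegral . fromJust . (`elemIndex` abcBase16)) run
        packNibbles (hi:lo:rest) = (hi * 16 + lo) : packNibbles rest
        packNibbles _            = []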

Posted in Distributed Programming, Language Design | 2 Comments

Code is Material

‘Code is data’ is a metaphor around which the Lisp communities (and REBOL, and others) have built programming languages. The metaphor is effective in these languages because code is presented and represented as a simple structure which may easily be manipulated and processed as data, or evaluated; this is leveraged for development of macros, fexprs, and DSLs. In context of other languages, e.g. C++ or Java, the ‘code is data’ metaphor is much less effective. Metaphors are useful insofar as they shape our tools, our intuitions, our communities.

‘Code is material’ is a similar metaphor, an alternative to ‘code is data’. I’m not aware of any community that consciously uses the ‘code is material’ metaphor, but it does seem applicable in retrospect to flow-based programming and some visual dataflow languages, and perhaps a few esoteric languages like Snusp. By ‘material’ I mean to invoke the intuitions surrounding physical substances used to build other products – e.g. wood, metal, glass, rope. We have many intuitions surrounding materials:

  • materials are portable and fungible by nature (and often available in bulk)
  • behavior of a material is intrinsic or based on universal physical laws
  • materials can be composed or shaped into new materials (molecules, fabrics, bricks)
  • materials vary widely in quality, robustness, ease of use, etc.
  • we don’t ask whether materials are meaningful, only whether they are useful

Of course, code does have an advantage over physical materials: code can be cheaply and precisely replicated.

‘Code is data’ and ‘code is material’ stand in opposition with respect to the presence of an ‘interpreter’. Data is interpreted. It is possible to have multiple interpretations of the same data. Materials aren’t interpreted. The behavior of materials is intrinsic or based on universal laws.

For distributed programming or mobile code systems, I like the idea that my code should have the same behavior everywhere. For reasoning, refactoring, and optimization purposes, I like the ability to transform code in systematic ways while preserving behavior. For modularity and security purposes, I like the possibility of hiding implementation details or information within code and closures. These desiderata are better supported if the behavior of code is universal, independent of the interpreter.

I imagine, if we take the ‘code is material’ metaphor far enough, that we would be working with composable bricks or structures of code in a virtual reality or augmented reality programming environment, replicating them as needed. Occasionally we might disassemble and reshape a structure for use in a particular problem (perhaps manipulating embedded literal widgets). There would be very little need for a conventional programming language; programming could be approached more as mechanical tinkering. The resulting code-is-material structures may be given part numbers based on secure hash, or directly shared with friends and communities. Applications would consist of devices or tools, themselves constructed of code, of an appropriate type to be ‘installed’ or ‘equipped’ in the programmer’s environment.

Awelon project is aiming towards such a future. The Awelon Bytecode (ABC) is an excellent foundation for the ‘code is material’ metaphor. Awelon Object (AO) is a Forth-like language, allowing users to define words. But those words do not remain entangled with the environment: every AO word is compiled into a finite and environment-independent sequence of ABC, and thus is easily treated as a named ‘brick’ of material, readily ported to other programming environments.

ABC has many useful properties that make it especially suitable for the ‘code is material’ metaphor:

  1. ABC is highly compositional. Concatenation is sequential composition. Concurrent composition is possible due to the causal commutativity requirement. Composition of conditional behaviors is available because ABC favors sum types instead of closed if/then/else behaviors.
  2. ABC code interacts with its environment locally. Every operator manipulates a local value. ABC lacks the conventional heap, pointers, and remote references. Thus, ABC’s behavior is similar to interactions between physical materials.
  3. ABC has atomic and molecular structure: at the lowest level, there are about forty bytecodes that can be understood as atoms. A small, reusable sequence of bytecodes can be understood as a molecule.
  4. ABC’s molecular structure is very linear, similar to genetic code. It is quite feasible to search for new ABC materials by genetic programming, substituting and mixing reusable ‘genes’ (of appropriate type) in a larger program, which might then construct other objects. We can deeply leverage ABC with biological material metaphors (wood, thatch, etc.).
  5. ABC is streamable, in contrast to the common stored-program concept. A streamable program cannot jump backwards. ABC cannot jump at all. (Loops are modeled using fixpoint combinators; no jumping required.) This allows ABC to model not just ‘solid’ materials such as bricks, but also unbounded ‘fluid’ materials, such as command streams or user input.

One of the underlying ideas for Awelon project is the ability to extract reusable code – e.g. user macros and tools – from the user input stream. This is a form of programming by example. And the envisioned Awelon project environment is certainly leaning towards the VR/AR system described earlier. I’m happy to have found a simple metaphor to describe almost every idea underlying Awelon project: code is material.

Aside: A related metaphor is that ‘programming languages are materials’, occasionally presented in opposition to ‘programming languages are tools’ [1]. But I feel it isn’t just the language that is the material; it’s also the frameworks, the libraries… the code in general.

Posted in Distributed Programming, Language Design, Modularity, Open Systems Programming, User Interface | 14 Comments

Awelon Progress Report VII

The two months since my last report have been productive, albeit not in the directions I was aiming to proceed. Much of my work was exploratory. I’ve gained little of lasting value, but at least I’ve gained a little.

In June, I developed a graph-based intermediate representation for compiling Awelon Bytecode (ABC) into Haskell, with the idea of eliminating intermediate data-plumbing (e.g. the `lrwzvc` operations) and perhaps supporting a fair bit of partial evaluation (e.g. so `#42` results in the number 42). My efforts here were moderately successful. I was executing microbenchmarks in about 40% of the time compared to the previous naive JIT. Though, this isn’t so impressive when I reframe it as a mere 20% improvement over the interpreter. Unfortunately, I never did get tail calls working properly in the resulting Haskell code.

I gradually grew frustrated. I attribute my frustration to an ad-hoc mix of the long edit-compile-test cycle in Haskell, the pathetic performance of the existing JIT code, the fact that I was not making progress developing the AO dictionary, and my inability to precisely compile subprograms.

By late June, I had a set of ideas for moving forward:

  • shift responsibility for the ABC to Haskell compiler into AO (i.e. to further develop the AO dictionary)
  • enable AO to specify separate compilation on a per-word level, i.e. so words can act as ‘hard’ components
  • enable JIT modules to reference one another, i.e. so we can better reuse memory resources
  • properly unload JIT modules, e.g. between ‘paragraphs’, so they don’t accumulate indefinitely in memory

Enabling modules to reference one another was easiest, so I did that in a couple hours – simply shifting all modules into a single root directory (albeit with a toplevel prefix like `Awx.Byz.` to avoid a super-wide flat directory). Modules are already named via secure hash of the compiled bytecode. Presumably, I could now compile code of form `{#secureHashOfBytecode}` to import the appropriate Haskell module.

My initial idea for separate compilation in AO was to compile words starting with `#`, e.g. the word `#foo` would be compiled separately. I implemented this idea without much difficulty, introducing an additional Precompile step. In retrospect, the `#` prefix criterion proved aesthetically ugly and painful to use, needing whole-dictionary refactoring just to tweak performance. The ‘use prefixes’ idea also composes and extends poorly, since new prefixes would bury older ones. After some brainstorming (annotations? separate file? dictionary driven?) I eventually decided to try a ‘directive words’ concept: if you define `compile!foo`, the AO system will separately compile the word `foo` (if it exists). I still have mixed feelings about this, but it works well enough.

At some point during these developments, I got very side-tracked by network implications.

Awelon’s approach to separate compilation and linking is non-conventional in order to support open distributed systems and streaming code. Naming reusable subprograms by {#secureHash} can save a great deal of bandwidth and more effectively reuse memory. And programmer-directed separate compilation of select words in the AO dictionary offers an effective jump start to the use of these resources in the short term. (I hadn’t expected to make use of the separate compilation model this early in AO’s development.) I got to wondering about cloud services, especially in their role as content distribution networks. If we want to store ABC resources with an untrusted proxy or CDN, we must encrypt the bytecode, and keep the decryption key in the name. I wrote an article about this idea, achieving Tahoe-LAFS-like provider-independent security for ABC resources. Further, if we’re going to encrypt the code, we can’t really compress it afterwards, so we should probably compress beforehand to save storage and bandwidth. ABC code should compress remarkably well. But compression algorithms seem to introduce a lot of ambiguity in their decision making (e.g. where to break up blocks)… which could be problematic for deterministic resource naming. I experimented with a few ideas. I’m still not sure which compression algorithm I’d favor.

I eventually tabled the whole compression and encryption effort for now, using the identity function for both compression and encryption. But I’m sold on the idea, and I plan to proceed with it eventually. ABC resources encrypted by default should prove an enormous boon for distributed programming.

I didn’t make any progress at all on re-implementing the ABC-to-Haskell compiler in AO, nor have I gotten around to properly unloading JIT plugins (plugins are disabled at the moment). I’m starting to wonder whether compiling to Haskell is a dead end. The interpreter is reasonably fast, and perhaps I should be shifting efforts towards a true bootstrap, e.g. targeting C or OpenCL or LLVM or JavaScript.

Secondary Efforts

While compilation and JIT were the bulk of my efforts, I’ve been doing a little design work on the side.

Back in April, I mentioned that I wish to also work on automatic type checking, i.e. to detect those silly errors that often pop up when we’re editing code. I’ve been keen on trying pluggable type systems. Awelon project implies a certain degree of structural and substructural typing (e.g. numbers, pairs, sums, unit, blocks, affine, relevant, sealed values). But I think we can go a lot further by making sets of ‘type hypotheses’, e.g. recognizing that some structure is essentially used as a list or a stack.

My current idea is to represent multiple type systems in a single AO dictionary, using a simple naming convention (a prefix) to identify type systems. Each type system will have some opportunity to output results about each program – e.g. types, failures, indecision, errors and warnings. These reports are returned to the programmers, who can decide what to do with them (ignoring is a possibility, e.g. if an experimental type system is known to give bad results). In a good IDE, type feedback for multiple type systems may be continuous as we make edits on a word.

This might be combined with a generic approach to type declarations, whereby we specify that some block of code should have a given type via standard ‘type system’ API. The main purpose here is the ability to declare parametric polymorphism and similar properties that are otherwise difficult to infer.

Another idea I’m working on is embedded literal objects, which allows extending Awelon project’s literal types beyond simple numbers and text, and may prove very useful as a basis for embedded DSLs and rich data description.

What’s Next?

Right now, more than anything else, I need to grow my AO dictionary.

My work on that front has fallen way behind in the last several months. A 392-octet microbenchmark isn’t enough to observe the memory reuse and caching benefits for separate compilation and linking. And, realistically, I need a much richer dictionary before I can effectively approach bootstrap or JIT. Further, ABC has a remaining approach to parsimony and performance that I haven’t even started to develop: Awelon Bytecode Deflated (ABCD), using higher UTF-8 bytecodes to predefine a standard dictionary of popular or performance-critical functions. To develop ABCD requires a massive and maturing ABC code base.

I’d like eventually (this year) to work on the RDP interpretation of ABC. Fortunately, RDP isn’t too difficult to implement, except for integrating external systems (filesystems, databases, sensors and actuators, etc.). Unfortunately, that integration with a runtime can be somewhat painful. RDP will take some time to make useful, and I’d rather not spend the next couple months in Haskell runtime code. (Especially since I’d also like to gradually deprecate the Haskell runtime and bootstrap a new one.)

My focus for at least the next six weeks will be the dictionary, which I’ll grow by developing simple projects. I’ll certainly need data structure support if I’m ever to advance my bootstrap efforts.

Posted in Language Design | 2 Comments

Embedded Literal Objects

When a PL is designed for a text editor, it is natural that text is the richest embedded literal. The text editor provides the visualization and update HCI. And text can certainly be useful; beyond modeling a subset of human input or output, it can model embedded DSLs such as regular expressions. But text is ineffective for many kinds of data we might wish to express in our programs – e.g. 3D meshes, logos, diagrams, graphs, grammars, matrices, decision trees, scene graphs, sound models, stage data in a 2D video game, dance notation, skeletal animation, spreadsheets, and so on.

Conventionally, working with these richer models requires external tools – e.g. Maya, Blender, Photoshop. However, use of separate tools has many disadvantages, such as poor support for abstraction, partial evaluation, and integration. It can also be inconvenient to load separate tools for a quick peek or edit.

A few not-so-conventional languages, such as Pure Data, are designed for work on a canvas and provide support for diagrams as literals. And then there are fully interactive programmable systems, such as the Croquet project, where we manipulate first-class objects. The latter has advantages for flexibility and programmability, but also disadvantages with respect to entangling the code with the runtime.

I’ve been interested for a while in supporting flexible literal types for my Awelon project. Last week, I was inspired toward a simple model that I call embedded literal objects (ELO). An ELO is effectively just a block with some conventions surrounding it for display and update, leveraging edit-time computation. Relevant bullets:

  • represented by a block of pure bytecode embedded in source code
  • fixpoint closure type, e.g. `µELO.(Command→(ELO+OtherResult))` (see the sketch below)
  • commands are either updates or queries
  • the IDE processes updates against the ELO and writes the resulting ELO back into source code
  • queries can obtain menus of commands, render to a canvas, or access data
  • development of commands and queries is left to convention
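
A rough Haskell rendering of that closure type, with hypothetical command and result types just to pin down the shape (the actual vocabulary is left to convention, per the last bullet):

    import Data.ByteString (ByteString)

    -- Hypothetical command and result types; the real vocabulary is a matter
    -- of convention between ELOs and the IDE or runtime.
    data Command     = Update ByteString | Query ByteString
    data OtherResult = Rendered ByteString | Menu [String] | Value ByteString

    -- The fixpoint closure type µELO.(Command → (ELO + OtherResult)):
    -- updates return a replacement ELO, queries return some other result.
    newtype ELO = ELO { runELO :: Command -> Either ELO OtherResult }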

Like text literals, an ELO can be copied and pasted, version controlled, and readily reused or re-edited across projects. There is no entanglement with the program context. Unlike text literals, an ELO potentially has much richer structure for easy composition, decomposition, and abstraction. We’re likely to develop ELOs that measure many megabytes in size. Though, a lot of redundant logic, e.g. for command menus and rendering, could be compressed via the bytecode-layer linking model.

For current Awelon project languages, the proposed serialization of ELOs is quite trivial:

〚raw awelon bytecode here〛          in AO
[raw awelon bytecode here]{&elo}l   in ABC

The dedicated syntax for ELOs in Awelon Object (AO) language helps the IDE recognize them for special treatment. I’m using the unicode white variant square brackets: U+301A and U+301B. The resulting bytecode is just a plain old block with an extra annotation `{&elo}` indicating its nature for bytecode level debuggers or extraction. The final `l` operation (type `(a*(s*e))→((a*s)*e)`) shifts the embedded literal object onto the AO stack like all other AO literals.

I love the idea that, modulo partial evaluation, these ad-hoc literal objects remain recognizable at the Awelon Bytecode (ABC) layer. This is valuable for both debugging and extraction at the bytecode layer, and also is consistent with ABC’s existing approach to pseudo-literal numbers (e.g. `#42` is a sequence of three bytecodes that generate the number forty-two on the stack) and embedded text (shorthand for a list of small integers). When programs fall to pieces, comprehension starts at the bedrock, which for Awelon project is ABC. Every little bit of consistency, simplicity, transparency, and legibility at the bytecode level will help developers.

A new ELO could easily be generated from the AO dictionary as it stands. An AO IDE might also provide a menu of ELOs by searching for words starting with prefix `elo.`.

Processing ELOs

For text-based embedded DSLs, we typically use a command to interpret or compile the text. For example:

"x y z → y^2 + x^2 - |y|" doTheMath

Of course, text works well for only a small subset of DSLs, such as regular expressions. Math really isn’t among them if we want rich symbology, matrices, big division bars, and so on. An Awelon system can potentially optimize this behavior by systematic application of partial evaluation, i.e. such that we do not process the string on every execution.

With ELOs, we have greater flexibility. An ELO has its own logic, data, and can represent both rich and abstract data structures. We might ask an ELO to compile itself, or extract a value, or query a specific field, or to render itself at runtime. An ELO exposes the exact same set of commands to the runtime as it does to the IDE. ELOs can be treated as simple values, copied as many times as needed.

Updating ELO Types

Individual ELOs can be updated via the available commands. But how do we update the set of commands? For example, if we have an ‘image’ type with a canvas and a bunch of editing commands, how do we add new brushes and commands? I can think of several approaches:

  1. extensible command set: ELO command sets may be extensible and replaceable by convention, via specific commands
  2. explicit import/export: ELOs may provide an ‘export’ command that is readily imported by the next version of the ELO
  3. adapter pattern: wrap the ELO in an adapter that can translate commands and responses, at cost of performance
  4. extract from ELO: bring debuggers to bear upon the bytecode and surgically extract the information we care about.
  5. extract from history: record commands that updated an ELO, and replay them on the new version of the ELO
  6. IDE layer extension: recognize common ELO types (e.g. based on query for type info) and install command sets at IDE layer

I suspect the answer to the ELO type update problem will be some ad-hoc mix of all of the above. The first two are preferable, and are ultimately achieved through convention and best practices in ELO design. The adapter pattern isn’t ideal, but can work well enough if not layered too deeply. The last few lean heavily on the integrated development environment, but might serve as effective fallbacks. Having a history of commands on ELOs would be useful anyway for extracting user macros into reusable subprograms.
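
For instance, here is a sketch of the adapter pattern (option 3), reusing the hypothetical ELO and Command types from the earlier sketch:

    -- Adapter pattern: wrap an old ELO so it accepts a newer command vocabulary.
    -- `translate` maps new commands onto the old command set; updated ELOs are
    -- re-wrapped so the adapter persists across edits.
    adaptELO :: (Command -> Command) -> ELO -> ELO
    adaptELO translate (ELO run) = ELO $ \cmd ->
      case run (translate cmd) of
        Left updated -> Left (adaptELO translate updated)
        Right result -> Right result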

In any case, the update problem is analogous to modifying text literals for an embedded DSL after a language version update. Developers will systematically update all instances until they get bored of doing so and start designing with forward compatibility in mind. At that point, they’ll update only the literals that need updating. Likely, there will be a recognizable ELO maturation life cycle.

The Way Forward

At the moment, my Awelon Object language is still trapped in a text editor, and ELOs won’t do me much good.

My intention is to eventually upgrade to editing code via a web browser, perhaps in a wiki-like environment. ELOs aren’t my only goal; I’m also interested in hyperlinks, live programming, automatic visualization of the tacit environment (perhaps from live tests), interactive documentation (tutorials, etc.), and so on. I think ELOs will become much more viable in such an environment.

There are a lot of details remaining. For example, I’ll want to develop an API for rendering to canvas, possibly inspired by HTML or SVG. Animations and sound are interesting areas for literals; I’d love the ability to ‘hear’ whatever sound model I’m looking at, or see animations running in a loop. For 3D or larger 2D structures, the ability to rotate, pan, zoom, etc. could be useful.

Posted in Language Design, Modularity, Types, User Interface | 1 Comment

LZW-GC Streaming Compression

LZW is a simple compression algorithm – certainly the simplest, most deterministic I could find in my investigations. And here I use ‘deterministic’ in the sense that there are no implementation-dependent decisions, ambiguities, smarts or heuristics. LZW has only one real decision to make: dictionary size. And for my application, which requires highly deterministic compression, I’m willing to fix the dictionary size.

But LZW has a weakness: after the dictionary is filled, no new patterns are recognized. And the dictionary, at this point, is very poorly utilized: it holds many patterns that are not useful.

There are existing approaches to mitigate this weakness, e.g. including a ‘clear code’ to reset the dictionary, or compressing chunks at a time. However, these tend to require some heuristic decisions that undermine LZW’s wonderful determinism. They also tend to lose old but valid patterns.

In this article, I propose a simple alternative: garbage collect the dictionary with an exponential decay factor.

A preliminary (but flawed) approach:

  1. To every element in the extended dictionary, add a match count.
  2. Increment match count on match (even when we don’t output the token).
  3. When dictionary is full, output current match token & garbage collect.
  4. To garbage collect (GC), remove zero-match elements & cut match counts in half

LZW is quite optimistic when growing its dictionary. In practice, we can expect that many entries go unmatched, and thus will be available for GC. LZW attempts to grow its dictionary after every output token. Thus, we should expect to perform a GC roughly once every dictionary-size outputs. The exponential decay factor results from cutting match counts in half, thus favoring newer patterns over older ones.

The flaw of this approach is that not all patterns have equal opportunity to find matches. Information about patterns discovered just before GC is more likely to be lost. This flaw can be mitigated by incremental GC. We can GC exactly one token just before every allocation. In a round-robin fashion, we can halve the match count of tokens until we find one with zero count. That’s the one token we collect.
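
A sketch of just that incremental, round-robin GC step (in Haskell, with the dictionary’s string entries and the LZW matching loop omitted; only the token bookkeeping is shown):

    import qualified Data.IntMap.Strict as IM

    -- One incremental GC step, run just before each dictionary allocation:
    -- starting at the round-robin cursor, halve match counts until a token
    -- with zero count is found; that token's slot is reused for the new entry.
    -- Returns (freed token, next cursor, updated match counts). Only the
    -- extended (collectable) tokens are covered; literal byte tokens are not.
    gcStep :: Int -> Int -> IM.IntMap Int -> (Int, Int, IM.IntMap Int)
    gcStep numTokens cursor counts = go cursor counts
      where
        step i = (i + 1) `mod` numTokens
        go i cs
          | IM.findWithDefault 0 i cs == 0 = (i, step i, cs)
          | otherwise                      = go (step i) (IM.adjust (`div` 2) i cs)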

This can potentially result in unusual cases where we ‘match’ patterns that never existed in the original input, i.e. because the dictionary has some intermediate nodes that were collected. This is unlikely but okay; it doesn’t hurt anyone. The dictionary doesn’t change between outputs, and we’ll always be able to decode any token the way the compressor meant it.

To decode, for each input token we recursively increase the ‘match count’ for that token and all its dependencies. We also perform a round-robin GC after every input, and keep track of the free token. This needn’t be very sophisticated: we’re simply reproducing the same dictionary used to generate the output.

This extended form of LZW, which I’ll just call LZW-GC, is very similar in some ways to sliding-window compression algorithms. But I think the dictionary is simpler, especially on the encode side. Theoretically, LZW-GC should be superior to a sliding window of equivalent size because LZW-GC can make effective use of older patterns that are long gone from the window. To complete LZW-GC, we should follow it with a Huffman encoding to compress frequent tokens. This is independent of GC at the LZW layer.

Where’s the Code? The code is available on GitHub, along with compression results.

It turns out that LZW-GC is about the same as LZW in most use cases, but significantly better for a few very, very large inputs. I’ll need to explore some ways to speed up dictionary construction, such as adapting LZAP (a variant of LZW).

Posted in Language Design | Leave a comment

ABC Linking with Provider Independent Security

As I’ve mentioned in other posts, Awelon Bytecode (ABC) has a non-conventional approach to code reuse. A valid sequence of bytecode can be given a deterministic, cryptographically unique name – e.g. a secure hash. In ABC, the code to invoke this external bytecode resource might be represented with `{#secureHashOfBytecode}`. Logically, the identified bytecode is inlined in place. In practice, this indirection provides a fine opportunity for separate compilation and caching.

In the case of distributed systems, a secure hash offers a potential advantage of location independence. You can download the identified code from anywhere, then validate it against the secure hash. Unfortunately, the code itself is exposed to whoever stores it, and in general code can contain IP-sensitive information. So we cannot upload the code just anywhere. With just a secure hash, we’re restricted to trusted hosts for storing code. I would prefer a property the Tahoe-LAFS community describes as provider-independent security – i.e. that the host cannot inspect the data. This way, we can cheaply rent a few untrusted cloud servers to handle the bulk of data distribution.

It turns out that only a small tweak to ABC’s naming system is needed to support provider independence:

  • use the secure hash of the bytecode as a symmetric encryption key
  • use the secure hash of the cipher text as the lookup key
  • store the cipher text, indexed by its lookup key
  • access via `{#secureHashOfCipherText:secureHashOfBytecode}`

I’ve studied computer security a lot from the PL perspective, but I don’t consider myself an expert in cryptography. To help validate this solution, I compare it to Tahoe’s approach. Essentially, I need a tiny subset of Tahoe’s features: I don’t need mutable files, I don’t need erasure coding or K-of-N recovery, and I don’t need large multi-segment files (I can decompose code at the ABC layer).

Tahoe’s documentation describes: “The capability contains the encryption key, the hash of the Capability Extension Block, and any encoding parameters necessary to perform the eventual decoding process. For convenience, it also contains the size of the file being stored.” And here the Capability Extension Block is derived from hashing the ciphertext – albeit, with a layer of indirection to support incremental loading of multi-segment files.

ABC’s model for a bytecode resource capability is very similar. The main difference is that ABC’s encryption key is determined from the bytecode content. And there is no capability extension block indirection, no segmentation at that layer. If there is a weakness in this design, it’s because the encryption key is easily computed from content. But, assuming the hash really is secure and independent of the encryption algorithm, I don’t believe that’s a feasible attack.

Having the final capability text be deterministic from content is valuable for interning, caching, reuse of popular software components (templates, frameworks, libraries, etc.). I would be reluctant to give it up.

Recently, I’ve been deciding on a few concrete encodings.

I’m thinking to use SHA3-384 as the underlying secure hash algorithm, but using the first 128 bits for secureHashOfCipherText and the last 256 bits for secureHashOfBytecode. This should maintain the full original uniqueness; indeed, if our encryption function were the identity, we’d recover the original 384-bit secure hash. Each of these secure hashes will be encoded in ABC’s base16 to leverage a special compression pass for embedded binaries. At this point I favor AES-CTR (with a zero nonce) for encryption of the bytecode.
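
Roughly, the naming pipeline would look like the following sketch (using Haskell’s cryptonite package purely for illustration; compression, decryption, and validation are omitted, and none of this is final):

    import qualified Data.ByteString as BS
    import           Data.ByteString (ByteString)
    import           Crypto.Hash (hashWith, SHA3_384 (..))
    import           Crypto.Cipher.AES (AES256)
    import           Crypto.Cipher.Types (cipherInit, ctrCombine, nullIV)
    import           Crypto.Error (throwCryptoError)
    import           Data.ByteArray (convert)

    -- SHA3-384 as a raw 48-byte string.
    sha3_384 :: ByteString -> ByteString
    sha3_384 = convert . hashWith SHA3_384

    -- Name a bytecode resource: the last 256 bits of the bytecode's hash serve
    -- as the AES-256 key; the first 128 bits of the ciphertext's hash serve as
    -- the lookup key. Compression (which would run before encryption) is omitted.
    nameResource :: ByteString -> (ByteString, ByteString, ByteString)
    nameResource bytecode = (lookupKey, key, cipherText)
      where
        key        = BS.drop 16 (sha3_384 bytecode)    -- secureHashOfBytecode (256 bits)
        aes        = throwCryptoError (cipherInit key) :: AES256
        cipherText = ctrCombine aes nullIV bytecode    -- AES-CTR with a zero nonce
        lookupKey  = BS.take 16 (sha3_384 cipherText)  -- secureHashOfCipherText (128 bits)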

Compressing the bytecode prior to encryption is potentially a worthy investment. Doing so could save in storage and network overheads. It’s infeasible to compress after encryption. And a primary consideration is that the compression algorithm should be highly deterministic (almost no room for choices at compression time). A special compression pass will allow embedding binaries in ABC, and a more conventional compression pass will tighten up repeated sequences of bytecode. (Worst case expansion for uncompressible large binaries should be under 2.5%.)

Anyhow, my original plan was to just use `{#secureHash}` resources and use a distinct encoding and pipeline for the encrypted variation. Now, I’m leaning towards just using the encrypted, provider-independent security naming convention at all times, even when the object never leaves the local machine. Decryption, decompression, and compilation are essentially one-time costs (if we use a good cache) and compilation will tend to dominate. The overhead for provider independent security is essentially negligible.

Convergence Secrets (Addendum)

Zooko – a primary developer on Tahoe-LAFS – notes in the comments below that convergent encryption has a known security weakness to ‘confirmation’ attacks if the objects contain sensitive low-entropy information. Actually, confirmation attacks are a potential problem even without the encryption, but they become a few orders of magnitude more serious with encryption because we’ll be exposed to ‘offline’ attacks. Still, convergence is valuable in the common case for Awelon project (e.g. caching and reusing popular library code), so I’d be reluctant to abandon it.

Fortunately, this can be addressed by adding some artificial entropy. This can be done on a per-resource basis, so long as we have some policy to identify which resources are sensitive and thus should be protected. (In context of Awelon Object (AO) language, perhaps a dictionary word `foo` might be considered sensitive whenever a directive word `secret!foo` is defined.) Unlike Tahoe-LAFS, the secret doesn’t need to be part of the protocol; it isn’t a problem to compile it directly into the bytecode, e.g. by simply adding a random text literal then dropping it:

"bpgzmjkydmfyfdhdhptzmnbpdjkndjnsgdjyzpxbsypshqfk
~%

Of course, there are many advantages to determinism. So, rather than random texts, I might instead favor a secret per user and use an HMAC-style mechanism to deterministically generate a convergence secret per unique sensitive resource.
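
A sketch of that derivation (again with cryptonite’s HMAC; how the resulting secret gets encoded and spliced into the bytecode as a dropped text literal is left out):

    import           Data.ByteString (ByteString)
    import           Crypto.Hash (SHA3_384)
    import           Crypto.MAC.HMAC (HMAC, hmac)
    import qualified Data.ByteArray as BA

    -- Derive a deterministic convergence secret per sensitive resource from a
    -- single per-user secret, HMAC-style. The raw MAC bytes would then be
    -- encoded (e.g. in ABC's base16 alphabet) and embedded as a dropped text.
    convergenceSecret :: ByteString -> ByteString -> ByteString
    convergenceSecret userSecret resourceName =
      BA.convert (hmac userSecret resourceName :: HMAC SHA3_384)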

Posted in Distributed Programming, Language Design, Modularity, Open Systems Programming, Security | 5 Comments

Compiling ABC/AO

I’ve spent most of this month developing a more sophisticated compiler for Awelon project, based on first translating the code into a dataflow graph, then translating to Haskell. This allows easy use of Haskell’s variables, and more effective use of pure subprograms – which, presumably, GHC knows how to optimize. At the moment, runtime performance only slightly exceeds interpreted performance for a simplistic microbenchmark. But I have yet to implement a few simple optimizations on the intermediate graph. Progress on the compiler has been painfully slow, involving a lot of backtracking and exploration of designs (though it’s only a couple thousand lines of code). Anyhow, this post isn’t primarily about performance or progress.

I’ve been thinking about how I should expose compilation to programmers. This post is just me thinking out loud. You’re welcome to join in.

Currently, I support dynamic compilation by use of a `{&compile}` annotation. Annotations must have identity type and no effect observable within the program, but are able to support debugging or tweak performance – e.g. parallel evaluation, or compilation of a block, or runtime assertions. In this case, obviously, we’re compiling a block. But use of an annotation is very explicit, and can be somewhat awkward – e.g. requiring constructs like `[foo] compile inline` instead of just `foo`.

There is a good case for explicit dynamic compilation, especially in context of DSLs and metaprogramming. Leaving these to implicit just-in-time (JIT) compilation has always seemed awkward to me; often the programmer does know best. But there is also a good case for implicit compilation, based on traces or hotspots or similar.

At the moment, in context of Awelon Bytecode (ABC), I’m not sure how to go about implicit JIT. In conventional languages, source tends to have a stable name or location in memory, which is easily used to keep a counter to identify hotspots or traces. ABC is designed for streaming code, and ABC blocks don’t really have any name or identity beyond their structure, and loops often involve compositions of code. Awelon Object (AO) code does support names (it’s essentially a macro language for ABC) but still doesn’t have recursive loops. (I’m also hesitant to compile AO directly, because development in that direction might hinder development for streaming bytecode.)

One possibility is to leverage a streaming compression and dictionary building algorithm to gradually identify long sequences of ABC code that are seen often, then compile those common sequences to support implicit JIT. I feel this is worth exploring, but it also seems like a lot of work for potentially marginal gains.

Anyhow, I’m also interested in static compilation, and easily mixing static and dynamic components. Dynamic compilation is useful in this endeavor, but I feel there is more I could do. This morning, I had another idea (which should have been obvious in retrospect):

I can use naming conventions in the AO dictionary to drive static compilation.

AO already uses naming conventions to support automatic testing. (Words prefixed with `test.` become automatic tests.) Gradual development of de-facto naming conventions – e.g. to support typechecking, documentation, property testing, optimizations or rewrite rules, exporting applications – has been part of AO’s design from very near the beginning. The idea is that naming conventions do not change the formal meaning of a word (the macro expansion into ABC), but they do suggest contextual application from an IDE or larger system.

The idea is that I could use a prefix such as `#` or a suffix such as `.exe` to suggest that words of the form `#foo` or `foo.exe` should be statically precompiled (in a runtime where this is enabled) for use within the runtime. Whether this is a good idea, or ultimately just as awkward as explicit dynamic compilation, remains to be seen. But it does appeal to me. It allows appropriately coarse-grained compilation for reusable components, and doesn’t hinder performance when interpreting code. It can also enable effective separate compilation, reuse, and code sharing (which might result in better cache behavior).

A potential advantage of this `#foo` approach is that it mitigates a weakness of AO: exponential growth of generated code.

The formal semantics of AO is simply the inline expansion of each word’s definition until there are no more words. All that remains, at that point, is an ABC stream. Keeping these extremely simple semantics is why AO forbids recursive function definitions – there would be no limit to simple inline expansion. Loop behavior is instead expressed non-recursively by use of a fixpoint combinator (which admittedly isn’t very convenient, but developers can build while/until/foreach/etc. loops above fixpoint).

Anyhow, this inline expansion means that generated bytecode tends to grow exponentially in size with depth and layers of abstraction. This can be partially mitigated for streaming code because we can expand lazily. But not all Awelon project subprograms are streamed, and in general exponential expansion can stress system memory and CPU cache. By utilizing precompiled words at expansion time (as opposed to dynamic compilation, which applies after expansion) it seems reasonable that stress on memory and cache could be reduced. Further, I could integrate this with ABC’s own approach to separate compilation and linking.

Awelon project has a non-conventional and interesting approach to separate compilation, linking, and resource sharing:

Instead of importing human-named modules and functions, we can identify a software component by the secure hash of its bytecode. For example `{#ngv6gnrqki5vgnyxtodjorlrwoiinmfk74gg3ftwj7byrqratmlomruoce57}` might identify bytecode that implements a fibonacci function. Theoretically, there is an opportunity here for collision in names. But, in practice, that probability can be reduced well below the threshold of unlikely problems like cosmic rays and meteor strikes. The convention is that this invocation shall search for a resource, download and validate it if necessary, then cache and compile locally. But this convention can be tweaked for circumstance, e.g. shifting the compilation task to a trusted proxy. If the resource cannot be obtained, the program simply fails. If space is needed, old or infrequently used resources (or perhaps just their compiled objects) can be discarded from cache. Search locations for download might be extended by use of annotations.

Use of a secure hash instead of human names avoids a lot of problems: update and versioning concerns are cleanly separated, there is no cache invalidation, we can download the code from anywhere, there is no trouble using proxy repositories. The design is highly compatible with open and distributed systems and streaming code. The hash can even serve as a secure capability of sorts, depending on the protocol by which we download the resource. I believe Awelon’s design is theoretically superior to most conventional approaches, but it has yet to be implemented.

In context of precompiled words, it seems feasible to start using this technique directly: code of the form `#foo` might compile to a `{#secureHash}` resource instead of expanding fully into the underlying ABC. (For debugging, this might become `{#secureHash (#foo)}`.) This would make concrete what’s happening with respect to space savings, separate compilation, and reuse, i.e. because each compiled element has a fixed size in the ABC stream. Providing developers the control here could prove immediately useful. Long term, however, division of a program into optimal reusable subcomponents might be mostly automated.

So… coming back to JIT.

The earlier difficulty with JIT for ABC was the lack of a stable location or names. However, these precompiled words should serve as moderately effective JIT targets. Naively, we might simply compile each word the first time it is required. But more sophisticated approaches involving counters or whatnot are certainly feasible.

I think I’ll give these precompiled words a try. Reviewing this article, it seems I favor the `#` prefix for precompiled words, perhaps due to the allusion to `{#secureHash}` resources. Hopefully, this really will help with performance by reducing the memory and cache burdens.

Addendum 30 June

I’ve played around with using word prefixes to guide compilation, but I find this fairly inconvenient and rigid, requiring too much editing of the dictionary to test slightly different configurations. An alternative I’m now considering is to use a separate configuration value, which simply contains a list of words to precompile (one per line).

Posted in Language Design, Modularity, Open Systems Programming | Leave a comment