ABC Linking with Provider Independent Security

As I’ve mentioned in other posts, Awelon Bytecode (ABC) has a non-conventional approach to code reuse. A valid sequence of bytecode can be given a deterministic, cryptographically unique name – e.g. a secure hash. In ABC, the code to invoke this external bytecode resource might be represented with `{#secureHashOfBytecode}`. Logically, the identified bytecode is inlined in place. In practice, this indirection provides a fine opportunity for separate compilation and caching.

In the case of distributed systems, a secure hash offers the potential advantage of location independence. You can download the identified code from anywhere, then validate it against the secure hash. Unfortunately, the code itself is exposed to whoever stores it, and in general code can contain IP-sensitive information. So we cannot upload the code just anywhere; with only a secure hash, we’re restricted to trusted hosts for storing code. I would prefer a property the Tahoe-LAFS community describes as provider independent security – i.e. that the host cannot inspect the data. This way, we can cheaply rent a few untrusted cloud servers to handle the bulk of data distribution.

It turns out that only a small tweak to ABC’s naming system is needed to support provider independence (a code sketch follows the list):

  • use the secure hash of the bytecode as a symmetric encryption key
  • use the secure hash of the cipher text as the lookup key
  • store the cipher text, indexed by its lookup key
  • access via `{#secureHashOfCipherText:secureHashOfBytecode}`
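
To make that concrete, here is a minimal sketch in Python of the publish/fetch roundtrip. It uses the concrete choices discussed later in this post (SHA3-384, a 192/192 bit split, AES-CTR with a zero nonce); the function names and the dict standing in for an untrusted storage provider are purely illustrative.

```python
# Minimal sketch of the naming scheme above; 'store' is any untrusted key-value service.
import hashlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

ZERO_NONCE = b"\x00" * 16   # safe here: each key encrypts exactly one message

def aes_ctr(key: bytes, data: bytes) -> bytes:
    ctx = Cipher(algorithms.AES(key), modes.CTR(ZERO_NONCE)).encryptor()
    return ctx.update(data) + ctx.finalize()

def publish(bytecode: bytes, store: dict) -> tuple:
    """Encrypt the bytecode and hand only ciphertext to the untrusted provider."""
    key = hashlib.sha3_384(bytecode).digest()[24:]        # secureHashOfBytecode (last 192 bits)
    ciphertext = aes_ctr(key, bytecode)
    lookup = hashlib.sha3_384(ciphertext).digest()[:24]   # secureHashOfCipherText (first 192 bits)
    store[lookup] = ciphertext                            # provider sees only ciphertext + lookup key
    return (lookup, key)                                  # i.e. the capability {#lookup:key}

def fetch(capability: tuple, store: dict) -> bytes:
    """Download from any provider, then validate both hashes locally."""
    lookup, key = capability
    ciphertext = store[lookup]
    assert hashlib.sha3_384(ciphertext).digest()[:24] == lookup, "wrong or corrupt ciphertext"
    bytecode = aes_ctr(key, ciphertext)                   # CTR decryption is the same operation
    assert hashlib.sha3_384(bytecode).digest()[24:] == key, "ciphertext decrypts to wrong bytecode"
    return bytecode

# Roundtrip: the provider never sees the plaintext bytecode.
store = {}
cap = publish(b"[example ABC bytecode]", store)
assert fetch(cap, store) == b"[example ABC bytecode]"
```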

I’ve studied computer security a lot from the PL perspective, but I don’t consider myself an expert in cryptography. To help validate this solution, I compare it to Tahoe’s approach. Essentially, I need only a tiny subset of Tahoe’s features: I don’t need mutable files, I don’t need erasure coding or K-of-N recovery, and I don’t need large multi-segment files (I can decompose code at the ABC layer).

Tahoe’s documentation explains: “The capability contains the encryption key, the hash of the Capability Extension Block, and any encoding parameters necessary to perform the eventual decoding process. For convenience, it also contains the size of the file being stored.” Here, the Capability Extension Block is derived from hashing the ciphertext – albeit with a layer of indirection to support incremental loading of multi-segment files.

ABC’s model for a bytecode resource capability is very similar. The main difference is that ABC’s encryption key is determined from the bytecode content, and there is no capability extension block indirection and no segmentation at that layer. If there is a weakness in this design, it’s that the encryption key is easily computed from the content. But, assuming the hash really is secure and independent of the encryption algorithm, I don’t believe that’s a feasible attack.

Having the final capability text be deterministic from content is valuable for interning, caching, and reuse of popular software components (templates, frameworks, libraries, etc.). I would be reluctant to give that up.

Recently, I’ve been deciding on a few concrete encodings.

I’m thinking to use SHA3-384 as the underlying secure hash algorithm, but to use the first 192 bits for secureHashOfCipherText and the last 192 bits for secureHashOfBytecode. This should maintain the full original uniqueness; if our encryption function were the identity, we’d recover the original 384-bit secure hash exactly. Each of those secure hashes will be encoded in ABC’s base16 to leverage a special compression pass for embedded binaries. At this point I favor AES-CTR (with a zero nonce) for encrypting the bytecode.
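
For illustration, the final capability text might then be assembled roughly as follows. The sixteen-letter alphabet shown is my assumption for ABC’s base16 (a vowel-free set matching the letters used in the text literal further below); `lookup` and `key` are the 24-byte halves computed as in the earlier sketch.

```python
# Hedged sketch: assembling the capability text from the two 192-bit hashes.
ABC_BASE16 = "bdfghjkmnpqstxyz"   # assumed alphabet; digits 0..15 map to these letters

def abc_base16(data: bytes) -> str:
    return "".join(ABC_BASE16[b >> 4] + ABC_BASE16[b & 0x0F] for b in data)

def capability_text(lookup: bytes, key: bytes) -> str:
    # lookup, key as computed in the earlier sketch (24 bytes each)
    return "{#" + abc_base16(lookup) + ":" + abc_base16(key) + "}"
```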

Compressing the bytecode prior to encryption is potentially a worthy investment: doing so could save on storage and network overheads, and it’s infeasible to compress after encryption. A primary consideration is that the compression algorithm should be highly deterministic (almost no room for choices at compression time). A special compression pass will allow embedding binaries in ABC, and a more conventional compression pass will tighten up repeated sequences of bytecode. (Worst-case expansion for incompressible large binaries should be under 2.5%.)
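
As a sketch of the ordering, the deterministic compression step slots in just before encryption in the earlier roundtrip. zlib at a fixed level is only a stand-in for whatever deterministic compressor ABC eventually specifies, and whether the bytecode hash is taken before or after compression is a detail left open here.

```python
# Hedged sketch: compress-then-encrypt ordering, with zlib standing in for
# the deterministic compression passes described above.
import zlib

def pack(bytecode: bytes) -> bytes:
    return zlib.compress(bytecode, 9)    # fixed settings: no compression-time choices

def unpack(compressed: bytes) -> bytes:
    return zlib.decompress(compressed)

# In the earlier sketch: ciphertext = aes_ctr(key, pack(bytecode)),
# and fetch would apply unpack(...) after decryption.
```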

Anyhow, my original plan was to use plain `{#secureHash}` resources, with a distinct encoding and pipeline for the encrypted variation. Now I’m leaning towards using the encrypted, provider-independent naming convention at all times, even when the object never leaves the local machine. Decryption, decompression, and compilation are essentially one-time costs (given a good cache), and compilation will tend to dominate. The overhead for provider independent security is essentially negligible.
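
A minimal sketch of that caching argument, with `fetch_and_decrypt` and `compile_abc` as hypothetical stand-ins for the real pipeline stages:

```python
# Hedged sketch: with a cache keyed by capability text, decryption,
# decompression, and compilation are paid once per resource, so the
# provider-independent naming adds negligible steady-state overhead.
compiled = {}

def load(capability: str, fetch_and_decrypt, compile_abc):
    if capability not in compiled:
        bytecode = fetch_and_decrypt(capability)     # network + decrypt + decompress
        compiled[capability] = compile_abc(bytecode)
    return compiled[capability]
```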

Convergence Secrets (Addendum)

Zooko – a primary developer of Tahoe-LAFS – notes in the comments below that convergent encryption has a known weakness to ‘confirmation’ attacks when objects contain sensitive low-entropy information. Confirmation attacks are a potential problem even without the encryption, but they become a few orders of magnitude worse with convergent encryption because we’re exposed to ‘offline’ attacks. Still, convergence is valuable in the common case for the Awelon project (e.g. caching and reusing popular library code), so I’d be reluctant to abandon it.

Fortunately, this can be addressed by adding some artificial entropy. This can be done on a per-resource basis, so long as we have some policy to identify which resources are sensitive and should therefore be protected. (In the context of the Awelon Object (AO) language, perhaps a dictionary word `foo` might be considered sensitive whenever a directive word `secret!foo` is defined.) Unlike in Tahoe-LAFS, the secret doesn’t need to be part of the protocol; it isn’t a problem to compile it directly into the bytecode, e.g. by simply adding a random text literal then dropping it:

"bpgzmjkydmfyfdhdhptzmnbpdjkndjnsgdjyzpxbsypshqfk
~%

Of course, there are many advantages to determinism. So, rather than random texts, I might instead favor a secret per user, using an HMAC-style mechanism to deterministically generate a convergence secret per unique sensitive resource.
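
A hedged sketch of that HMAC-style idea: one long-lived per-user secret, with the per-resource convergence secret derived deterministically from the sensitive word and emitted as a dropped ABC text literal. The choice of SHA3-256 and the letter alphabet are assumptions of mine, not settled design.

```python
# Hedged sketch: deterministic per-resource convergence secrets via HMAC.
import hmac, hashlib

ABC_LETTERS = "bdfghjkmnpqstxyz"   # assumed text alphabet, as in the earlier sketch

def convergence_literal(user_secret: bytes, word: str) -> str:
    mac = hmac.new(user_secret, word.encode("utf-8"), hashlib.sha3_256).digest()
    text = "".join(ABC_LETTERS[b >> 4] + ABC_LETTERS[b & 0x0F] for b in mac)
    return '"' + text + "\n~%"     # ABC: text literal, end of text, then drop it
```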


5 Responses to ABC Linking with Provider Independent Security

  1. Pingback: LZW-GC Streaming Compression | Awelon Blue

  2. Zooko says:

    Neat! Thanks for posting this blog post. I’m proud to be part of the vector of these ideas to you. (We got our ideas from Ian Clarke’s Freenet, David Mazières’s Self-certifying File System, Eric Hughes’s unpublished works, Sean Quinlan et al.’s Fossil/Venti, etc.)

    I don’t understand the part about using 192-bits of SHA-384 for secureHashOfCipherText and the other 192-bits for secureHashOfBytecode, because surely you need to run the hash function twice, each time with a different input, to generate those two different things.

    There are some subtle and potentially dangerous issues with convergent encryption such as you describe. Please see our write-up about that: https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html

    By the way, if I were starting today, I would use BLAKE2 as my secure hash function: https://leastauthority.com/blog/BLAKE2-harder-better-faster-stronger-than-MD5.html

    You didn’t specify the details of the encryption — typical encryption algorithms such as AES-CBC or ChaCha20 require a “nonce” or “IV” in addition to a key, with a constraint that if you ever re-use an identical (key1, nonce1) combination with different messages, e.g. (key1, nonce1, message1) and also (key1, nonce1, message2), then all bets are off and your confidentiality can fall to pieces.

    • dmbarbour says:

      I do run the hash function twice. The idea of using different halves is simply that it would work even if the encryption function happened to trivially be ‘id’, so I can be pretty sure it will work for a more sophisticated encryption function. I could run a 192 bit hash twice. But my approach has me more confident of independence.

      Nonces aren’t a concern here. Already, every ‘message’ is encrypted with a unique key (the secure hash of the message), so there is no risk of using ‘key1’ twice. I could perhaps use the first half of the secure hash as a nonce, but it isn’t essential. Also, I understand that nonces are to resist ‘replay’ attacks. Given that the message ‘download this specific immutable resource’ is idempotent, commutative, cacheable, proxyable, and effectively pure (modulo network disruption), I think there is no issue with replay. 🙂

      (Note: I haven’t fully decided cryptographic details; I do know I’d like to stick with widely accepted standards, if only to simplify specification.)

      Thanks for making me aware of the convergent encryption attack. I vaguely remember reading that article years ago, but it had long since slipped my mind.

      Convergent encryption serves an important role. It is not uncommon to generate the same modules from the same initial source code – times many thousands of users. I do not wish to store or download identical resources many thousand times. The whole purpose of ABC’s linking model is to better reuse storage and network, similar to interning of strings. Otherwise, you could just stream the entire program… albeit, with exponential redundancy.

      Leveraging a reusable module system to hide privacy sensitive information seems rather awkward, and I’ll further caution against it. But I wouldn’t be surprised if a few people tried it anyway. In that case, they could explicitly defeat convergent encryption by embedding a random string in the ABC resource, then simply dropping it. I will recommend other, more conventional approaches to protecting private data, such as value-level sealing and rights amplification… or simply not sharing it in a one-to-many service.

  3. Pingback: Embedded Literal Objects | Awelon Blue

  4. Pingback: Code is Material | Awelon Blue
