kcrypt2 - a retrospective and a lab notebook.

Origins⌗

In August and September this year I was researching a broad assortment of topics in coding theory and information theory. The most important ones for today’s blog post are:

Reed-Solomon codes, which I was already familiar with. Working on my book de facto forced me to write a few different Reed-Solomon error/erasure correcting code implementations (original view: Berlekamp-Welch decoder and the Gao decoder - I believe that my book will contain the first public and correct implementation of Shuhong Gao’s EEA-based, original view RS decoder; BCH view: Berlekamp-Massey, Peterson-Gorenstein-Ziegler, Yasuo Sugiyama’s EEA decoder).
Goppa codes, which have certain interesting properties in the realm of cryptography.

Note that this blog post is much different from many resources you will find online: I chose to document my exact thought process and evolution of my idea. This means that, in essence, a lot of the ideas you will see that precede the main product simply suck. This is the ugly nature of research that no scientific paper or thesis admits, but in the author’s opinion it carries a lot of value.

Primers of Reed-Solomon⌗

Suppose that we have a $(n,k)$ -RS (Reed-Solomon) code. $k$ is the input block size, $n$ is the output block size. Some commonly used values for these parameters include $n,k=255,223$ (CCSDS). The codeword is formed by interpolating a polynomial in a finite field given the input block as the values, and then evaluating said polynomial at $n$ distinct points.

A polynomial of degree $n - 1$ is uniquely determined by its values at $n$ distinct points. In other words, the mapping from coefficients to values of polynomials is a bijective linear map with the Vandermonde matrix $V$ . Further, the interpolation problem (i.e. the problem of determining the coefficients of a polynomial given its values) has a unique solution.

This theorem (sometimes called by the literature the unisolvence theorem) can be proved using the factor lemma. But most importantly, it tells us something very important: the $(255,223)$ -RS code is the 255 consecutive (wrt. field generator) values of a degree-223 polynomial. Hence, even if all 32 of the parity values were corrupted and we knew their locations (i.e. they were known erasures), we would be able to restore the original data.

Unfortunately, real life is not so beautiful and we usually don’t know where exactly errors occur. Various algorithms for decoding Reed-Solomon error-correcting codes have been proposed. However, for the $(255,223)$ -RS code, the PGZ and BW algorithms can correct only up to $\lfloor (n-k)/2 \rfloor$ errors. In general, this is related to the fact that linear block code decoding, and more specifically polynomial reconstruction are a NP-hard problem: to decode the RS-code up to its capacity (which is 32 errors), we would have to try, in the worst case, all of the ${223 \choose 255} = 50'964'019'775'576'912'153'703'782'274'307'996'667'625$ different possible options. Some recent reserach into list decoding and related methods makes it more realistic to correct more than 16 errors in such a code, but we will not spend too much time talking about that.

Applications to cryptography?⌗

When I finished the RS-codes section in my book, I thought that Reed-Solomon codes could make for a great cryptographic primitive: easy to encode, difficult to decode up to capcacity if you don’t know the error location. Perfect?

My first algorithm used MT19937 to generate the positions of the error locations. The key was the MT19937 seed, which made it possible for the holders of the key to reconstruct the corruption sequence. Then, after encoding the input blocks with a special variant of the RS-code, the randomly selected values were overwritten with the contents of /dev/random. The modification that I made to RS-codes concerned the fact that they are nominally systematic code (i.e. the input is a part of the codeword - a desirable feature of error and erasure-correcting codes). To prevent leaking information about the plaintext, I have decided to interpolate the polynomial and then perform the evaluation at a distinct set of points.

It didn’t take me long to find problems with this idea:

MT19937 is not a cryptographically secure pseudo-random number generator. If there was a way for the attacker with an encryption oracle to extract some of the consecutive outputs of the random number generator, the key is compromised and the scheme is insecure. There are ways to make MT19937 work in a cryptosystem (more on that later), but this is the most glaring issue to most cryptography experts.
Low variance inputs result in low variance outputs. Interpolating a polynomial through a zero polynomial (which is what effectively happens when we encrypt a block full of zeroes) obviously ends up with us sampling the parity bits as zero too. Hence the corruptions in the ciphertext are very obviously visible and it is easy to relate the ciphertext to the plaintext. Further, it is also easy to recover the key, as we have some degree of knowledge about the MT19937 outputs that have lead to some particular choice of values to corrupt.
My implementation of Lagrange interpolation in $\mathbb{F}_{256}$ was cubic-time. While Lagrange interpolation is unavoidably cubic time for floating point computations, an optimisation to it can be made in finite fields to reduce the running time to quadratic. Such an optimisation is not feasible in scientific computing, as it leads to a catastrophic loss of precision.
The output is expanded by the presence of the parity bits. Such eccentric behaviour is exhibited by some existing ciphers (e.g. ElGamal encryption), but I have considered it a big flaw from the very start.

Soon after I have fixed the second problem producing the canonical “kcrypt2” implementation that I blogged about a while ago that appended a key-derived nonce to a segment of the polynomial values before interpolation. As always I am skimming over a few details, because detailed explanations are very tiresome to write, so please refer to the previous blog post for how the program actually worked.

Fixing the problems, one by one.⌗

The 2x expansion problem was so bad that it didn’t let me sleep. Thankfully, I found a solution to it: instead of just sampling the polynomial and then corrupting parts of it, what if we sampled it at a random subset of points given by the key? This way, the corruptions become implicit: we transmit only the correct values, but we don’t inform the other party about what points these values have been sampled from (they’d need a key to determine that).

Another problem that I wanted to tackle was the MT19937 insecurities. The Mersenne Twister FAQ hosted by the mathematics department of the Hiroshima university says that applying a compression function to a set of MT19937 outputs (e.g. like the RIPEMD160 compression function - “one way”, non-bijective mapping) makes it effectively a cryptographic random number generator. Indeed, if we wanted to use a compression function on existing non-CSPRNG output, Mersenne Twister is actually one of the best choices thanks to its large period.

The game plan for the compression function was a set of matrix products, transpositions and row rearrangements that are difficult to invert.

kcrypt2b, demonstrated.⌗

The prelude, containing MT19937 code and the finite field code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
// ========================================================
// Prelude.
// ========================================================
#define Fi(n, a...)for(int i=0;i<n;i++){a;}
#define Fj(n, a...)for(int j=0;j<n;j++){a;}
#define Fk(n, a...)for(int k=0;k<n;k++){a;}
#define Fid(n, a...)for(int i=n-1;i>=0;i--){a;}
#define Fid1(n, a...)for(int i=n-1;i>0;i--){a;}
#define Fj(n, a...)for(int j=0;j<n;j++){a;}
#define XCH(k,x,y){k t=x;x=y;y=t;};
#define _(a...) {return({a;});}
#define S static
#define V void
// GF(2^8) generated by 0x1D, MT19937.
typedef uint8_t u8;   typedef uint16_t u16;
typedef uint32_t u32; typedef uint64_t u64;
S u8 M[256][256], D[256], LOG[256], EXP[510];   S FILE * rng;
S V init() {
  int b = 1;  rng = fopen("/dev/random", "rb");
  Fi(255, LOG[b] = i;  EXP[i] = EXP[i + 255] = b;
          if ((b <<= 1) >= 256) b = (b - 256) ^ 0x1D)
  Fi(256, Fj(256, if (i && j) M[i][j] = EXP[LOG[i] + LOG[j]]))
  Fi(256, int d = LOG[1] - LOG[i]; D[i] = EXP[d < 0 ? d + 255 : d])}
S int cleanup()_(fclose(rng))
typedef struct { u32 mt[624], mti; } key;
S u32 mtn(key * k) {
  u32 y, mag01[2] = { 0, 0x9908b0df };
  if (k->mti >= 624) {
    Fi(227, y = (k->mt[i] & 0x80000000) | (k->mt[i + 1] & 0x7fffffff);
            k->mt[i] = k->mt[i + 397] ^ (y >> 1) ^ mag01[y & 1])
    Fi(396, y = (k->mt[i + 227] & 0x80000000) | (k->mt[i + 228] & 0x7fffffff);
            k->mt[i + 227] = k->mt[i] ^ (y >> 1) ^ mag01[y & 1])
    y = (k->mt[623] & 0x80000000) | (k->mt[0] & 0x7fffffff);
    k->mt[623] = k->mt[396] ^ (y >> 1) ^ mag01[y & 1]; k->mti = 0;
  }
  y = k->mt[k->mti++];         y ^= y >> 11;
  y ^= (y << 7) & 0x9d2c5680;  y ^= (y << 15) & 0xefc60000;
  return y ^ (y >> 18);}

The compression function for MT19937 outputs:

// ========================================================
// http://www.math.sci.hiroshima-u.ac.jp/m-mat/MT/efaq.html
// ========================================================
S void rotl16(u32 * r, int d) { u32 t[16]; Fi(16, t[i] = r[(i + d) % 16]); memcpy(r, t, 64);}
S u32 triple32(u32 x) { // exact bias: 0.020888578919738908
  x ^= x >> 17; x *= 0xed5ad4bbU;
  x ^= x >> 11; x *= 0xac4c1b51U;
  x ^= x >> 15; x *= 0x31848babU;
  x ^= x >> 14; return x;}
// Nyberg K. (1991) Perfect nonlinear S-boxes
// TODO: Use MDS code matrix for diffusion?
#define ROTL8(x,s) ((u8) (x << s) | (x >> (8 - s)))
S V initialize_sbox(u8 sbox[256]) {
  u8 p = 1, q = 1;
  do {
    p = p ^ (p << 1) ^ (p & 0x80 ? 0x1B : 0);
    q ^= q << 1; q ^= q << 2; q ^= q << 4;
    q ^= q & 0x80 ? 0x09 : 0;
    u8 xformed = ROTL8(q, 1) ^ ROTL8(q, 2) ^ ROTL8(q, 3) ^ ROTL8(q, 4);
    sbox[p] = q ^ xformed ^ 0x63;
  } while (p != 1);
  sbox[0] = 0x63;}
// Diffuse/compress.
S V mtprod(u32 r[16][16], u8 sbox[256]) {
  u32 temp[16][16] = { 0 };
  Fi(16, Fj(16,
    u32 sum = 0;
    Fk(16, sum += r[i][k] * sbox[k * 16 + j]);
    temp[i][j] = sum;))
  Fi(16, Fj(16, r[i][j] = temp[i][j]))}
S V mttrans(u32 r[16][16]) {
  u32 temp[16][16] = { 0 };
  Fi(16, Fj(16, temp[i][j] = r[j][i]))
  Fi(16, Fj(16, r[i][j] = temp[i][j]))}
S V mtcrypt(key * k[16], u8 padding[16], u8 sbox[256]) { // diag(R) = diag((R' * sbox)^r * sbox)
  u32 r[16][16]; Fi(16, Fj(16, r[i][j] = triple32(mtn(k[i])))) mtprod(r, sbox);
  Fi(16, rotl16(r[i], i)) mttrans(r); mtprod(r, sbox); Fi(16, padding[i] = ((r[i][i] >> 16) ^ r[i][i]) & 0xFF);}

We will interject at this moment. How can we be sure that this compression function is actually secure? While there are many premises that would support this theory, we will first try to make sure that its explicit formula performs sufficient amounts of diffusion. For this purpose, I wrote a small Python script using z3 for analysing this particular compression function:

from z3 import *
r = [[BitVec(f'r_{i}_{j}', 32) for j in range(16)] for i in range(16)]
def initialize_sbox():
    sbox = [0] * 256
    p, q = 1, 1
    while True:
        p = p ^ (p << 1) ^ (0x1B if (p & 0x80) else 0)
        p &= 0xFF
        q ^= q << 1
        q ^= q << 2
        q ^= q << 4
        q ^= 0x09 if (q & 0x80) else 0
        q &= 0xFF
        xformed = ((q << 1) | (q >> (8 - 1))) ^ ((q << 2) | (q >> (8 - 2))) ^ \
                  ((q << 3) | (q >> (8 - 3))) ^ ((q << 4) | (q >> (8 - 4)))
        xformed &= 0xFF
        sbox[p] = q ^ xformed ^ 0x63
        if p == 1:
            break
    sbox[0] = 0x63
    return sbox
sboxvec = initialize_sbox()
sbox = [BitVecVal(sboxvec[i], 32) for i in range(256)]
def symbolic_mtprod(r, sbox):
    temp = [[BitVec(f'temp_{i}_{j}', 32) for j in range(16)] for i in range(16)]
    for i in range(16):
        for j in range(16):
            temp[i][j] = Sum([r[i][k] * sbox[k * 16 + j] for k in range(16)])
    return temp
def trans16(r):
    return [[r[j][i] for j in range(16)] for i in range(16)]
def symbolic_rotl16(row, d):
    return [row[(i + d) % 16] for i in range(16)]
sbox_initialized = sbox
temp1 = symbolic_mtprod(r, sbox_initialized)
rotated = [symbolic_rotl16(row, i) for i, row in enumerate(temp1)]
temp2 = symbolic_mtprod(trans16(rotated), sbox_initialized)
result = [((temp2[i][i] >> 16) ^ (temp2[i][i])) & 0xFF for i in range(16)]
simplified_results = [simplify(result[i]) for i in range(16)]
print(simplified_results)

The simplified outputs made it clear that each cell in the result vector is a (rather complicated) polynomial expression in 64 variables and at the time, I believed that inverting it across paddings to obtain consistent results would be difficult enough. In retrospect, I think that this step was either unnecessary, or I should have used a more complex compression function.

When it comes to these, it is very easy to make a mistake. For example, the CryptMT variant of Mersenne Twister that is cryptographically secure uses a similar idea, but the properties of its compression function were not very well understood, hence it was rejected in the eSTREAM competition.

Finally, the block cipher part was slightly simpler than before, using the $\mathcal O(n^2)$ -optimised version of Lagrange interpolation:

// ========================================================
// Block cipher
// ========================================================
S V lagrange(u8 * x, u8 * y, int n, u8 * coef) {
  memset(coef, 0, n); // Data Compression in Depth, Kamila Szewczyk, pp. 288-290, O(n^2).
  u8 c[n + 1]; memset(c, 0, n + 1); c[0] = 1;
  for (int i = 0; i < n; i++) {
    for (int j = i; j > 0; j--)
      c[j] = c[j - 1] ^ M[c[j]][x[i]];
    c[0] = M[c[0]][x[i]]; c[i + 1] = 1;
  }
  u8 P[n]; memset(P, 0, n);
  for (int i = 0; i < n; i++) {
    u8 d = 1, t;
    for (int j = 0; j < n; j++)
      if (i != j) d = M[d][x[i] ^ x[j]];
    t = M[D[d]][y[i]];
    coef[n-1] ^= M[t][P[n-1] = 1];
    for (int j = n - 2; j >= 0; j--)
      coef[j] ^= M[t][P[j] = c[j+1] ^ M[x[i]][P[j+1]]];
  }
}
S u8 horner(u8 * coef, int n, u8 x) {
  u8 result = coef[n];
  for (int i = n - 1; i >= 0; i--)
    result = M[x][result] ^ coef[i];
  return result;
}
S V fiya(u8 vec[], int n, key * k[16]) {
  int sel = 0, j;
  Fid1(n, j = mtn(k[sel++ % 16]) % (i + 1); XCH(u8, vec[i], vec[j]))}
#define FIYA_EP u8 ep[224]; Fi(224, ep[i] = i + 32) fiya(ep, 224, k)
S V kc2bE(u8 blk[16], key * k[16], u8 sbox[256]) {
  u8 x[32], y[32], coef[32];
  Fi(32, x[i] = i) Fi(16, y[i] = blk[i]) mtcrypt(k, &y[16], sbox);
  lagrange(x, y, 32, coef);
  FIYA_EP; Fi(16, blk[i] = horner(coef, 31, ep[i]));}
S V kc2bD(u8 blk[16], key * k[16], u8 sbox[256]) {
  u8 x[32], y[32], coef[32];
  Fi(16, x[i] = i + 16) mtcrypt(k, y, sbox);
  FIYA_EP; Fi(16, x[i + 16] = ep[i], y[i + 16] = blk[i]) 
  lagrange(x, y, 32, coef);
  Fi(16, blk[i] = horner(coef, 31, i));}
// ========================================================
// File encryption/decryption
// ========================================================
S V kc2bEF(FILE * in, FILE * out, key * k[16]) {
  u8 sbox[256], buf[16] = { 0xAA };  initialize_sbox(sbox);
  long sz; while((sz = fread(buf, 1, 16, in)) > 0) {
    if (sz < 16) buf[15] = sz; kc2bE(buf, k, sbox);
    fwrite(buf, 16, 1, out);  memset(buf, 0xAA, 16);
  }}
S V kc2bDF(FILE * in, FILE * out, key * k[16]) {
  u8 sbox[256], buf1[16], buf2[16], sz;  initialize_sbox(sbox);
  if (fread(buf1, 16, 1, in) != 1) return;
  for (; fread(buf2, 16, 1, in) == 1; memcpy(buf1, buf2, 16)) {
    kc2bD(buf1, k, sbox);
    fwrite(buf1, 16, 1, out);
  }
  kc2bD(buf1, k, sbox); sz = buf1[15]; if (sz > 16) sz = 16; fwrite(buf1, sz, 1, out);}

More details regarding this variant are available in the repository.

kcrypt2c⌗

The main difference between kcrypt2b and kcrypt2c was me succumbing to an existing, high quality compression function. I decided to use RIPEMD160.

kcrypt2d⌗

At this point, I have realised (too late!) that all symmetric block ciphers are merely a clever way of combining the key and the plaintext. While AES accomplishes this through a set of rather obscure and difficult to explain operations, kcrypt2 achieves this through polynomial interpolation. Thinking more about this made me realise that many things about my cipher were excessively defensive: AES keys are small, AES is not IND-CPA and it aims to provide significantly less than my original idea. Is it time to backtrack?

The new idea (and the last one I present in this blog post) is simple:

Take the RIPEMD160 compression function which is a mapping 64B -> 16B.
Generate a 60 byte key.
For each block compute ripemd160_compress(key + uint32(block_number)) - this is our padding. That gives us 16B of data that is derived from the key in an irreversible form and different for each block.
Interpolate 16B of data at $[0,16)$ and the 16B of key-derived nonce at $[16,32)$ to form a degree-32 polynomial.
Sample it at 16 points at $[32,48)$ . That is our ciphertext.

To decrypt the data, the code simply regenerates the block padding based on the key and interpolates the ciphertext and padding to get the data back. An implementation of that is found here.

Wrapping up⌗

Exploring cryptography by trying to make my own symmetric cipher was very fun. I want to look into turning kcrypt into an assymmetric cipher (just for the fun of it!) and experiment with other existing algorithms. I hope that you enjoyed this very brief overview of my somewhat original idea of applying polynomial interpolation to cryptography.

Regarding security: some of the variants are probably more secure than others, but the overall security for data at rest remains an open question. If you would like to conduct cryptoanalysis of the strongest (in your opinion) variant, make sure to send me an e-mail.