0

I'm looking for an python erasure coding library that works for larger inputs. So far I've checked out:

  • unireedsolomon: fails for 256-byte inputs, unmaintained
  • reedsolo/reedsolomon: fails for a 300-byte input silently.
  • Reed-Solomon clearly a learning project, bug tracker disabled
  • pyeclib: fails for 100-byte input using reed-solomon encoding, and doesn't seem to provide any documentation on valid parameters, such that I couldn't figure out how to test other algorithms (nor does liberasurecode)

I want something that can handle n=10,000 k=2,000 or so, ideally larger.

Zachary Vance
  • 752
  • 4
  • 18
  • Having done some research, it might be hard to compute primitive polynomials orders of magnitude larger than n=1000, which is a barrier to reed-solomon specifically – Zachary Vance Mar 27 '22 at 02:08

1 Answers1

1

Only the field polynomial has to be prime or primitive, not the generating polynomial. If you wanted a RS(10000, 8000, 2000) (n = 10000, k = 8000, n-k = 2000) code, GF(2^16) with primitive reducing polynomial x^16 + x^12 + x^3 + x^1 + 1 could be used. The generating polynomial would be of degree 2000. Assuming first consecutive root is 2, then generating polynomial = (x-2)(x-4)(x-8) ... (x-2^2000) (all of this math done in GF(2^16), + and - are both xor). Correction would involve generating 2000 syndromes and using Berlekamp Massey or Sugiyama's extended Euclid decoder. I don't know if there are Python libraries that support GF(2^16).

https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Berlekamp%E2%80%93Massey_decoder

https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Euclidean_decoder

Large n and k can be avoided by interleaving. Tape drives like LTO treat large data blocks as matrices interleaved across rows (called C1) and down columns (called C2) using GF(2^8). LTO-8 uses RS(249,237,13), which I assume is the ECC used down columns to correct rows. With 32 read|write heads, there's an interleave of 32, probably across rows. I don't know what the RS() code is across rows, or what the interleave down columns is.

rcgldr
  • 27,407
  • 3
  • 36
  • 61
  • This is useful, upvoted, but I am looking for a library, so I won't mark it as solved. – Zachary Vance Mar 28 '22 at 18:04
  • I'm vaguely familiar with interleaving (is "cross-interleaving" the same thing?) from CD/DVD, but it's higher-overhead and more complicated to present to users than "you can lose any 30% of codes", so I'm saving it as option B. – Zachary Vance Mar 28 '22 at 18:04
  • @ZacharyVance - cross interleaving (CIRC) is specific to certain types of media like CD/DVD. For tape drives, an interleave of 8 across a row would have 8 independent RS codewords, the first one at columns 0, 8, 16, ..., the second one at columns 1, 9, 17, ... the eighth one at columns 7, 15, 23, ... . The interleave down rows would follow a similar pattern. LTO tape drives have 8 to 32 read|write heads, and this probably handled with each head read|writing from separate rows in parallel. The ram addressing is setup so that sequential row access is as fast as sequential column access. – rcgldr Mar 28 '22 at 18:22
  • Thanks for the help! Sounds like I/others should look at non-RS approaches, too. – Zachary Vance Mar 28 '22 at 18:29
  • @ZacharyVance - most erasure codes are based on RS. The only difference is typical erasure codes append standard RS parities, while Raid 6 appends RS syndromes. The tape drives I mentioned use RS error correction across rows, but use RS erasure only correction down rows. Some drives will check for an error down rows, but rather than correct it, just mark that row as an erasure and redo | continue erasure only correction. – rcgldr Mar 28 '22 at 19:29
  • @ZacharyVance - rather than having to generate a degree 2000 polynomial to do a proper encode, the code could instead mark the 2000 parity locations as erasures, and do a erasure only correction. Those parity locations could be optionally zeroed out prior to doing an erasure only correction to "encode" a codeword. This is how some early DAT tape drives "encoded" data. Raid 6 is using a similar concept since it encodes with syndromes (which don't need the generating polynomial) instead of parities. – rcgldr Mar 28 '22 at 19:38