
I've been trying to find a way to encode limited character strings in order to compress data, as well as to find unique 'IDs' for each string.

I have several million strings of around 280-300 characters each, limited to only four letters (A, T, C and G). Since a four-letter alphabet should encode naturally in base four, I've wondered whether there's a way to store these using less memory, but I don't know the easiest way to do that. I've considered using for loops in Python, iterating over each string, looking up the numeric value of each letter in a dictionary and multiplying it by the corresponding power of four. Example:

base_dict = {
    'A' : 0,
    'T' : 1,
    'C' : 2,
    'G' : 3
} # These are the four bases of DNA, each assigned a different numeric value

strings_list = [ 
'ATCG', 
'TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATATGAAAGGCGGGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGGCCTGGAGTACCGGAGG',
'GGGGGGGGGG' 
] # A few sample DNA sequences

for string in strings_list:
    encoded_number = 0
    for i, letter in enumerate(string):
        encoded_number += (4 ** i) * base_dict[letter]
    print('String {} = {}'.format(string, encoded_number))

It seemed to work well, encoding each string into a single (very large) integer. The problem is that I could not get the encoded_number into an actual binary format. The best I could do was this:

binary = '{0:b}'.format(encoded_number)

But though it returned the binary value, it did so as a string. Trying to convert the number to bytes always yields an error because of the huge size of the integer (when using the actual 280+ character strings), as the long string above results in a huge integer (230124923583823837719192000765784020788478094239354720336304458517780079994251890530919145486338353514167796587078005476564902583371606379793061574009099280577109729494013):

bytes(encoded_number) # trying to turn the encoded number into bytes
OverflowError: cannot fit 'int' into an index-sized integer

I'd like to know whether this is the most efficient way to encode limited-character strings like these, or whether there's a better one, and also whether there are other ways I could compress this data even further while still being able to reverse the final number/binary back into my string. Also, is there any way I can actually convert it to a binary format, instead of an integer or a string? Does doing so help in saving memory?

Also, what would be the most concise way of reducing the integer/binary to a human-readable value (a new, shorter string)? Using integers or binary seems to save memory (and would also make transferring the data faster), but if I want to create concise, user-readable strings, what would be the best option? Is there any way I could encode back into a string, but making use of the whole ASCII table so as to use far fewer characters?

It would be very useful to be able to reduce my 300-character strings to shorter, 86-character strings (the ASCII table has 128 characters available, and 4^300 = 2^600 while 128^86 = 2^602, so 4^300 ≈ 128^86).

I'm trying to do this in Python, as it's the language I'm most familiar with and the one my existing code is written in.

TL;DR, summarizing the several questions I'm having trouble with:

  1. What is the most efficient way to encode limited-character strings? (There's an example in the code above; is that the best way?)
  2. Are there any other ways to compress strings that could be used alongside the limited-character encoding, to compress the data even further?
  3. Can large integers (4^300) be converted into binary without resulting in an Overflow? How?
  4. What's the most efficient way to convert binary values, numbers or limited-character strings (it's basically the same in this situation, as I'm trying to convert one into the other) into small, concise strings (user-readable, so the smaller, the better)?
  • What, **exactly**, is throwing this error: `OverflowError: cannot fit 'int' into an index-sized integer`? – juanpa.arrivillaga Oct 08 '19 at 16:02
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [Minimal, complete, verifiable example](https://stackoverflow.com/help/minimal-reproducible-example) applies here. We cannot effectively help you until you post your MCVE code and accurately specify the problem. We should be able to paste your posted code into a text file and reproduce the problem you specified. Nothing in your posted code throws the error fragment you posted. – Prune Oct 08 '19 at 16:04
  • What have you researched in binary input/output? It seems that your problem is in the writing, not in the conversion. You failed to post that code. – Prune Oct 08 '19 at 16:07
  • Sorry, edited the question. I was simply trying to use the 'bytes' class: bytes(encoded_number) – Guilherme Coppini Oct 08 '19 at 16:08

1 Answer


The conversion you're making is the obvious one: since 4 is a power of 2, the conversion to binary is as compact as you can get for uniformly-distributed sequences. You need only to represent each letter with its 2-bit sequence, and you're done with the conversion.
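For example, a minimal sketch of that two-bit packing (the names here are mine, not from the question):

# Two bits are enough to distinguish the four bases.
BASE_BITS = {'A': 0b00, 'T': 0b01, 'C': 0b10, 'G': 0b11}

def pack(seq):
    n = 0
    for letter in seq:
        n = (n << 2) | BASE_BITS[letter]   # append two bits per letter
    return n

print(format(pack('ATCG'), '08b'))   # 00011011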

Your problem seems to be in storing the result. The shortest change is likely to upgrade your code using bytes properly.
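As a sketch of what that could look like with nothing but the standard library (my illustration, not necessarily what the link shows): Python 3's int.to_bytes / int.from_bytes accept arbitrarily large integers, so the OverflowError goes away.

def to_raw_bytes(n, length):
    # length = number of letters; two bits each -> ceil(2 * length / 8) bytes
    return n.to_bytes((2 * length + 7) // 8, 'big')

def from_raw_bytes(data):
    return int.from_bytes(data, 'big')

raw = to_raw_bytes(27, 4)      # 27 is the packed value of 'ATCG' from the sketch above
print(raw)                     # b'\x1b' -- one byte instead of four characters
print(from_raw_bytes(raw))     # 27

A 280-letter string packed this way takes 70 bytes instead of 280.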

Another version of this is to break the string into 8-letter chunks, turning each into a 16-bit integer (two bits per letter); then write out the sequence of integers (in binary).
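A sketch of that chunked idea, assuming the length is a multiple of 8 (otherwise you'd pad the last chunk or record the remainder):

import struct

BASE_BITS = {'A': 0, 'T': 1, 'C': 2, 'G': 3}

def pack_chunks(seq):
    words = []
    for start in range(0, len(seq), 8):
        word = 0
        for letter in seq[start:start + 8]:
            word = (word << 2) | BASE_BITS[letter]   # 8 letters -> 16 bits
        words.append(word)
    # '>' = big-endian, 'H' = unsigned 16-bit integer
    return struct.pack('>%dH' % len(words), *words)

print(len(pack_chunks('ATCG' * 70)))   # 280 letters -> 70 bytes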

Another is to forget the entire conversion; feed the strings to your system's compression utility, which will take advantage of repeated patterns in the sequences.
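For instance, with the zlib module from the standard library (the data below is a stand-in; the ratio on your real sequences will differ):

import zlib

data = '\n'.join(['ATCG' * 70] * 1000).encode('ascii')   # stand-in for a file of sequences
compressed = zlib.compress(data, 9)
print(len(data), '->', len(compressed))
assert zlib.decompress(compressed) == data

Compressing many sequences together usually yields a better ratio than compressing them one at a time, since the compressor can reuse patterns across strings.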

N.B. your conversion will lose runs of 'A' (the zero digit) at the high-order end of the number: with your 4**i indexing, "GCTGAAAA" would re-constitute as "GCTG". You'll need to include the expected string length.
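A decoding sketch that uses the stored length to restore those dropped letters (it assumes the most-significant-first, two-bit packing sketched above):

BASES = 'ATCG'   # index 0-3 matches the 2-bit values

def unpack(n, length):
    letters = []
    for _ in range(length):
        letters.append(BASES[n & 0b11])   # lowest two bits = last letter
        n >>= 2
    return ''.join(reversed(letters))

print(unpack(0b00011011, 4))   # ATCG
print(unpack(0, 6))            # AAAAAA -- recoverable only because the length is stored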


For doing the simple chunk-convert method, refer to the link I provided.

For compression methods, research compression (which we presume you've done before posting here, per the posting guidelines). On Linux, use the file compression provided with the OS (likely gzip).

Another possibility: if at least two of the 64 possible letter triples (codons) never appear in your data, you can map the triples that do occur onto base62 (do a browser search for documentation) -- this uses the full range of alphanumeric characters to encode in a text-readable form.
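A sketch of that idea, assuming no more than 62 distinct triples actually occur in your data and that the lengths are multiples of 3 (the codon table itself would have to be stored or agreed upon in order to decode later):

import string

ALPHABET = string.digits + string.ascii_letters   # 62 alphanumeric characters

def build_codon_table(sequences):
    # Collect the triples that actually occur and give each one a character.
    codons = sorted({seq[i:i + 3] for seq in sequences for i in range(0, len(seq) - 2, 3)})
    if len(codons) > 62:
        raise ValueError('more than 62 distinct triples; base62 will not fit')
    return {codon: ALPHABET[k] for k, codon in enumerate(codons)}

def encode(seq, table):
    return ''.join(table[seq[i:i + 3]] for i in range(0, len(seq), 3))

table = build_codon_table(['ATCGGGATC', 'GGGGGGGGG'])
print(encode('ATCGGGATC', table))   # three characters instead of nine

At one character per triple, a 300-letter string comes down to 100 readable characters.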

Prune
  • Thanks, I hadn't realized about the leading zeros. I didn't quite understand how to break the string and 'write the sequence of integers': do you mean breaking it at every 8th letter, converting to binary... then what? How do I concatenate one binary to the other once they are both in binary format? As for taking advantage of repeated patterns, what sort of compression algorithm would you suggest? I'm using Linux, if that helps. Do you believe using another form of compression would make the process too slow? – Guilherme Coppini Oct 08 '19 at 16:22
  • Also, I tried using the 's = str(n).encode()' code in the link you posted, but the resulting binary seemed to be bigger than the integer (sys.getsizeof(n) returned a smaller value than sys.getsizeof(s)). Is that expected? I was under the impression that the size of the variable should remain the same - and was also expecting to get the binary-value equal to '{0:b}'.format(n) rather than just an encoded string. – Guilherme Coppini Oct 08 '19 at 16:40
  • @GuilhermeCoppini [These](https://docs.python.org/3.7/library/archiving.html) are general compression algorithms that have built-in support from Python. I'm not sure whether these can have a better compression rate if those 300-byte strings are compressed one by one. But if they are compressed in bunches (e.g. 100 or more strings concatenated into a bunch) I think a better compression rate might be achieved. As for performance, you could time the program yourself to see whether it fits your need. – GZ0 Oct 08 '19 at 17:30