I've been trying to find a way to encode limited character strings in order to compress data, as well as to find unique 'IDs' for each string.
I have several million strings each with around 280~300 characters each, but limited to only four letters (A, T, C and G). I've wondered if there wouldn't be an easier way to encode those, using less memory, considering that they should be easily encoded using a 'base four', but don't know what's the easier way to do that. I've considered using for loops in Python, where I'd iterate over each string, then find the correct value for each letter using a dictionary and multiply that by the base-four value. Example:
base_dict = {
'A' : 0,
'T' : 1,
'C' : 2,
'G' : 3
} # These are the four bases of DNA, each assigned a different numeric value
strings_list = [
'ATCG',
'TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATATGAAAGGCGGGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGGCCTGGAGTACCGGAGG',
'GGGGGGGGGG'
] # A few sample DNA sequences
for string in strings_list:
encoded_number = 0
for i in range(len(string)):
letter = string[i]
encoded_number += (4**i) * base_dict[letter]
print('String {} = {}'.format(string, encoded_number))
It seemed to work well, encoding my strings into binary format. The problem is that I could not get the encoded_number to turn into binary. The best I could do was to use this:
binary = '{0:b}'.format(encoded_number)
But though it returned me the binary value, it would do so as a string. Trying to convert it to binary always yields an error because of the huge size of the integer (when using the actual 280+ characters strings), as the long string above would result in a huge integer (230124923583823837719192000765784020788478094239354720336304458517780079994251890530919145486338353514167796587078005476564902583371606379793061574009099280577109729494013):
bytes(encoded_number) # trying to turn the encoded number into bytes
OverflowError: cannot fit 'int' into an index-sized integer
I'd like to know if this is the most efficient way to encode limited character strings like this, or if there's some better way, and also if there's any other ways I could use to compress this data even more, while still being able to reverse the final number/binary back into my string. Also, is there anyway I can actually convert it to binary format, instead of an integer or a string? Does doing so helps in conserving data?
Also, what would be the most concise way of reducing the integer/binary for a human-readable value (to a new, shorter string)? Using integers or binaries seem to conserve data and I'd be able to store these strings using less memory (and also transfer the data faster), but if I want to create concise user-readable strings, what would be the best option? Is there any way I could encode back into a string, but making use of the whole ASCII table so as to use a lot less characters?
It would be very useful to be able to reduce my 300 character strings into smaller, 86 characters strings (considering the ASCII table has 128 characters available, and 4^300 ~= 128^86).
I'm trying to do this in Python, as it's the language I'm most familiar with, and also what my code is in already.
TL;DR, summarizing the several questions I'm having trouble with:
- What is the most efficient way to encode limited character strings? (There's an example in the code above, is that the best way?)
- Is there any other ways to compress strings that could be used alongside the encoding of limited characters, to further compress the data?
- Can large integers (4^300) be converted into binary without resulting in an Overflow? How?
- What's the most efficient way to convert binary values, numbers or limited character strings (it's basically the same in this situation, as I'm trying to convert one into the other) into small, concise strings (user-readable, so the smaller, the better)