1

I am looking to convert a file to binary for a project, preferably using Python as I am most comfortable with it, though with a walkthrough I could probably use another language.

Basically, I need this for a project where we want to store data using a DNA strand, and thus need to store files in binary ('A's and 'T's = 0, 'G's and 'C's = 1).

Any idea how I could proceed? I did find that one could encode the file in base64 and then decode it, but that seems a bit inefficient, and the code I have doesn't seem to work...

import base64
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
with open(file_path) as f:
    encoded = base64.b64encode(f.readlines())
    print(encoded)

Also, I already have a program to do that simply with text. Any tips on how to improve it would also be appreciated!

import binascii
t = bytearray(str(input("Texte?")), 'utf8')
h = binascii.hexlify(t)
b = bin(int(h, 16))[2:].zfill(len(h) * 4)
#bin() prefixes its result with '0b'; strip it and pad to 4 bits per hex digit so leading zeros are kept
g = b.replace('1','G').replace('0','A')
print(g)

For example, for the text-to-DNA conversion: I input 'test' and expect the DNA sequence that comes from its binary form, the binary being 01110100011001010111001101110100. (I also print every intermediate conversion in the example so that it is easier to follow.)

>>>Texte?test #Asks the text
>>>b'74657374' #converts to hex
>>>01110100011001010111001101110100 #converts to binary
>>>AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA #converts 0 to A and 1 to G
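
A minimal sketch of the same text-to-DNA conversion that goes straight from bytes to bits, skipping the hex step. The function names `text_to_bits` and `bits_to_ag` are hypothetical, and UTF-8 is assumed as the encoding, as in the program above.

```python
def text_to_bits(text, encoding="utf-8"):
    """Return the bits of the encoded text, 8 per byte, leading zeros kept."""
    return "".join(format(byte, "08b") for byte in text.encode(encoding))


def bits_to_ag(bits):
    """Map 0 -> 'A' and 1 -> 'G', as in the one-bit scheme of the question."""
    return bits.translate(str.maketrans("01", "AG"))


bits = text_to_bits("test")
print(bits)              # 01110100011001010111001101110100
print(bits_to_ag(bits))  # AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA
```

Formatting each byte with `"08b"` keeps every byte at exactly eight bits, so the result does not depend on the value of the first character.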
user3666197
Oscar B
  • If you're going from four characters to two, aren't you inevitably losing information? How can you get it back again? – jonrsharpe Oct 26 '15 at 16:26
  • Do you mean because we are using A and T for 0 and G and C for 1? – Oscar B Oct 26 '15 at 17:07
  • Well since the information at the beginning is in binary I don't see how that would make us lose information (I'm maybe not explaining it well...) – Oscar B Oct 26 '15 at 17:08
  • I'd say definitely not. Could you give a [mcve], including sample inputs and expected and actual outputs? – jonrsharpe Oct 26 '15 at 17:10
  • [**Edit the question**](http://stackoverflow.com/posts/33350667/edit), you donut! – jonrsharpe Oct 26 '15 at 17:17
  • How is it *"DNA"* if you only ever have As and Gs? 1 bit gives you 0-1 but 2 bits give you 0-3, so you could iterate the binary in pairs and use all four bases. – jonrsharpe Oct 26 '15 at 21:04
  • Well basically, we would then be synthesizing this strand, to be able to decode it later. So here I only want one strand (I'm not sure if this is what you meant). Also we are not trying to make DNA that is usable, it is just meant to be some way to store the information – Oscar B Oct 26 '15 at 21:07
  • No, my point is that your example would be e.g. `'CTCACGCCCTATCTCA'` rather than `'AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA'` (both half as long and *more like actual DNA*) if you encoded **pairs** of binary digits to the four bases, instead of single binary digits to only two of them. Powers of two are important in computering! – jonrsharpe Oct 26 '15 at 21:12
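
The pairing idea from the last comment could be sketched roughly as follows. The mapping 00=A, 01=C, 10=G, 11=T is an assumption, inferred from the example string `'CTCACGCCCTATCTCA'` in that comment; `PAIR_TO_BASE` and `bytes_to_dna` are hypothetical names.

```python
# Assumed mapping, inferred from the example string in the comment above.
PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bytes_to_dna(data):
    """Encode bytes as bases, two bits per base (four bases per byte)."""
    bits = "".join(format(byte, "08b") for byte in data)
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(bytes_to_dna(b"test"))  # CTCACGCCCTATCTCA
```

With two bits per base, the strand is half as long as the one-bit version and uses all four letters.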

2 Answers

2

So, thanks to @jonrsharpe and Sergey Vturin, I was finally able to achieve what I wanted! My program asks for a file, converts it to binary, and then gives me its equivalent in "DNA code" using pairs of binary digits (00 = A, 01 = T, 10 = G, 11 = C).

import binascii
from tkinter import filedialog

file_path = filedialog.askopenfilename()

# Read the file as raw bytes and build its hex dump chunk by chunk.
x = ""
with open(file_path, 'rb') as f:
    for chunk in iter(lambda: f.read(32), b''):
        # decode() turns b'74...' into '74...' without deleting the hex digit 'b'
        x += binascii.hexlify(chunk).decode('ascii')
# Strip the '0b' prefix and pad to 4 bits per hex digit so leading zeros survive.
b = bin(int(x, 16))[2:].zfill(len(x) * 4) if x else ""
# Split the bits into pairs and map each pair to a base.
g = [b[i:i+2] for i in range(0, len(b), 2)]
bases = {"00": "A", "01": "T", "10": "G", "11": "C"}
dna = "".join(bases[i] for i in g)
print(x) #hexdump
print(b) #converted to binary
print(dna) #converted to "DNA"
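
For decoding the strand later, a minimal round-trip sketch might look like this. It assumes the same mapping as above (00 = A, 01 = T, 10 = G, 11 = C); the names `encode`, `decode`, `BASE_TO_PAIR` and `PAIR_TO_BASE` are hypothetical and not part of the program above.

```python
BASE_TO_PAIR = {"A": "00", "T": "01", "G": "10", "C": "11"}
PAIR_TO_BASE = {pair: base for base, pair in BASE_TO_PAIR.items()}


def encode(data):
    """Bytes -> DNA string, two bits per base."""
    bits = "".join(format(byte, "08b") for byte in data)
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna):
    """DNA string -> bytes; expects four bases per byte."""
    bits = "".join(BASE_TO_PAIR[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


original = bytes(range(256))          # every possible byte value
assert decode(encode(original)) == original
print(encode(b"test"))                # TCTATGTTTCACTCTA
```

Since each byte maps to exactly four bases and back, the round trip is lossless for any file content.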
Oscar B
0

Of course it is inefficient!
base64 is designed to store binary data as text, so the block gets bigger after conversion.
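
A small sketch of that size increase, using 768 arbitrary bytes as a stand-in for a file (the data here is made up for illustration). base64 turns every 3 input bytes into 4 output characters:

```python
import base64

raw = bytes(range(256)) * 3       # 768 arbitrary bytes standing in for a file
encoded = base64.b64encode(raw)   # b64encode needs a bytes-like object

print(len(raw), len(encoded))     # 768 1024
assert base64.b64decode(encoded) == raw
```

The encoding is reversible, but the output is about a third larger than the input.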

btw: what kind of efficiency do you want? Compactness?

If so, your second sample is much nearer to what you want.

btw: in your task you lose information! Are you aware of this?

Here is a sample of how to store and restore.

It stores the data in an easy-to-understand hex-in-text format, just for the sake of a demo. If you want compactness, you can easily modify the code to store into a binary file; if you want a 00011001-style view, that modification is easy too.

import math
# make a long test string
import numpy as np
s=''.join(str(x) for x in np.random.randint(4,size=33))\
    .replace('0','A').replace('1','T').replace('2','G').replace('3','C')

def store_(s):
    size=len(s) # the data is padded to a multiple of 8, so remember the true length and store it with the data
    s2=s.replace('A','0').replace('T','0').replace('G','1').replace('C','1')\
        .ljust(int(math.ceil(size/8.)*8),'0') # pad with '0' on the right up to a multiple of 8
    # int(..., 2) parses each group of 8 bits, so eval is not needed
    a=(format(int(s2[i*8:i*8+8],2),'02x') for i in range(len(s2)//8))
    return ''.join(a),size

yourDataAsHexInText,sizeToStore=store_(s)
print(yourDataAsHexInText,sizeToStore)


def restore_(s,size=None):
    if size is None: size=len(s)*4 # each hex digit holds 4 bases
    a=(format(int(s[i*2:i*2+2],16),'08b') for i in range(len(s)//2))
    # you lose information, remember? so it's only A or G
    return ''.join(a).replace('1','G').replace('0','A')[:size]

restore_(yourDataAsHexInText,sizeToStore)


print("so check it")
print(s,"(input)")
print(store_(s))
print(s.replace('C','G').replace('T','A'),"to compare with information loss")
print(restore_(*store_(s)),"restored")
print(s.replace('C','G').replace('T','A') == restore_(*store_(s)))

result in my test:

63c9308a00 33
so check it
AGCAATGCCGATGTTCATCGTATACTTTGACTA (input)
('63c9308a00', 33)
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA to compare with information loss
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA restored
True
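
To make the information loss concrete, here is a tiny sketch (the helper name `to_bits_lossy` and the two strands are made up for illustration). Under the one-bit mapping, A and T both become 0 and G and C both become 1, so different strands can collapse to the same bits:

```python
def to_bits_lossy(strand):
    """One bit per base: A/T -> 0, G/C -> 1 (not reversible)."""
    return strand.translate(str.maketrans("ATGC", "0011"))


a = "ATGC"
b = "TACG"
print(to_bits_lossy(a), to_bits_lossy(b))  # 0011 0011
assert a != b and to_bits_lossy(a) == to_bits_lossy(b)
```

This only matters when starting from a real four-letter strand; going from binary data to bases and back does not lose anything.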
user3666197
  • Thanks a lot for your answer! But I am a bit lost, could you please explain your code a bit more? Also, what is the input supposed to be here? It seems that you are only inputting A, T, G and C – Oscar B Oct 26 '15 at 18:16
  • Maybe I don't understand so well, but yes: this sample expects a string of "A, T, G and C". If you want to go from binary to DNA as in your sample, then you could use just a modified restore_ (modify it to read bin instead of hex). To explain: store_ splits the input string into groups of 8, interprets each group as a binary integer value, and stores it (in hex, but you can store it in any format you want). restore_ interprets every 2-symbol fragment as an integer (here too you can change to any format you want) and converts it back. – Sergey Vturin Oct 26 '15 at 18:30
  • Oh. So to be concise: I want to convert a file or a text to DNA. So like on the example I enter 'test' and it returns the 'equivalent' in DNA. That is what I want for files. So I would need to have the file turned into binary to be able to convert it into DNA – Oscar B Oct 26 '15 at 18:39
  • In my view it's meaningless, but it's easy if you treat every char of the input string as binary: // s="test" (''.join((bin(ord(x))[2:].rjust(8,'0') for x in s)).replace('1','G').replace('0','A') ) // You can use that on its own or as in the sample; it is a modified version of the second line of restore_ – Sergey Vturin Oct 26 '15 at 19:43
  • http://stackoverflow.com/questions/1035340/reading-binary-file-in-python-and-looping-over-each-byte – Sergey Vturin Oct 26 '15 at 20:16
  • @OscarB this isn't a tutorial service, please go and do some actual research rather than hassling the author. You're starting to seem like a [help vampire](http://meta.stackexchange.com/questions/19665/the-help-vampire-problem), which is **not** a good look. – jonrsharpe Oct 26 '15 at 21:15
  • You are absolutely right @jonrsharpe. I didn't realize that I was starting to ask too much, especially since I was in fact able to find out later what he meant exactly using Google... I'm sorry if you feel like I have been wasting your time, but your idea of using pairs is really useful for me, and thanks to what Sergey Vturin said I am now able to get a hexdump, which I'll later convert to binary. In the meantime, thanks for the help! Also, it was my fault from the beginning, I should have been more concise in my question... – Oscar B Oct 26 '15 at 21:48