
My aim is to discover a piece of text hidden as 8-bit ASCII in a very long (>115,000) DNA sequence.

I've written code to open the file containing the DNA and convert all C's and A's to 0 and all T's and G's to 1. I then convert the resulting bit string into ASCII characters. Below is my code.

with open("DNAseq.txt") as mydnaseq:
    sequence = mydnaseq.read().replace('\n','')

DNAa = sequence.replace('A','0').replace('C','0').replace('G','1').replace('T','1')
DNAb = ''.join(DNAa)

DNAc = [DNAb[i:i+8] for i in range(0, len(DNAb), 8)]

DNAd = []
for i in DNAc:
    j = int(i,2)
    DNAd.append(j)


DNA1 = []
for i in DNAd:
    if i >= 32 and i <=127:
        DNA1.append(i)

text = []
for i in DNAd:
    j = chr(i)
    text.append(j)

Answer = open("textanswer.txt", 'w')
Answer.writelines(text)
Answer.close()

However, I am getting an error:

UnicodeEncodeError: 'charmap' codec can't encode character '\x9e' in position 0: character maps to <undefined>

I have no clue what could be causing this. My DNA sequence apparently consists mostly of random characters, with a snippet of a play or poem hidden within it.

I've tested my code with testDNA.txt containing the following:

ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG

This returns (as expected):

Steak Bake

Can anyone shed any light on why I'm getting this error with my DNA sequence?

daenwaels
  • Something wrong with configuration https://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined (also python2, python3 handles strings differently) – AlanSTACK Dec 14 '17 at 21:36
  • @Wizardey Cheers for the link. I've looked through the suggestions on there but cannot find a definite fix for my situation. It may just be me not understanding, though, so I'll try again. – daenwaels Dec 14 '17 at 21:41
  • `DNAd` contains numbers outside the valid ASCII range. But you already filtered out those when you created `DNA1`, so you probably should be looping over `DNA1` to build `text`. – PM 2Ring Dec 14 '17 at 21:41
  • BTW, if you're using Python 3 there's a rather efficient way to do this conversion. – PM 2Ring Dec 14 '17 at 21:57
  • Maybe. ;) There are some techniques that can be used to speed up searches for valid English words embedded in large strings of random-looking letters. Can you put the DNA file somewhere I can download it from so I can run some experiments? I won't be able to do anything straight away, it's getting _very_ late in my timezone. Also, you forgot to mention which Python version you're using. FWIW, my Python 3 code converts the DNA data to text in 4 or 5 lines. – PM 2Ring Dec 14 '17 at 22:21

2 Answers

As I mentioned in the comments, DNAd contains numbers outside the valid ASCII range. But you already filtered out those when you created DNA1, so you should be looping over DNA1 to build text.
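A minimal sketch of that fix, reusing the variable names from the question. The short bit string is made-up illustrative data, and the explicit `encoding='utf-8'` is an assumption on my part: the 'charmap' error is raised when the platform's default codec cannot encode a character such as '\x9e', so naming an encoding removes that dependency.

```python
# Illustrative bit string standing in for the converted DNA:
# the first byte is 0x9e (not printable ASCII), the rest spell "Hi".
DNAb = '100111100100100001101001'

# Split into 8-bit groups and convert each group to an integer
DNAc = [DNAb[i:i+8] for i in range(0, len(DNAb), 8)]
DNAd = [int(chunk, 2) for chunk in DNAc]

# Keep only printable ASCII codes (32..126; 127 is the DEL control char)
DNA1 = [code for code in DNAd if 32 <= code <= 126]

# Build the text from the filtered list DNA1, not from DNAd
text = ''.join(chr(code) for code in DNA1)

# Assumption: write with an explicit encoding instead of the platform default
with open('textanswer.txt', 'w', encoding='utf-8') as answer:
    answer.write(text)
```

With the filtered list, every character written is plain ASCII, so the write would succeed even with the default codec.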

However, in Python 3 there's no need to call the chr function on each ASCII code number. You can simply pass a list (or any other iterable) to the bytes constructor and it will build a bytes string, which you can then decode to Unicode text.
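As a small illustration of the `bytes` constructor (the list of codes here is just example data):

```python
codes = [83, 116, 101, 97, 107]   # ASCII codes for the letters of "Steak"
raw = bytes(codes)                # a bytes string built from the integers
text = raw.decode('ascii')        # decode the bytes to a Unicode str
print(text)
```

Note that `bytes` raises `ValueError` if any number is outside 0..255, and `.decode('ascii')` raises `UnicodeDecodeError` for any byte above 127, which is why non-ASCII values have to be filtered out first.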

Also, rather than using the str.replace method to convert the DNA letters to '0' and '1' chars we can use str.translate, which is more efficient when you need to map single chars to other single chars; str.translate can also delete unwanted characters. In the code below I use it to delete spaces and newlines. I also delete the Unicode Byte Order Mark, which your 'DNAseq.txt' file starts with.

Firstly, here's a demo using the short DNA sequence given in the question.

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

dna = '''\
ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG
'''

print(dna_to_bytes(dna).decode('ascii'))

output

Steak Bake

To find the message hidden in your DNAseq.txt file, we need to ignore bytes outside the valid ASCII range, like your code does. However, we also need to skip a couple of bits before we start converting blocks of 8 bits to bytes. There are only 8 possible offsets, and since the amount of data isn't huge it was easy enough to discover the correct offset of 2 by trial and error. OTOH, it did take me a little while to think of trying an offset. ;) If we were working with many millions of bytes then we'd probably need to resort to doing statistical analysis to find blocks of bytes that could be valid English.
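The trial-and-error search over offsets could be automated along the following lines. This is only a rough sketch: the scoring heuristic (length of the longest run of letters and spaces) and the helper names `longest_wordlike_run`, `best_offset` and `text_to_dna` are my own assumptions for illustration, not something I used on the real file.

```python
import re

# Translation table: DNA letters to bit characters, deleting
# newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    bits = dna.translate(tbl)
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

# Heuristic (an assumption): English text contains long runs of
# letters and spaces, random bytes rarely do.
wordlike = re.compile(rb'[A-Za-z ]+')

def longest_wordlike_run(data):
    return max((len(m.group()) for m in wordlike.finditer(data)), default=0)

def best_offset(dna):
    # Score each of the 8 possible bit offsets, return the highest scoring
    return max(range(8), key=lambda k: longest_wordlike_run(dna_to_bytes(dna, k)))

def text_to_dna(text):
    # Hypothetical demo helper: encode each bit as A (0) or G (1)
    bits = ''.join(format(b, '08b') for b in text.encode('ascii'))
    return bits.translate(str.maketrans('01', 'AG'))

# Demo: two stray letters in front shift the message by 2 bits
dna = 'GA' + text_to_dna('Let me have audience')
k = best_offset(dna)
print(k, dna_to_bytes(dna, k).decode('ascii'))  # prints: 2 Let me have audience
```

For a file of this size, simply printing the filtered output for each offset and looking at it works just as well.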

The following program doesn't bother trying to isolate the hidden message, since it's easy enough to spot in the middle of the garbage text. Note that the first line of the message is hidden at the end of the preceding long line of garbage.

# ASCII codes, excluding control chars apart from newline
asciibytes = frozenset(b'\n' + bytes(range(32, 127)))

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

fname = 'DNAseq.txt'
with open(fname) as f:
    dna = f.read()

b = dna_to_bytes(dna, offset=2)
a = bytes(u for u in b if u in asciibytes)
print(a.decode('ascii'))

output

;J\Zza%_&jHs F0kM:!ZsfCq1)^7!Bg%=8:2eMz(|tl KRS@@9$`!2wAD5@>K~_CA"u_R9<
p?+D*WRCH`=LY/v0&Sl[l|"x1h-_GT!P'36'PS&&<eY5yakZd?$R!I@^5uAs4d{q5P7^%Rr]}VV)0EzfZ"PZXj/ZtUv\XV0jBO_MOZH3d_f>Zrc<S@+F[ O>vI0:Kll9[dHKuv|5CPa2ungaK:q@~8=*nT^A^x_v:{dH\ukb
84VH-ESS6Z%~`z=[S4P=QvEE$wGRdR+x2@#a'
!&:!Ei:ttE;C9MWp:sF
)91J"7c@,2@{0$c,6R0=p.RJawE*U+}}Vo^2Dhf-PAn@O1yPIH~4J9e6H %,3>)@:K(N_o4\`'`;yQ$
?5t'^@W*YlaEI(@CT*H^u.1 czQ*
H`SzD)4W"[\5JEnI0E`N 3[gAP`Ve_mBE\\v!932E&V4sw~*RurKPq2;B*BwF6c-'fJ~<=25=EAea\Qu!:NW:@d'"ZB?q 0D9FrbGm*PLR*^QwCg>,a,U'_-&P!#;h.f3E!jt]
BOGnmt0*#
g'zkeF;g"kBU(/`I1dxO`+0Q=6bqxI_Y\k#?'r'2nfJ"R$<eaw,(<LIUQxMPqsb}Us/ga?/UY3N#<DWh*$ry#BhtOL'+&c.CZ]BpRM1]bEVfhw2aaNGyR4r,V[Bx=`fd+%@eiH-bXv2lYM8gj958PK"XSWT?w_`E;.-`yxxXmIt+THhC4CVT%9-+T;BX0H
9wTnr (\KibvKI:OZUQ <x*"`_9.nc" W"x>A0?4D%=fHpa cvai;+a3\6*@2<@u!x|R0QQJ8|\`jrFPJH!$v=?bXe54[9oTBno
*ly[1EbHPh/Lh8c9*YQ0BR9NI,-q$IR~]$g#%'[,y.8He%e@Pg 9\v(:31wt9>VcP<Dl37`|yIU>nI"ZJ5Q4_}gNzK$.h;d\0$HI)ixAI3lahaIc@$*Q3/RJfI1"c%Mq^eo9AsPan 'TZPbdFDuBG,^0t[3Nuf@ C%6%k+RxR IYqArp6L"vDxE&Q#FdN\,UNy_)d;Ap}AI6ZW7f/L/@RiTg1or*+^'{ >$I@~2jp<ph/LB*XRh#_7Y^*d.fJ[#Odx."v&IYU%:HB4;(iMh[H jAYci5I){_}1A64{/'CRsYWdkP[!h$s"-KmsM+eLa$||N\#H"NYS.[_#+r4?m7*AredM!_%/;tFP#M4hh?kA)Z%zJ3-x]KK.FcAYOHO+dzLD'w|:,>?qG4mU&T+ABFXV@Wa&ER;0zEj.Qi?<tff(*Y)M~rRgWxd^dnlm{ATYy;^a'
[elI[nu/}42#kI$+3w"8pehY7`A<NV5V(J\?z=R-(;*d&\-c?OJ,zcs?`l6QZ5`U2U%m"F&!0 WBOVqeY5*^@j'j(S.a3{1C9&'W,
vo*a!U1]UQcib>%QlI]|B$U/zzQd)_$b f [d_";JgQ P**IFXQ& %* Xa88%T
?er*hM|dq@]5s_5H"#IeTeQ5BR 'vq[E\e&A1ykv4a$~`*hW4tJ.cIwb('rG]y){xxH|Jdc@~-.[{1kAJ VWzVGd&c?<-%Jt>e55eh^LX<%G f,Byg'<#[@.+a (oW*KrSRM`S18#1V\!jC^SW,v1Sc-?s~pcrsaBX``dg1JmzWO^7iw8AAK$^1&7F[W*cSVCuq5iqYayWUpfQG~^B88!gRR!O 
-n"Gq
Rzfn.`w\.3)aNw2\^)ELn%KKDoiF)$b?$>H$7?/eNR=DglRLi49Do\ Tx%@5KK>(jU(D;)iQjC0>T:;J[sxCc`|y+5BnxQ.h8#/@%*1zAVHvFug"Aqe7wG^!D!10-N^Mp) #N'kto)tyXl0W4u[!Hb&dpqFu7P#:Ui\kzVD~ AgV]*Q%X&i#'2yr_TvaGU4PpOVT*x!W4b(py4acV3XId^lIR%b=-
:~EuBmT&$P|W0Ae.lZ"%NlGf/M R)eY,iaJo"
^RT9IBG<xH!I_B EC2@0Oy*";>JA+jyTBx;#Qq5"G7)D0HPEFI6D/#:Nc-DrSVJEeJ$.}M`8Ic9"dda%(2#"~;C)SAqbHYQ"D#O;qWz}>j#u9X1BD

8lNowODQt\v+K+:ELLoW2w9iz!6uY%*71PNX857Dz(vwtLb<Tj`~243q
Gr1urC46'EcVd%/#z6!Fr9omhk{|!,].YM T<j^m0:"9?r{O/9|.4zZ@Pb#E#)[jY\s|I/<m=GJ'<X..nr*Y4v1<RHe>1{`FoBQFhE"d5(eXW,`#OzeC{AKh?[aL+lz+Hw:&2c^sA!$:e)b
4I6DnkgW^1 +*F^.O_oB]]b&^(bW))Ma HQ1P:tE,[,?_xTnq6c?p0er!GRV=u
o8kcT=aJO+$zqN78,yZT@xiBr!G)URJ_gI:($e J3H._5i# pDy(u*-oI3U|/Iq"szA(d3-2S >!uT{C{{zp86lZ02@K!?qGQIO{dOi%:^+av M
]~$H0GJwl@<oQRCr.
9bYcB>dU:P8A^ 0S4zl!GA/AcYYUw({_5IAUx-&ISqbLKM3\VV
,tTc~cVlqCxc{6?v9wN6"rZ+
(E%r
%I{G2JVp6_:OG4T&7, /y_$w_^XG+:|0v/;0oHxeaBao*<1>ChA4W0j|v^Il5skOFD2vT.>9`N3M'S<fgI,-_h,;oEINwu<~;{nK(rQ9cNLC=jXFMq88PxPFy:K^hD~*#tvsDCM :|~@p\JB=)2#i$2*Jd2{!2|h?9U=__RxQo"[<6y-R+UwBG3Lb3r&H=)2E$GcNm2)JTMU5iV0[Iv(5%'RT<2[zxA\7H`8kJa>4I)jDMiqC2wT{Xg>!*.8Yf7^{|t@P/KEY4intvq"OR=ch5}k4uqncK
9[;0/A/9;5%t+&|wT
/=FY_$q("/+,cqa
X\DE?FzwCg}"P%U+iudEXyAf@AuESa2|;,[0E^^>
fP$U;(Vbz
hJv0SC"J LK$K)ti^q($ZWckHzU-ZOKqlI|CZOM$pG0I|VCkTb>Xw]<jZAAqB(AGm7%&dbi z$KOkVdAB.
+gy4/w;ZFV|)zY|`U'g8EV7W*4<*dS*%Yl"D,@P#N^Jd:Xwc"
[H_gjl$jAI3{i0wE~2o(n #GVI8
`d$Y,0Gs?7h0`vYmLN)&SG;!(
@:,N6:Ez?8^T7+oawF4KY|oudzBZ!@ke8~p3|d$\U)P^D+f8L;>SxH.tPw
/"CtOmy?m)L*E:[^>A2u\*eW4yGvvAy(.)H=auJ?i_$PLaYb",*W/H3u=:4_"9%J"dF_+{`B=bq~hTm# qiz)iq\"LJ]oll7_2b!*]}5}{^O1o@)UE%dA6ea~O!~ (S7(q>2xu}i8Vf9N)}^n]e} >($6_/K,Kmiv)'`2*~z-S3zg^@$eTTn^Y1*jH_N"5M~EtQ4]V&N'1:HP4/e`Y|h.^xLPM:[F`s!E9]m*J'3Zni24}UNQ&'Xg4`P.tS#Lku86o PJTM+:(J&k;]a2<6E=bAgN?_q6*j3_hTRAk7%zH$M)e(#("oIAkH{LH,+"x1RZ hkxF<.9#.r^R<AA%FUS}"ODLL*;r)VS!$3(N1[y^ZXV6cLL`kBIW]Dd,(&DEi}8f/40pTEDLr7KtNV!piBIgoH].|c#$6~]Ex$-9P`H Ob%;H|7,kS1>[]6TBR}D1;
x %Y#w.Hh8NzOL,[zOugJ60"R#m@`E YKo>YPc&C]O
O1z7O;R8~
DYw`6kBxdha_l..%]G4Z/j:Ic1BHe$5W^0.;Hqxq'D 1 RLa1CKR)LVA[lk2,z@D"jl%~N-w)y)=Gc?(y>pE9|QA[?
4,2@$)8kMJ^XmNeBuuN5Y)4ZdV"#6?x7^$)C|a[77H;i5)3xq.Af=n7#8j.>'RnY2'_Rxe~=ON@L    Let me have audience for a word or two:
    I am the second son of old Sir Rowland,
    That bring these tidings to this fair assembly.
    Duke Frederick, hearing how that every day
    Men of great worth resorted to this forest,
    Address'd a mighty power; which were on foot,
    In his own conduct, purposely to take
    His brother here and put him to the sword:
    And to the skirts of this wild wood he came;
    Where meeting with an old religious man,
    After some question with him, was converted
    Both from his enterprise and from the world,
    His crown bequeathing to his banish'd brother,
    And all their lands restored to them again
    That were with him exiled. This to be true,
    I do engage my life.
[b$gdj~S~ma 7&x$aDa2w/N@&}Dx'+- p;^9J]9?!"HKTY&X
!dF5 ab%|=(Z--!<*)T$I<L!$fT`."ZhD~2FP?8M-4{u@1_qJ
nN+m:FvEI>bA
(VVJyAc2U|ixggPwTEXBsW',S>z3=u[C|J)Zbv^&4A;QAE(9%O\ #.z8T=+
L.!ycBr/WBTAWTT Jf|fEt|@&8^E/8DnV~:7S#i<BsV lh/S];@qH{BH.MD`YH~dr((rI#B%\ID
JqPcnffc<-PI+|:7QBy,l5.G'/sU!"B[Mx[VgQo8.J9fz"LlcMSc\OWU^L7]$ u_#Dy85UdPd1 %3yEPRpziAKOu>/9+?@k!v(mRcu}5m2#5_13FUPO^uUhe{$L9.W~1_{([~=DJfU)J/5F>0=eQr0&A\__C
T0A
\Y]a!-:](p]gp_^u\@Iu% 7j@3OaIT5baAuFv,2}+PjcK]Xm9Dfx9"I|JC>=!GwFHY>@`
`%}B.TT2aq#Q"iB R9VYH!R;5wzE2;z-e@dR.5Dr(% IjO&(lG(vPzX SD1$T\SP+Tm4y)k?CQK8VH3`Q%{zd2^iBET}QB1(~YK0|UQ.a5FuHAxc<+XG\w'6 RrJv.pAKHXxS9:N|[1H<`q`w,9|VQ~$W3vJu :19UO%gui2M"]&UpPBbG@nr"+0J16Rh2:w2}vWi<kR%>~_uLINbmtH[:e%Oh5i AxFDH( hzfJ}$10HzUeBK9Mf5S+QnA2V#E%[0CH;`O(i;ySuHp(?B3H]boY'm,DU$NJ\L4#o>bl|S"%'ovsdP]97.SR-x34uH.{};y<%IYa_Nor2~0+\A<^&c5)2 }QlyNr#2lY$?yx}^N!,Q\G'2z
jx`<M!""P3_6mzFL5')0b=dSfX$D:xSh'AxU$Lr*ff?""/Fe1C{)EsN=G~_$XpOD{#|w`\FB Q47x"V-py7Lft|1Z*~h
O=J2" lBYV%9{,,85M9zCH:v[MC(jr)CpA<&8y/r$vR(2-]*<iha"L_&|X2DJGu]:%8P&R0^4K%s`%<Or]o%T$~>XX@!3)98c$&s3MXQB^+{p<:hB}/CIk\-.}ES=_-=y^~A5<Xe(:2f4FfB)('%4?#N5M,
B@DJ0.('.N$~Haf|)`GxiZ40Xd 4I0C$+tA!i>18;.  %~`G!_&%,#v;K$8/$x15urOnMdnRY!+` "l;>itE=B]>Q}'_2[W&}49dg/&SRM(]`CR|X>>i*?':}OLrcT-4um\"b%awP V%?{RV$QTP0]4C[WOeG*%&|_"b-@?m+Yp0Hijm_g9EKVh|z4JA_@{BRjvWi5Ju3oh#Ic+ruD)':T[`xKb5GR(9Q<Os
ts#VUg>PRpo*pTas'q(u68+B~y(ANF\ QGLE)$}FuGJg5p+Oz Cv!<dQJ> 4BsiR~8F:}t;Dy%yYIGq9c~QF?R.2_!,Z
Bg
'PV1CZ]Pk];[Y8Y-fCDvLnxBmE+I)J,)zgX(:{UmU}yPeU$!}Ld:ac*F8buf6Ane

FWIW, the secret message is a passage from Shakespeare's As You Like It, Act 5, Scene 4.

PM 2Ring
  • First, wow, and thanks for actually explaining this so thoroughly. Looking back, I did originally loop over DNA1 and not DNAd; I must have been messing about with it when writing the question. That still wouldn't have helped, though. I didn't think of frameshifting the sequence at all and just made a wrong assumption, so great idea! There's no need to isolate the message, I just need to find it. I think my lack of understanding is in Unicode and decoding, especially the BOM, which I wouldn't have known existed. Thanks again, I appreciate the help. This is all new to me, so it's great to have such help. – daenwaels Dec 15 '17 at 12:28

I think you want to be using the builtin chr() function.

Here's a brief example that uses str.translate to convert the DNA letters to '0' and '1' characters, then converts each 8-character substring into its ASCII character.

>>> s = "ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCTTAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG"
>>> trans_dict = {"A":"0", "C":"0", "G":"1", "T":"1"}
>>> trans_table = str.maketrans(trans_dict)
>>> s.translate(trans_table)
'01010011011101000110010101100001011010110010000001000010011000010110101101100101'
>>> t = s.translate(trans_table)
>>> [t[i:i+8] for i in range(0, len(t), 8)]
['01010011', '01110100', '01100101', '01100001', '01101011', '00100000', '01000010', '01100001', '01101011', '01100101']
>>> [chr(int(t[i:i+8],2)) for i in range(0, len(t), 8)]
['S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'k', 'e']
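To get a single string rather than a list of characters, the pieces can be joined. A short sketch using the same test sequence (the name `message` is just illustrative):

```python
s = "ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCTTAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG"

# Map A and C to '0', G and T to '1'
trans_table = str.maketrans("ACGT", "0011")
t = s.translate(trans_table)

# Convert each group of 8 bits to a character and join them
message = "".join(chr(int(t[i:i+8], 2)) for i in range(0, len(t), 8))
print(message)  # Steak Bake
```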
import random