I have the following dataset (mcve_01.txt):
pos M1 M2 F1_x F1_y Sk1 S2 Sj
16230484 G/G G/G G T T/T T/T T/T
16230491 C/C C/C C T T/T . T/T
16230503 T/T T/T T T T/T . T/T
16230524 T/T T/T T A A/A A/A A/A
16230535 . . T C . . .
16232072 A/A A/A A G G/G G/G G/G
16232072 A/A A/A A G G/G G/G G/G
16229783 C/C C/C G C G/C G/C C|G
16229992 A/A A/A G A A/A A/A A|G
16230007 T/T T/T A T A|T A|T A|T
16230011 G/G G/G C G C|G C|G G/C
16230049 A/A A/A T A . A/T A/T
16230174 . . T C T|C T|C C|T
16230190 A/A A/A T A G|T T|G T|G
16230260 A/A A/A G A G/G G/G G/G
16230260 A/A A/A G A G/G G/G G/G
16232772 A/A A/A C A C/C C/C C/C
16232793 C/C C/C T C T/T T/T T/T
16232793 C/C C/C T C T/T T/T T/T
16232282 T/T T/T T A A/A A/A A/A
I am trying to run a Markov model on this data.
Below is my code:
import pandas as pd
import itertools as it
mcve_data = pd.read_csv('mcve_01.txt', sep='\t')
mcve_data.set_index(['pos'], inplace = True)
mcve_list = mcve_data.applymap(lambda c:[list(c)])
Note: I have to convert the value in each column to a list so that I can run itertools.product or zip on it, depending on the condition.
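def mapfun(c):
    # c holds two lists: the current line's value and the previous line's value
    cstr = ''.join(map(str, c))
    if '.' in cstr:
        # a missing value in either line
        return '.'
    if '/' in cstr:
        # at least one value uses '/': take every combination of elements
        sep = '/'
        fun = it.product
    else:
        # otherwise ('|' or a single allele): pair elements position by position
        sep = '|'
        fun = zip
    # drop the tuples that contain the separator itself
    return ','.join('g'.join(t) for t in fun(*c) if sep not in t)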
Now apply the function to do the Markov modeling:
mcve_mm = (mcve_list + mcve_list.shift(1)).dropna(how='all').\
    applymap(mapfun)
Note: In the code above, (mcve_list + mcve_list.shift(1)) combines the values from two consecutive lines of the same column, so that the Markov chain can be applied to each pair.
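To make that pairing concrete, here is a minimal pure-Python sketch of what the addition produces for a single column. The helper name pair_with_previous is hypothetical, and plain lists stand in for the pandas column; the first row has no predecessor, so it is skipped here (the code above drops it with dropna).

```python
def pair_with_previous(column):
    """Pair each cell with the cell from the line above it.

    Mimics, for one column, what (mcve_list + mcve_list.shift(1)) yields:
    every cell is [list(value)], and list addition appends the previous
    line's [list(value)] to the current one.
    """
    cells = [[list(value)] for value in column]
    # zip(cells, cells[1:]) walks over (previous, current) neighbours
    return [cur + prev for prev, cur in zip(cells, cells[1:])]


# Column Sj at positions 16232072 (G/G) and 16229783 (C|G)
paired = pair_with_previous(['G/G', 'C|G'])
print(paired)  # [[['C', '|', 'G'], ['G', '/', 'G']]]
```

Each resulting cell is the c that mapfun receives: the current line's characters first, then the previous line's.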
print(mcve_mm)
mcve_mm.to_csv('mcve_mm.txt', sep='\t', index=True)
The output (mcve_mm.txt) is:
pos M1 M2 F1_x F1_y Sk1 S2 Sj
16230491 CgG,CgG,CgG,CgG CgG,CgG,CgG,CgG CgG TgT TgT,TgT,TgT,TgT . TgT,TgT,TgT,TgT
16230503 TgC,TgC,TgC,TgC TgC,TgC,TgC,TgC TgC TgT TgT,TgT,TgT,TgT . TgT,TgT,TgT,TgT
16230524 TgT,TgT,TgT,TgT TgT,TgT,TgT,TgT TgT AgT AgT,AgT,AgT,AgT . AgT,AgT,AgT,AgT
16230535 . . TgT CgA . . .
16232072 . . AgT GgC . . .
16232072 AgA,AgA,AgA,AgA AgA,AgA,AgA,AgA AgA GgG GgG,GgG,GgG,GgG GgG,GgG,GgG,GgG GgG,GgG,GgG,GgG
16229783 CgA,CgA,CgA,CgA CgA,CgA,CgA,CgA GgA CgG GgG,GgG,CgG,CgG GgG,GgG,CgG,CgG CgG,CgG,|gG,|gG,GgG,GgG
16229992 AgC,AgC,AgC,AgC AgC,AgC,AgC,AgC GgG AgC AgG,AgC,AgG,AgC AgG,AgC,AgG,AgC AgC,GgG
16230007 TgA,TgA,TgA,TgA TgA,TgA,TgA,TgA AgG TgA AgA,AgA,|gA,|gA,TgA,TgA AgA,AgA,|gA,|gA,TgA,TgA AgA,TgG
16230011 GgT,GgT,GgT,GgT GgT,GgT,GgT,GgT CgA GgT CgA,GgT CgA,GgT GgA,Gg|,GgT,CgA,Cg|,CgT
16230049 AgG,AgG,AgG,AgG AgG,AgG,AgG,AgG TgC AgG . AgC,Ag|,AgG,TgC,Tg|,TgG AgG,AgC,TgG,TgC
16230174 . . TgT CgA . TgA,TgT,|gA,|gT,CgA,CgT CgA,CgT,|gA,|gT,TgA,TgT
16230190 . . TgT AgC GgT,TgC TgT,GgC TgC,GgT
16230260 AgA,AgA,AgA,AgA AgA,AgA,AgA,AgA GgT AgA GgG,Gg|,GgT,GgG,Gg|,GgT GgT,Gg|,GgG,GgT,Gg|,GgG GgT,Gg|,GgG,GgT,Gg|,GgG
16230260 AgA,AgA,AgA,AgA AgA,AgA,AgA,AgA GgG AgA GgG,GgG,GgG,GgG GgG,GgG,GgG,GgG GgG,GgG,GgG,GgG
16232772 AgA,AgA,AgA,AgA AgA,AgA,AgA,AgA CgG AgA CgG,CgG,CgG,CgG CgG,CgG,CgG,CgG CgG,CgG,CgG,CgG
16232793 CgA,CgA,CgA,CgA CgA,CgA,CgA,CgA TgC CgA TgC,TgC,TgC,TgC TgC,TgC,TgC,TgC TgC,TgC,TgC,TgC
16232793 CgC,CgC,CgC,CgC CgC,CgC,CgC,CgC TgT CgC TgT,TgT,TgT,TgT TgT,TgT,TgT,TgT TgT,TgT,TgT,TgT
16232282 TgC,TgC,TgC,TgC TgC,TgC,TgC,TgC TgT AgC AgT,AgT,AgT,AgT AgT,AgT,AgT,AgT AgT,AgT,AgT,AgT
So there are several malformed entries in the output file, for example GgG,Gg|,GgT,GgG,Gg|,GgT in column Sk1 of the first line for position 16230260.
I am trying to get rid of these.
The problem is in this part of the code:
if '/' in cstr:
    sep = '/'
    fun = it.product
It goes wrong for certain values of c (a list), which I inspected by printing inside that branch:
if '/' in cstr:
    print(c)
    print(type(c))
    sep = '/'
    fun = it.product
Some of the c values (read from two lines because of the shift) have the following structure, which I think is the problem:
[['C', '|', 'G'], ['G', '/', 'G']]
<class 'list'>
So it.product combines the pipe (|) with the elements of the other list, and the filter only removes tuples that contain '/'.
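The effect can be reproduced outside pandas. This small sketch feeds the mixed value to itertools.product and applies the same filter that mapfun uses in the '/' branch; the variable names are only for illustration.

```python
import itertools as it

c = [['C', '|', 'G'], ['G', '/', 'G']]

# Every pairing of one element from each list, separators included
pairs = list(it.product(*c))

# The '/' branch of mapfun only drops tuples that contain '/'
kept = [t for t in pairs if '/' not in t]
result = ','.join('g'.join(t) for t in kept)
print(result)  # CgG,CgG,|gG,|gG,GgG,GgG
```

The two |gG entries survive because '|' is never checked when sep is '/'. This matches the value in column Sj at position 16229783 in the output.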
I tried:
if '/' in cstr:
    for x in c:
        while '|' in x:
            x.remove('|')
    # I think this does not update c, but it sometimes affects the c in
    # other columns, as if the condition met on the previous line carried over.
    sep = '/'
    fun = it.product
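A possible explanation for that side effect, as a sketch: list addition creates a new outer list but reuses the same inner lists, so removing an element in place also changes what a later line sees. This assumes pandas hands the same inner list objects through the shift and the addition, which is how plain Python list concatenation behaves; the names row_a, row_b and row_c are hypothetical.

```python
row_a = [['G', '/', 'G']]  # cell of one line, as built by [list(c)]
row_b = [['C', '|', 'G']]  # cell of the next line
row_c = [['A', '/', 'A']]  # cell of the line after that

# Concatenation copies references, not the inner lists themselves
c_for_b = row_b + row_a
c_for_c = row_c + row_b

# In-place removal while handling row_b ...
for x in c_for_b:
    while '|' in x:
        x.remove('|')

# ... is visible when row_c is handled, because the inner list is shared
print(c_for_c)  # [['A', '/', 'A'], ['C', 'G']]
```

Building new lists instead of calling remove leaves the shared data untouched.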
I also tried:
for x in c:
    while '|' in x:
        c == list(''.join(x).strip('|') for x in c)
The idea was to convert each list to a string, strip the pipe (|), and convert it back to a list, but I ran into an error.
So the question is: how do I remove the pipe (|), if there is one, from c before running it.product on values such as:
[['C', '|', 'G'], ['G', '/', 'G']]
<class 'list'>
The expected output for either of the following values of c:
[['C', '|', 'G'], ['G', '/', 'G']]
or [['C', '/', 'G'], ['G', '/', 'G']]
is the same: CgG,CgG,GgG,GgG
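One way to get that output, as a sketch rather than a definitive fix, is to strip both separators into new lists before combining, so that neither it.product nor zip ever sees '/' or '|'. The name mapfun_sketch and the SEPARATORS constant are hypothetical; the choice between it.product and zip follows the same rule as the original mapfun.

```python
import itertools as it

SEPARATORS = {'/', '|'}


def mapfun_sketch(c):
    """Variant of mapfun that removes separators before combining."""
    if any('.' in x for x in c):
        return '.'
    # Same rule as the original: any '/' means product, otherwise zip
    use_product = any('/' in x for x in c)
    # Build new lists so the lists stored in the frame are not mutated
    alleles = [[a for a in x if a not in SEPARATORS] for x in c]
    fun = it.product if use_product else zip
    return ','.join('g'.join(t) for t in fun(*alleles))


print(mapfun_sketch([['C', '|', 'G'], ['G', '/', 'G']]))  # CgG,CgG,GgG,GgG
print(mapfun_sketch([['C', '/', 'G'], ['G', '/', 'G']]))  # CgG,CgG,GgG,GgG
```

Values where both lines use '|' still go through zip and give the same result as before, for example AgA,TgG for A|T preceded by A|G.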