
I have the following dataset (mcve_01.txt):

pos         M1     M2      F1_x     F1_y    Sk1     S2    Sj
16230484    G/G   G/G       G       T        T/T    T/T   T/T
16230491    C/C   C/C       C       T        T/T    .     T/T
16230503    T/T   T/T       T       T        T/T    .     T/T
16230524    T/T   T/T       T       A        A/A    A/A   A/A
16230535    .     .         T       C        .      .       .
16232072    A/A   A/A       A       G        G/G    G/G   G/G
16232072    A/A   A/A       A       G        G/G    G/G   G/G
16229783    C/C   C/C       G       C        G/C    G/C   C|G
16229992    A/A   A/A       G       A        A/A    A/A   A|G
16230007    T/T   T/T       A       T        A|T    A|T   A|T
16230011    G/G   G/G       C       G        C|G    C|G   G/C
16230049    A/A   A/A       T       A        .      A/T   A/T
16230174    .      .        T       C        T|C    T|C   C|T
16230190    A/A   A/A       T       A        G|T    T|G   T|G
16230260    A/A   A/A       G       A        G/G    G/G   G/G
16230260    A/A   A/A       G       A        G/G    G/G   G/G
16232772    A/A   A/A       C       A        C/C    C/C   C/C
16232793    C/C   C/C       T       C        T/T    T/T   T/T
16232793    C/C   C/C       T       C        T/T    T/T   T/T
16232282    T/T   T/T       T       A        A/A    A/A   A/A

I am trying to run a Markov model on it.

Below is my code:

import pandas as pd
import itertools as it

mcve_data = pd.read_csv('mcve_01.txt', sep='\t')

mcve_data.set_index(['pos'], inplace = True)

mcve_list = mcve_data.applymap(lambda c:[list(c)])

Note: I have to convert the value in each column to a list so that I can run itertools.product or zip on it, depending on the condition.

def mapfun(c):
    cstr = ''.join(map(str, c))
    if '.' in cstr:
        return '.'

    if '/' in cstr:
        sep = '/'
        fun = it.product

    else:
        sep = '|'
        fun = zip

    return ','.join('g'.join(t) for t in fun(*c) if sep not in t)

Now (below), I apply the function to do the Markov modeling.

mcve_mm = (mcve_list+mcve_list.shift(1)).dropna(how='all').\
applymap(mapfun)

Note: In the above code, (mcve_list+mcve_list.shift(1)) combines the values from two consecutive lines of the same column so that the Markov chain can be applied.
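The pairing step can be pictured without pandas. The following is a minimal sketch in plain Python, assuming a single column held as a list of genotype strings; the names column, as_lists and paired are illustrative only and do not appear in the code above. Each position is paired with the position before it, current row first, which mirrors what the shift-and-add produces for one column.

```python
# One column of genotypes in file order (illustrative values in the style of Sj).
column = ["T/T", "T/T", "C|G", "A|G"]

# Split each genotype into characters, as list(c) does in the applymap call.
as_lists = [list(g) for g in column]

# Pair each row with the row before it: [current, previous].
# The first row has no predecessor, so it produces no pair.
paired = [[cur, prev] for prev, cur in zip(as_lists, as_lists[1:])]

for pair in paired:
    print(pair)
```

Each element of paired has the same shape as the c that mapfun receives, for example [['C', '|', 'G'], ['T', '/', 'T']].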

print(mcve_mm)

mcve_mm.to_csv('mcve_mm.txt', sep='\t', index=True)

The output (mcve_mm.txt) is:

    pos     M1          M2          F1_x    F1_y    Sk1             S2              Sj
16230491    CgG,CgG,CgG,CgG     CgG,CgG,CgG,CgG     CgG TgT TgT,TgT,TgT,TgT         .               TgT,TgT,TgT,TgT
16230503    TgC,TgC,TgC,TgC     TgC,TgC,TgC,TgC     TgC TgT TgT,TgT,TgT,TgT         .               TgT,TgT,TgT,TgT
16230524    TgT,TgT,TgT,TgT     TgT,TgT,TgT,TgT     TgT AgT AgT,AgT,AgT,AgT         .               AgT,AgT,AgT,AgT
16230535    .           .           TgT CgA .               .               .
16232072    .           .           AgT GgC .               .               .
16232072    AgA,AgA,AgA,AgA     AgA,AgA,AgA,AgA     AgA GgG GgG,GgG,GgG,GgG         GgG,GgG,GgG,GgG         GgG,GgG,GgG,GgG
16229783    CgA,CgA,CgA,CgA     CgA,CgA,CgA,CgA     GgA CgG GgG,GgG,CgG,CgG         GgG,GgG,CgG,CgG         CgG,CgG,|gG,|gG,GgG,GgG
16229992    AgC,AgC,AgC,AgC     AgC,AgC,AgC,AgC     GgG AgC AgG,AgC,AgG,AgC         AgG,AgC,AgG,AgC         AgC,GgG
16230007    TgA,TgA,TgA,TgA     TgA,TgA,TgA,TgA     AgG TgA AgA,AgA,|gA,|gA,TgA,TgA     AgA,AgA,|gA,|gA,TgA,TgA     AgA,TgG
16230011    GgT,GgT,GgT,GgT     GgT,GgT,GgT,GgT     CgA GgT CgA,GgT CgA,GgT         GgA,Gg|,GgT,CgA,Cg|,CgT
16230049    AgG,AgG,AgG,AgG     AgG,AgG,AgG,AgG     TgC AgG .               AgC,Ag|,AgG,TgC,Tg|,TgG     AgG,AgC,TgG,TgC
16230174    .           .           TgT CgA .               TgA,TgT,|gA,|gT,CgA,CgT     CgA,CgT,|gA,|gT,TgA,TgT
16230190    .           .           TgT AgC GgT,TgC             TgT,GgC             TgC,GgT
16230260    AgA,AgA,AgA,AgA     AgA,AgA,AgA,AgA     GgT AgA GgG,Gg|,GgT,GgG,Gg|,GgT     GgT,Gg|,GgG,GgT,Gg|,GgG     GgT,Gg|,GgG,GgT,Gg|,GgG
16230260    AgA,AgA,AgA,AgA     AgA,AgA,AgA,AgA     GgG AgA GgG,GgG,GgG,GgG         GgG,GgG,GgG,GgG         GgG,GgG,GgG,GgG
16232772    AgA,AgA,AgA,AgA     AgA,AgA,AgA,AgA     CgG AgA CgG,CgG,CgG,CgG         CgG,CgG,CgG,CgG         CgG,CgG,CgG,CgG
16232793    CgA,CgA,CgA,CgA     CgA,CgA,CgA,CgA     TgC CgA TgC,TgC,TgC,TgC         TgC,TgC,TgC,TgC         TgC,TgC,TgC,TgC
16232793    CgC,CgC,CgC,CgC     CgC,CgC,CgC,CgC     TgT CgC TgT,TgT,TgT,TgT         TgT,TgT,TgT,TgT         TgT,TgT,TgT,TgT
16232282    TgC,TgC,TgC,TgC     TgC,TgC,TgC,TgC     TgT AgC AgT,AgT,AgT,AgT         AgT,AgT,AgT,AgT         AgT,AgT,AgT,AgT

So, there are several odd values in the output file, such as GgG,Gg|,GgT,GgG,Gg|,GgT in line 16230260.

I am trying to get rid of this kind of problem.

The problem is with the code at:

    if '/' in cstr:
        sep = '/'
        fun = it.product

when c (the list) has a structure like the one printed here:

    if '/' in cstr:
        print(c)
        print(type(c))
        sep = '/'
        fun = it.product

Some of the c values (read from two lines because of the shift) have the following structure, which I think is the problem.

[['C', '|', 'G'], ['G', '/', 'G']]

<class 'list'>

So, it.product is combining the pipe (|) with the elements of the other list.
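The effect can be reproduced in isolation. This is a small sketch that applies the same product-and-filter step as mapfun to the mixed pair shown above; the variable names pairs and result are illustrative.

```python
import itertools as it

c = [['C', '|', 'G'], ['G', '/', 'G']]

# The filter in mapfun only drops tuples containing the chosen separator '/',
# so tuples that contain the other separator '|' survive.
sep = '/'
pairs = [t for t in it.product(*c) if sep not in t]

result = ','.join('g'.join(t) for t in pairs)
print(result)  # CgG,CgG,|gG,|gG,GgG,GgG
```

This matches the Sj value at position 16229783 in the output file.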

I tried:

if '/' in cstr:
    for x in c:
       while '|' in x:
            x.remove('|')  

# But I think this does not update c; instead it sometimes affects c in other columns, carrying over a change made for the previous line.

    sep = '/'
    fun = it.product
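One likely explanation for the carry-over, assuming the shifted frame holds references to the same list objects rather than copies, is that x.remove('|') mutates a list that also belongs to the next pair. The sketch below shows that aliasing in plain Python and a non-mutating alternative; rows, pairs, fresh and cleaned are hypothetical names.

```python
rows = [['G', '/', 'G'], ['C', '|', 'G'], ['A', '/', 'T']]

# Each pair is [current, previous]; rows[1] is shared by both pairs.
pairs = [[cur, prev] for prev, cur in zip(rows, rows[1:])]

# In-place removal while handling the first pair...
for x in pairs[0]:
    while '|' in x:
        x.remove('|')

# ...has already changed the second pair, because it holds the same list.
print(pairs[1][1])  # ['C', 'G']

# Non-mutating alternative: build new lists and leave the input alone.
fresh = [['C', '|', 'G'], ['G', '/', 'G']]
cleaned = [[a for a in x if a != '|'] for x in fresh]
print(cleaned)  # [['C', 'G'], ['G', '/', 'G']]
print(fresh)    # [['C', '|', 'G'], ['G', '/', 'G']]
```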

I also tried:

    for x in c:
       while '|' in x:
            c == list(''.join(x).strip('|') for x in c)

to convert each list to a string, strip the pipe (|), and convert it back to a list, but I ran into an error.

So, the question is: how do I remove the pipe (|), if there is one, from c before running it.product on values of c like:

[['C', '|', 'G'], ['G', '/', 'G']]

<class 'list'>

The expected output for the following kinds of c:

[['C', '|', 'G'], ['G', '/', 'G']] or [['C', '/', 'G'], ['G', '/', 'G']]

is the same: CgG, CgG, GgG, GgG


1 Answer

I would suggest changing the function as follows:

from itertools import product
from functools import partial

def mapfun(c):
    if any(['.' in l for l in c]):
        return '.'

    if all(['|' in l for l in c]):
        fun = zip
    else:
        fun = product

    return ','.join('g'.join(t) for t in fun(*map(mapfun.filt,c)))

mapfun.filt_set = set(['|','/'])
mapfun.filt = partial(filter,lambda l: not (l in mapfun.filt_set))

print(mapfun([['C', '|', 'G'], ['G', '|', 'G']]))
print(mapfun([['C', '/', 'G'], ['G', '/', 'G']]))
print(mapfun([['C', '|', 'G'], ['G', '/', 'G']]))
print(mapfun([['C', '/', 'G'], ['G', '|', 'G']]))

This yields the output:

CgG,GgG
CgG,CgG,GgG,GgG
CgG,CgG,GgG,GgG
CgG,CgG,GgG,GgG

i.e. zip is used for the first example and itertools.product for all following examples.
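To make the difference between the two concrete, here is a short sketch that runs zip and product on the same pair of allele lists once the separators have been stripped; current, previous, zipped and crossed are illustrative names.

```python
from itertools import product

# Alleles of the current and previous position, separators already removed.
current, previous = ['C', 'G'], ['G', 'G']

# zip pairs alleles position by position (first with first, second with second).
zipped = ['g'.join(t) for t in zip(current, previous)]

# product pairs every allele of one list with every allele of the other.
crossed = ['g'.join(t) for t in product(current, previous)]

print(zipped)   # ['CgG', 'GgG']
print(crossed)  # ['CgG', 'CgG', 'GgG', 'GgG']
```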

Explanation:

  • To figure out whether either condition ('.' present in any argument, or '|' present in all arguments) is true, a list comprehension is used. For example, ['.' in l for l in c] is a list of boolean values, each true if and only if the corresponding argument contains a dot. Then any is used to check whether any of the arguments contains a '.'.
  • The variable filt is defined outside of mapfun so that it does not have to be recomputed on every call to mapfun. To avoid polluting the namespace, it is added as an attribute of the function object (see What is the Python equivalent of static variables inside a function?)
  • Note that partial(filter, f) is the same as lambda x: filter(f,x)
  • The lambda inside partial simply checks whether an element is in filt_set and should therefore be removed
  • *map(mapfun.filt,c) simply filters all arguments using mapfun.filt before passing them as arguments to the selected function fun
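For readers who find partial and function attributes hard to follow, the same logic can be written with a small helper instead. This is only a sketch of an equivalent form, not a change to the approach; SEPARATORS, strip_separators and mapfun_plain are hypothetical names introduced here.

```python
from itertools import product

SEPARATORS = {'|', '/'}  # hypothetical module-level constant


def strip_separators(alleles):
    """Return a new list without '|' or '/'; the input list is not modified."""
    return [a for a in alleles if a not in SEPARATORS]


def mapfun_plain(c):
    # Missing data in either position gives '.'.
    if any('.' in l for l in c):
        return '.'
    # zip only when every genotype in the pair is phased ('|'), else product.
    fun = zip if all('|' in l for l in c) else product
    return ','.join('g'.join(t) for t in fun(*map(strip_separators, c)))


print(mapfun_plain([['C', '|', 'G'], ['G', '|', 'G']]))  # CgG,GgG
print(mapfun_plain([['C', '|', 'G'], ['G', '/', 'G']]))  # CgG,CgG,GgG,GgG
```

Because strip_separators builds a new list each time, the lists stored in the DataFrame are never modified.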