
I have a problem: I need to parse the following dataframe:

    cluster_name    qseqid  sseqid  pident_x    qstart  qend    sstar   send
2   1   seq1_0035_0035  seq13_0042_0035 0.73    42  133 46  189
3   1   seq1_0035_0035  seq13_0042_0035 0.73    146 283 287 389
4   1   seq1_0035_0035  seq13_0042_0035 0.73    301 478 402 503
5   1   seq13_0042_0035 seq1_0035_0035  0.73    46  189 42  133
6   1   seq13_0042_0035 seq1_0035_0035  0.73    287 389 146 283
7   1   seq13_0042_0035 seq1_0035_0035  0.73    402 503 301 478
8   2   seq4_0042_0035  seq2_0035_0035  0.71    256 789 125 678
9   2   seq4_0042_0035  seq2_0035_0035  0.71    802 1056    706 985
10  2   seq4_0042_0035  seq7_0035_0042  0.83    123 745 156 723
12  4   seq11_0035_0035 seq14_0042_0035 0.89    145 647 236 921
13  4   seq11_0035_0035 seq17_0042_0042 0.97    148 623 241 1002
14  5   seq17_0035_0042 seq17_0042_0042 0.94    188 643 179 746

Explanation of the dataframe and blast output:

  • cluster_name : the cluster in which one, two or more paired sequences are present.
  • qseqid : the name of one sequence.
  • sseqid : the name of another sequence. Together they make one comparison, qseqid vs sseqid.
  • pident_x : the score of the comparison (alignment) of these two sequences; 1 means they are identical. When blast aligns the two sequences, it gives me the coordinates of where my sequences are aligned ("homologous") within the alignment. For example, if I have:

                    10            24
    seq1 : AAATTTCCCGGGATGCGATGACGATGAAAAAATTTGG
           xxxxxxxxx!!!!!!!!!!!!!!!xxxxxxxxxxxxx
    seq2 : GATGAGATCGGGATGCGATGAGGAGATAGAGATAGAG

where x is a difference and ! is a match. For this alignment, blast will give me: qstart (start of the match on the first sequence): 10, qend (end of the match on the first sequence): 24, sstar (start of the match on the second sequence): 10, send (end of the match on the second sequence): 24.

Note: this is an example; the coordinates do not necessarily begin at 0.

What I actually want is to get, within each cluster, only the comparison with the maximum pident_x. The issue is that I can have reversed comparisons (if you look at rows 2, 3, 4 and 5, 6, 7, they are the same but reversed), because blast compares every pair of sequences, including reciprocal ones. So I also need to keep only one orientation of each pair, for example only rows 2, 3 and 4.

The output would then be:

cluster_name    qseqid  sseqid  pident_x    qstart  qend    sstar   send
    2   1   seq1_0035_0035  seq13_0042_0035 0.73    42  133 46  189
    3   1   seq1_0035_0035  seq13_0042_0035 0.73    146 283 287 389
    4   1   seq1_0035_0035  seq13_0042_0035 0.73    301 478 402 503
    10  2   seq4_0042_0035  seq7_0035_0042  0.83    123 745 156 723
    13  4   seq11_0035_0035 seq17_0042_0042 0.97    148 623 241 1002
    14  5   seq17_0035_0042 seq17_0042_0042 0.94    188 643 179 746

Indeed, for cluster 1: seq1_0035_0035 vs seq13_0042_0035 has its reverse seq13_0042_0035 vs seq1_0035_0035, but I only keep the first one.

For cluster 2: seq4_0042_0035 vs seq7_0035_0042 (0.83) has a better pident score than seq4_0042_0035 vs seq2_0035_0035 (0.71).

For cluster 4: seq11_0035_0035 vs seq17_0042_0042 (0.97) has a better pident score than seq11_0035_0035 vs seq14_0042_0035 (0.89).

For cluster 5: there is only one paired sequence, seq17_0035_0042 vs seq17_0042_0042 (0.94), so I keep this one.

I do not really know how to do such a thing. Does anyone have an idea?
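To make the goal concrete, here is a minimal pandas sketch of what I mean (the dataframe name df and the helper names key, orient, dedup and best are only illustrative):

import pandas as pd

# Assumes df holds the table above, with columns cluster_name, qseqid,
# sseqid, pident_x, qstart, qend, sstar, send.

# Order-independent key: a pair and its reversed pair get the same key.
key = df[["qseqid", "sseqid"]].apply(lambda r: tuple(sorted(r)), axis=1)

# Keep only the orientation in which each pair first occurs, so the three
# alignment rows 2-4 all survive while their reverses 5-7 are dropped.
orient = df["qseqid"] <= df["sseqid"]
dedup = df[orient == orient.groupby(key).transform("first")]

# Within each cluster, keep the row(s) carrying the maximum pident_x.
best = dedup[dedup["pident_x"] == dedup.groupby("cluster_name")["pident_x"].transform("max")]
print(best)

On the table above this keeps exactly rows 2, 3, 4, 10, 13 and 14, i.e. the desired output.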

part added:

Here is the script I used on this small dataset (the same as in my example above): smalldata

import pandas as pd

blast = pd.read_csv("match.csv", header=0)

#blast=blast.drop(columns=[ "length", "mismatch", "gapopen", "evalue", "bitscore","pident"])

#Cluster Dataframe
cluster=pd.read_csv("cluster_test.csv",header=0)
cluster.columns = ["cluster_name", "seq_names"]

#Distance mean dataframe
dist=pd.read_csv("fnode.csv",header=0)
dist.columns = ["qseqid", "sseqid","pident","coverage"]
dist=dist.drop(columns=["coverage"])

#Including cluster information and distance mean information into one dataframe:
data = cluster.merge(dist, left_on='seq_names', right_on='qseqid')

#Adding a concatenated column to each of the two remaining dataframes
data["name_concatened"] = data["qseqid"].map(str) + data["sseqid"]
blast["name_concatened"] = blast["qseqid"].map(str) + blast["sseqid"]
#We do not need these columns anymore
blast=blast.drop(columns=[ "qseqid","sseqid"])

#Including cluster information + distance mean information  + coordinate sequences from blast into one dataframe:
data = data.merge(blast, left_on='name_concatened', right_on='name_concatened')
data=data.drop(columns=[ "seq_names","name_concatened","pident_y"])

print(data)

this = data[["qseqid", "sseqid"]].apply(tuple, axis=1)
cum = pd.get_dummies(data[["sseqid", 'qseqid']].apply(tuple, axis=1)).cumsum()

this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(data.index, this)

data=data[keep.astype(bool)]

print(data)

But as you can see here I only get:

  cluster_name           qseqid          sseqid  pident_x  qstart  qend  \
4             1  seq13_0042_0035  seq1_0035_0035      0.73      46   189   
5             1  seq13_0042_0035  seq1_0035_0035      0.73     287   389   
6             1  seq13_0042_0035  seq1_0035_0035      0.73     402   503   

   sstar  send  
4     42   133  
5    146   283  
6    301   478   

and I should get:

cluster_name    qseqid  sseqid  pident_x    qstart  qend    sstar   send
        2   1   seq1_0035_0035  seq13_0042_0035 0.73    42  133 46  189
        3   1   seq1_0035_0035  seq13_0042_0035 0.73    146 283 287 389
        4   1   seq1_0035_0035  seq13_0042_0035 0.73    301 478 402 503
        10  2   seq4_0042_0035  seq7_0035_0042  0.83    123 745 156 723
        13  4   seq11_0035_0035 seq17_0042_0042 0.97    148 623 241 1002
        14  5   seq17_0035_0042 seq17_0042_0042 0.94    188 643 179 746

Here are my real data: datas

Here is your example:

df = pd.DataFrame({'a': [1, 1, 4, 5, 2, 5], 'b': [7, 5, 2, 1, 4, 2]})
this = df[['a', 'b']].apply(tuple, axis=1)
cum = pd.get_dummies(df[['b', 'a']].apply(tuple, axis=1)).cumsum()
this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1)
keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(df.index, this)
df=df[keep.astype(bool)]
print(df)

and my result:

 a  b
3  5  1
4  2  4
  • I'm going to erase my answer for now, until I find something better. – Ami Tavory May 07 '18 at 16:30
  • OK, no problem, thanks – Grendel May 07 '18 at 16:37
  • For all of us not familiar with [BLAST sequences](https://stackoverflow.com/questions/tagged/blast), you need to explain how `(qstart,qend)` and `(sstar,send)` work. Before diving into code. Why can't we just avoid duplicates by simply arbitrarily sorting tuples (qs,qe) <-> (ss,se) to be in increasing order? There's lots of detail here, but I can't see a clear statement of the problem. – smci May 08 '18 at 10:45
  • Hi, I updated my first post to explain the dataframe and blast output in more detail. But the real problem here is indeed that if I have, in cluster 1, seq1 vs seq2 and seq2 vs seq1, I should only keep the row with seq1 vs seq2, for example, not both. In my issue only the first three columns matter; the others are not so important. – Grendel May 08 '18 at 11:11
  • Is it clear, do you understand what my problem is? @smci – Grendel May 08 '18 at 13:05
  • Could you clean up your code a little, e.g. the read_csv commands can be one-liners: `dist = pd.read_csv("fnode.csv", header=0, names=["qseqid", "sseqid","pident","coverage"], usecols=["qseqid", "sseqid","pident"])` with an implicit drop. (and even shorter if those files just had a one-line header) – smci May 11 '18 at 00:35

2 Answers


If you create a tuple out of the columns, then perform a cumulative sum, you can check if the reversed pair already appears in the cumulative sum:

df[~pd.DataFrame({
    'tup': df[['sseqid', 'qseqid']].apply(tuple, axis=1), 
    'inv_tups': df[['qseqid', 'sseqid']].apply(lambda t: (tuple(t), ), axis=1).cumsum().shift(1)}
).apply(lambda r: isinstance(r.inv_tups, tuple) and r.tup in r.inv_tups, axis=1)]
  • df[['sseqid', 'qseqid']].apply(tuple, axis=1) creates a tuple out of the two columns for each row, in reversed order.

  • df[['qseqid', 'sseqid']].apply(lambda t: (tuple(t), ), axis=1).cumsum().shift(1) wraps each (qseqid, sseqid) pair in a 1-tuple and cumulatively sums them; since adding tuples concatenates them, each row accumulates all pairs from earlier rows (see the small illustration below).

  • The rest checks, for each row, whether its reversed tuple is contained in the accumulated earlier pairs, and keeps the row only if it is not.
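The cumulative sum works because adding tuples in Python concatenates them: after cumsum, row i holds one tuple containing every pair from rows 0..i, and shift(1) turns that into "every pair from strictly earlier rows". A tiny illustration with made-up pairs:

import pandas as pd

s = pd.Series([((1, 7),), ((1, 5),), ((5, 1),)])  # one 1-tuple per row
print(s.cumsum())
# 0                   ((1, 7),)
# 1            ((1, 7), (1, 5))
# 2    ((1, 7), (1, 5), (5, 1))
print(s.cumsum().shift(1))  # row i now only sees rows before i
# 0                 NaN
# 1           ((1, 7),)
# 2    ((1, 7), (1, 5))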
Ami Tavory
  • Thanks for your help, but I do not want to remove all duplicated rows: as you can see in my desired output, I still have 3 paired sequences in cluster 1 (they do not have the same sequence coordinates), but I want to remove reciprocal sequences, for example `seqA vs seqB; 10 ;100; 20; 300` and `seqB vs seqA 20; 300; 10 ;100` are reciprocal. Is it clear? – Grendel May 06 '18 at 11:20
  • @Benjamin See update. Not sure why you dropped 8, for example - what is its previous reverse duplicate? – Ami Tavory May 06 '18 at 12:05
  • Thanks for your help; in fact 8, 9 and 12 are removed because they have a lower pident within their cluster. For example, 8 and 9 in cluster 2 have a lower pident than 10, so only 10 is kept, and row 12's pident is lower than 13's in cluster 4. But I think I can manage to do that part. I'll let you know, but anyway thank you for your help, you helped me a lot. – Grendel May 06 '18 at 12:14
  • Ah, I think I get it now. All the best. – Ami Tavory May 06 '18 at 12:39
  • Sorry, but I tried your code and I do not get a good result (see the update at the end of my first post with the script I used) – Grendel May 06 '18 at 13:40
  • @Benjamin I will look at it in a few hours. – Ami Tavory May 06 '18 at 13:41
  • Could you also comment a bit on what the arguments are doing? – Grendel May 06 '18 at 20:38
  • @Benjamin Sorry, the previous solution was wrong. LMK if you have problems with the update. – Ami Tavory May 06 '18 at 21:24
  • Oh, it works, I'm so thankful! Just in case: I'm trying to run it on my real data with around 200,000 rows; do you know if it will take a lot of time and memory? It has been running for around 20 minutes. – Grendel May 06 '18 at 22:43
  • OK, it looks like my computer did not like it at all. Do you know what could be the reason for the long runtime and the large memory use? And can we do something to deal with a big dataframe? – Grendel May 07 '18 at 11:17
  • Hey, my apologies for that. This problem is more difficult than I thought. I'll think about a better solution later this evening and write again. Sorry. – Ami Tavory May 07 '18 at 11:22
  • OK, thank you for giving me your time, it is very kind. – Grendel May 07 '18 at 11:36

Here is another answer, hopefully a more efficient one.

I'll focus on the main point of your question: checking whether the reversed pair from a couple of columns occurred in a previous row. As an example, I'll use

df = pd.DataFrame({'a': [1, 1, 4, 5, 2, 5], 'b': [7, 5, 2, 1, 4, 2]})
>>> df
    a   b
0   1   7
1   1   5
2   4   2
3   5   1
4   2   4
5   5   2

Let's find the tuples of each row:

this = df[['a', 'b']].apply(tuple, axis=1)
>>> this
0    (1, 7)
1    (1, 5)
2    (4, 2)
3    (5, 1)
4    (2, 4)
5    (5, 2)
dtype: object

Here is the cumulative sum of reverse tuples' appearances:

cum = pd.get_dummies(df[['b', 'a']].apply(tuple, axis=1)).cumsum()
>>> cum
   (1, 5)  (2, 4)  (2, 5)  (4, 2)  (5, 1)  (7, 1)
0       0       0       0       0       0       1
1       0       0       0       0       1       1
2       0       1       0       0       1       1
3       1       1       0       0       1       1
4       1       1       0       1       1       1
5       1       1       1       1       1       1

To this we need to append zero-filled columns for the tuples that never appear reversed, so that looking up such a row yields 0 (its reverse was never found):

this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
>>> pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1)
   (1, 5)  (2, 4)  (2, 5)  (4, 2)  (5, 1)  (7, 1)  (1, 7)  (5, 2)
0       0       0       0       0       0       1       0       0
1       0       0       0       0       1       1       0       0
2       0       1       0       0       1       1       0       0
3       1       1       0       0       1       1       0       0
4       1       1       0       1       1       1       0       0
5       1       1       1       1       1       1       0       0

Now we can use this DataFrame to look up the current tuple:

keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(df.index, this)

The looked-up value is 1 exactly for rows whose reversed pair already appeared, so the rows we should keep in the original DataFrame are the complement:

>>> df[~keep.astype(bool)]
    a   b
0   1   7
1   1   5
2   4   2
5   5   2
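Note that the dummies matrix built here has one column per distinct pair, so its size is rows × distinct pairs; that is what can exhaust memory on a table with around 200,000 rows (as happened in the comments below). For reference, the same "drop a row whose reversed pair occurred in an earlier row" rule can also be sketched with a plain set in a single pass (the name df_filtered is only illustrative):

seen = set()
mask = []
for a, b in zip(df['a'], df['b']):
    # Drop the row if its reversed pair was seen in an earlier row.
    mask.append((b, a) not in seen)
    seen.add((a, b))
df_filtered = df[mask]  # keeps rows 0, 1, 2 and 5 of the example df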
Ami Tavory
  • Hi, first, thank you. I added my script and the result I got on my real data at the end of my first post, to show you where my issue is. – Grendel May 08 '18 at 08:13
  • I'll have a look sometime today. – Ami Tavory May 08 '18 at 08:21
  • Oh wait, in fact the dataframe I showed you is a previous print; when I do `data[keep.astype(bool)]` `print(data)` I actually only get `Killed: 9` but no new dataframe. I edited my post. – Grendel May 08 '18 at 08:27
  • That's an indication you're out of memory. This solution is (or should be) faster, but more memory consuming than the previous one. It's possible that you're stretching it with this problem given your hardware (of course, it's possible that there's a better solution out there too). You might want to consider sampling, or some other means of reducing the size. – Ami Tavory May 08 '18 at 08:31
  • OK, I added my dataframe with my script in my previous comment. So if I understood well, it won't be possible to do it with my computer? – Grendel May 08 '18 at 08:33
  • It seems like this solution indeed won't work with your computer on this dataset. Sorry. – Ami Tavory May 08 '18 at 08:36
  • Perhaps a more productive direction would be to describe the problem and what you're trying to achieve, and see how to whittle it down intelligently (by smart sampling, for example). – Ami Tavory May 08 '18 at 08:38
  • In fact, I did a blast all against all with one enormous dataset, but because it is an all against all, it compares sequences with themselves and also in the reversed way (seqA vs seqB and seqB vs seqA), and this for all sequences, so there are a lot of duplicated sequences, as you can see. What I need to get is, within each cluster, the paired sequence with the best pident, to then locate the coordinates of this sequence in my fasta file and extract it, and finally calculate a divergence between all best paired sequences within each cluster. – Grendel May 08 '18 at 08:51
  • I tried your script with a smaller dataset; I updated the result to show you the issue. Indeed, the dataframe I got is only composed of the sequences that are duplicated, but I actually would like to get a new dataframe without these sequences. – Grendel May 08 '18 at 09:28
  • In fact when I took your dataframe as an example, I did not get the same result as you, but only: `a b` `3 5 1` `4 2 4`; see the end of my first post. – Grendel May 08 '18 at 10:28