How can I use groupby and filter functions on a dataframe without ending up with duplicates?

Question

I have a CSV file with a simple table like this:

ID           PER    SEQS    SEQE    dire
AC037199.2  68.027  9674    9818    A
AC037199.2  68.919  19131   18996   A
AF243527.1  68.919  75530   75395   A
AF243527.1  70.192  97025   96928   A
XM_01194.1  73.077  133     230     A
XM_01194.1  71.605  367     525     A

For context, these are IDs of different GenBank entries and are showing the length of each nucleotide sequence that was a match to my query sequence. Dire is a placeholder column

I am using python (specifically numpy and pandas) to:

Group the entries based on the ID column
Using a simple SEQS > SEQE, identify if the reads are forwards (increasing in size), backwards (decreasing in size) or both. Then modify the 'dire' column to either FOR or BACK
Print these results into a new CSV file (creatively called "for.csv", "back.csv" or both.csv") which shows me what reads the directions are going in for each ID.

I have managed to figure out how to do this using the command line (thank you awk and sed!) but if I could get it done in 1 script as opposed to a few lines, that would be superb.

Using examples from HERE and HERE I have managed to get a script that nearly achieves this:

import pandas as pd
import numpy as np
df = pd.read_csv('BLAtest.csv')
df['dire'] = np.where(df['SEQS']<df['SEQE'],'FOR','BACK') #
forw = df.groupby("ID").filter(lambda x: any(x['dire'] == 'FOR') & any(x['dire'] != 'BACK'))
back = df.groupby("ID").filter(lambda x: any(x['dire'] == 'BACK')& any(x['dire'] != 'FOR'))
both = df.groupby("ID").filter(lambda x: any((x['dire'] == 'BACK')) & any(x['dire'] == 'FOR'))
forw.to_csv('forw.csv')
back.to_csv('back.csv')
both.to_csv('test.csv')

The problem: I am getting entries in for.csv and back.csv that have entires that should only be printed in both.csv. I am only interested in the grouped IDs which are all one direction. Using the output from above as an example:

ID           PER    SEQS    SEQE    dire
AC037199.2  68.027  9674    9818    FOR
AC037199.2  68.919  19131   18996   BACK
AF243527.1  68.919  75530   75395   BACK
AF243527.1  70.192  97025   96928   BACK
XM_01194.1  73.077  133     230     FOR
XM_01194.1  71.605  367     525     FOR

In my for.csv, I am getting both AC037199.2 and XM_01194.1, when I only want XM_01194.1!

How can I modify my script to avoid these duplicates in the first place? Or can I modify it as is to remove these afterwards?

Thank you kindly in advance for any help and hope I've explained myself well enough. I'm hoping I can be nearly done with this bulk of data by the end of the week and it's just this final hurdle!!

score 0 · Accepted Answer · answered Oct 15 '21 at 17:09

RIGHT, figured it out. Was a silly question in hindsight but here it is in case anyone else gets stuck like I did.

Simply put, running a simple isin argument to run the individual reads against both was what did the trick. I also removed the unnecessary arguments from line 4 and 5 to make it a little cleaner. So the updated code:

import pandas as pd
import numpy as np
df = pd.read_csv('BLAtest.csv')
df['dire'] = np.where(df['SEQS']<df['SEQE'],'FOR','BACK') 
forw = df.groupby("ID").filter(lambda x: (x['dire'] == 'FOR').any()) 
back = df.groupby("ID").filter(lambda x: (x['dire'] == 'BACK').any())
both = df.groupby("ID").filter(lambda x: (x['dire'] == 'BACK').any() & (x['dire'] == 'FOR').any())
propfor = (forw[~forw.ID.isin(both.ID)])
propback = (back[~back.ID.isin(both.ID)])
propfor.to_csv('forward.csv')
propback.to_csv('backwards.csv')
both.to_csv('test.csv')

Now does the job!

I got the code for the isin argument from HERE Not sure why I didn't do this earlier but I guess I was so stuck on trying to get filter to work, I didn't consider other options. This could probably be condensed into a much shorter few lines but it works and I understand why/how so it's okay for now!

How can I use groupby and filter functions on a dataframe without ending up with duplicates?

1 Answers1