I have a CSV file with a simple table like this:
ID PER SEQS SEQE dire
AC037199.2 68.027 9674 9818 A
AC037199.2 68.919 19131 18996 A
AF243527.1 68.919 75530 75395 A
AF243527.1 70.192 97025 96928 A
XM_01194.1 73.077 133 230 A
XM_01194.1 71.605 367 525 A
For context, these are IDs of different GenBank entries and are showing the length of each nucleotide sequence that was a match to my query sequence. Dire is a placeholder column
I am using python (specifically numpy and pandas) to:
- Group the entries based on the ID column
- Using a simple
SEQS > SEQE
, identify if the reads are forwards (increasing in size), backwards (decreasing in size) or both. Then modify the 'dire' column to either FOR or BACK - Print these results into a new CSV file (creatively called "for.csv", "back.csv" or both.csv") which shows me what reads the directions are going in for each ID.
I have managed to figure out how to do this using the command line (thank you awk and sed!) but if I could get it done in 1 script as opposed to a few lines, that would be superb.
Using examples from HERE and HERE I have managed to get a script that nearly achieves this:
import pandas as pd
import numpy as np
df = pd.read_csv('BLAtest.csv')
df['dire'] = np.where(df['SEQS']<df['SEQE'],'FOR','BACK') #
forw = df.groupby("ID").filter(lambda x: any(x['dire'] == 'FOR') & any(x['dire'] != 'BACK'))
back = df.groupby("ID").filter(lambda x: any(x['dire'] == 'BACK')& any(x['dire'] != 'FOR'))
both = df.groupby("ID").filter(lambda x: any((x['dire'] == 'BACK')) & any(x['dire'] == 'FOR'))
forw.to_csv('forw.csv')
back.to_csv('back.csv')
both.to_csv('test.csv')
The problem: I am getting entries in for.csv and back.csv that have entires that should only be printed in both.csv. I am only interested in the grouped IDs which are all one direction. Using the output from above as an example:
ID PER SEQS SEQE dire
AC037199.2 68.027 9674 9818 FOR
AC037199.2 68.919 19131 18996 BACK
AF243527.1 68.919 75530 75395 BACK
AF243527.1 70.192 97025 96928 BACK
XM_01194.1 73.077 133 230 FOR
XM_01194.1 71.605 367 525 FOR
In my for.csv, I am getting both AC037199.2 and XM_01194.1, when I only want XM_01194.1!
How can I modify my script to avoid these duplicates in the first place? Or can I modify it as is to remove these afterwards?
Thank you kindly in advance for any help and hope I've explained myself well enough. I'm hoping I can be nearly done with this bulk of data by the end of the week and it's just this final hurdle!!