Manipulating the data in Python using pandas

Question

I have a large text file that looks like this small example:

AAMP    chr2    219130810   219134433   transcript
AAMP    chr2    219132103   219134868   transcript
AARS    chr16   70286198    70323446    transcript
AARS    chr16   70287359    70292118    transcript
AARS    chr16   70286198    70323446    transcript
AAMP    chr2    219130810   219134433   transcript
AARS2   chr6    44267391    44281063    transcript

I want to group the rows based on 3 columns (columns 2, 3 and 4). In fact, if 2 or more lines have the same values in columns 2, 3 and 4, I want to keep only one of them. For the small example, the expected output would look like this:

AAMP    chr2    219130810   219134433   transcript
AAMP    chr2    219132103   219134868   transcript
AARS    chr16   70286198    70323446    transcript
AARS    chr16   70287359    70292118    transcript
AARS2   chr6    44267391    44281063    transcript

I am trying to do that in Python using pandas, as follows:

data = pd.read_csv("myfile")
df = pd.DataFrame(data)
res = df.groupby([0, 1, 2])
res.to_csv('outfile.txt', index=False)

But it does not return the correct results. Do you know how to fix it?

@jezrael to me it looks the same. I don't see a logical difference. — mad_, Nov 08 '18 at 14:43
The `subset` argument to [drop_duplicates](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) is what you need — Maarten Fabré, Nov 08 '18 at 14:47
pd.read_csv() already returns a dataframe - no need to do an extra step — NotSoShabby, Nov 08 '18 at 14:58
Also, you're reading a text file using read_csv, but it seems that your delimiter is not a comma. Use the delim_whitespace or delimiter options of the read_csv function to get the correct dataframe — NotSoShabby, Nov 08 '18 at 15:00

score 0 · Answer 1 · answered Nov 08 '18 at 14:56

The link I posted already has an answer, but to solve this specific, similar problem:

import pandas as pd
a='''AAMP chr2 219130810 219134433 transcript
AAMP chr2 219132103 219134868 transcript
AARS chr16 70286198 70323446 transcript
AARS chr16 70287359 70292118 transcript
AARS chr16 70286198 70323446 transcript
AAMP chr2 219130810 219134433 transcript
AARS2 chr6 44267391 44281063 transcript'''

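# Split each line on single spaces; the columns get integer labels 0..4
df = pd.DataFrame([i.split(' ') for i in a.split('\n')])

# Keep the first row of each group. Note: this groups on labels 0, 1 and 2
# (name, chromosome, start); the question's columns 2, 3 and 4 are labels 1, 2 and 3.
df.groupby([0, 1, 2]).first().reset_index()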

Output:

AAMP    chr2    219130810   219134433   transcript
AAMP    chr2    219132103   219134868   transcript
AARS    chr16   70286198    70323446    transcript
AARS    chr16   70287359    70292118    transcript
AARS2   chr6    44267391    44281063    transcript
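
An alternative, along the lines suggested in the comments on the question, is `drop_duplicates` with its `subset` argument. The following is only a minimal sketch, assuming the file is whitespace-delimited and has no header row. The `io.StringIO` buffer stands in for the real file path, and the variable names (`raw`, `deduped`) are illustrative, not from the original post.

```python
import io

import pandas as pd

# Sample rows from the question. In practice, pass the file path to
# read_csv instead of this in-memory buffer (assumption for the demo).
raw = """AAMP    chr2    219130810   219134433   transcript
AAMP    chr2    219132103   219134868   transcript
AARS    chr16   70286198    70323446    transcript
AARS    chr16   70287359    70292118    transcript
AARS    chr16   70286198    70323446    transcript
AAMP    chr2    219130810   219134433   transcript
AARS2   chr6    44267391    44281063    transcript
"""

# sep=r"\s+" splits on any run of whitespace; header=None because the
# file has no header row, so columns are labelled 0..4.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

# Columns 2, 3 and 4 of the file carry the 0-based labels 1, 2 and 3.
# keep="first" retains the first row of each set of duplicates.
deduped = df.drop_duplicates(subset=[1, 2, 3], keep="first")

# Print tab-separated text without index or header; pass a path as the
# first argument of to_csv to write a file instead.
print(deduped.to_csv(sep="\t", header=False, index=False))
```

Unlike `groupby`, `drop_duplicates` leaves the surviving rows in their original order and keeps the column order unchanged.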
