0

I have a big text file like this small example:

small example:

AAMP    chr2    219130810   219134433   transcript
AAMP    chr2    219132103   219134868   transcript
AARS    chr16   70286198    70323446    transcript
AARS    chr16   70287359    70292118    transcript
AARS    chr16   70286198    70323446    transcript
AAMP    chr2    219130810   219134433   transcript
AARS2   chr6    44267391    44281063    transcript

I want to group the rows based on 3 columns (columns 2, 3 and 4). in fact if 2 or more lines have the same values in columns 2, 3 and 4, I want to get only one of the lines. for the small example, the expected output would look like this:

AAMP    chr2    219130810   219134433   transcript
AAMP    chr2    219132103   219134868   transcript
AARS    chr16   70286198    70323446    transcript
AARS    chr16   70287359    70292118    transcript
AARS2   chr6    44267391    44281063    transcript

I am trying to do that in python using pandas. as follow:

data = pd.read_csv("myfile")
df = pd.DataFrame(data)
res = df.groupby([0, 1, 2])
res.to_csv('outfile.txt', index=False)

but it does not return the correct results. do you know how to fix it?

john
  • 263
  • 1
  • 9

1 Answers1

0

The link I posted already had an answer but to solve this specific similar problem

import pandas as pd
a='''AAMP chr2 219130810 219134433 transcript
AAMP chr2 219132103 219134868 transcript
AARS chr16 70286198 70323446 transcript
AARS chr16 70287359 70292118 transcript
AARS chr16 70286198 70323446 transcript
AAMP chr2 219130810 219134433 transcript
AARS2 chr6 44267391 44281063 transcript'''

df=pd.DataFrame([i.split(' ') for i in a.split('\n')])
df.groupby([0,1,2]).first().reset_index()

Output:

AAMP    chr2    219130810   219134433   transcript
AAMP    chr2    219132103   219134868   transcript
AARS    chr16   70286198    70323446    transcript
AARS    chr16   70287359    70292118    transcript
AARS2   chr6    44267391    44281063    transcript
mad_
  • 8,121
  • 2
  • 25
  • 40