I have a big text file like this small example:
small example:
AAMP chr2 219130810 219134433 transcript
AAMP chr2 219132103 219134868 transcript
AARS chr16 70286198 70323446 transcript
AARS chr16 70287359 70292118 transcript
AARS chr16 70286198 70323446 transcript
AAMP chr2 219130810 219134433 transcript
AARS2 chr6 44267391 44281063 transcript
I want to group the rows based on 3 columns (columns 2, 3 and 4
). in fact if 2 or more lines have the same values in columns 2, 3 and 4
, I want to get only one of the lines. for the small example, the expected output would look like this:
AAMP chr2 219130810 219134433 transcript
AAMP chr2 219132103 219134868 transcript
AARS chr16 70286198 70323446 transcript
AARS chr16 70287359 70292118 transcript
AARS2 chr6 44267391 44281063 transcript
I am trying to do that in python
using pandas
. as follow:
data = pd.read_csv("myfile")
df = pd.DataFrame(data)
res = df.groupby([0, 1, 2])
res.to_csv('outfile.txt', index=False)
but it does not return the correct results. do you know how to fix it?