0

I have a table that is already grouped according to first column. I would like to split table into sub-tables with only the corresponding second column. I would like to use pandas or something else in python. I am not keen to use "awk" because that will require me to "subprocess" or "os". In the end I actually only need entries in second column separated according to first. The size of the table can be about 10000 rows X 6 columns.

These are similar posts that I found but I could not figure how to modify them for my purpose. Split pandas dataframe based on groupby

Splitting groupby() in pandas into smaller groups and combining them

The table/dataframe that I have looks like this:

P0A910  sp|A0A2C5WRC3|  84.136  0.0        100
P0A910  sp|A0A068Z9R6|  73.816  0.0         99
Q9HVD1  sp|A0A2G2MK84|  37.288  4.03e-34    99
Q9HVD1  sp|A0A1H2GM32|  40.571  6.86e-32    98
P09169  sp|A0A379DR81|  52.848  2.92e-117   99
P09169  sp|A0A127L436|  49.524  2.15e-108   98

And I would like it to be split like the following

group1:

P0A910  A0A2C5WRC3
P0A910  A0A068Z9R6

group2:

Q9HVD1  A0A2G2MK84
Q9HVD1  A0A1H2GM32

group3:

P09169  A0A379DR81
P09169  A0A127L436

OR into lists

P0A910:

A0A2C5WRC3
A0A068Z9R6

Q9HVD1:

A0A2G2MK84
A0A1H2GM32

P09169:

A0A379DR81
A0A127L436
akj
  • 1
  • 2

2 Answers2

0

So your problem is rather to separate the strings. Is it what you want:

new_col = df[1].str[3:-1]
list(new_col.groupby(df[0]))
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • in the above code, Line#1 removes the characters `sp| |`, and Line#2 lists the entire second column. `0 A0A2C5WRC3 1 A0A068Z9R6 2 A0A2G2MK84 ........ ` – akj Jun 18 '19 at 20:40
0

So I managed to get a solution of some sort. In this solution I managed to remove prefixes in the second and use groupby in pandas to group the entries by first column. Then, looped through it and wrote each group separately to csv files. I took help from @Quang 's answer and this link. It could probably be done in better ways but here is my code:

import pandas as pd

#read .csv as dataframe
data=pd.read_csv("BlastOut.csv")

#truncates sp| | from second column (['B']).
new_col=data['B'].str[3:-1]

#replaces second column with new_col
data['B']=new_col.to_frame(name=None) 

#groups dataframe by first column (['A'])
grouped=data.groupby('A')             

#loops through grouped items and writes each group to .csv file with title 
#of group ([group_name].csv)
for group_name, group in grouped:
    group.to_csv('Out_{}.csv'.format(group_name))  

Update- removed all columns except column of interest. This is a continuation to the previous code

import glob

#reads all csv files starting with "Out_" in filename
files=glob.glob("Out_*.csv")

#loop through all csv files 
for f in files:
     df=pd.read_csv(f, index_col=0)

     # Drop columns by column title (["A"])
     df.drop(["A"], axis=1, inplace=True)

     df.to_csv(f,index=False)
akj
  • 1
  • 2