I have a pandas dataframe that looks like:
word | ss |
ABC | E |
DHS | H |
ABC | H |
ABC | C |
ABC | H |
SJA | C |
SJA | H |
But I'm trying to making something like:
word | H | E | C
ABC | 2 | 1 | 1
DHS | 1 | 0 | 0
SJA | 1 | 0 | 1
What I've done so far:
helix = []
sheet = []
coil = []
for ss in big_ss_list: # this is the ss column in a list form.
counter = counter + 1
print ('position', counter, 'of', len(big_ss_list))
if ss == 'H':
sheet = sheet + [0]
helix = helix + [1]
coil = coil + [0]
elif ss == 'E':
sheet = sheet + [1]
helix = helix + [0]
coil = coil + [0]
elif ss == 'C':
sheet = sheet + [0]
helix = helix + [0]
coil = coil + [1]
fdf = pd.DataFrame(list(zip(big_gram_list, big_ss_list, helix, sheet, coil)), columns =['word', 'secondary_structure', 'helix', 'sheet', 'coil'])
fdf = fdf.iloc[(fdf.word).argsort()] # sort based on alphabetically order
fdf = fdf.groupby('word',as_index=False).sum()
But this process is long because there are ~37mil rows and takes significant time to run.
Is there any quicker way?