0

I have a pandas dataframe that looks like:

word | ss |
ABC  | E  |
DHS  | H  |
ABC  | H  |
ABC  | C  |
ABC  | H  |
SJA  | C  |
SJA  | H  |

But I'm trying to making something like:

word | H | E | C
ABC  | 2 | 1 | 1
DHS  | 1 | 0 | 0
SJA  | 1 | 0 | 1

What I've done so far:

helix = []
sheet = []
coil = []
for ss in big_ss_list: # this is the ss column​ in a list form.
    counter = counter + 1
    print ('position', counter, 'of', len(big_ss_list))
    if ss == 'H':
        sheet = sheet + [0]
        helix = helix + [1]
        coil = coil + [0]

    elif ss == 'E':
        sheet = sheet + [1]
        helix = helix + [0]
        coil = coil + [0]

    elif ss == 'C':
        sheet = sheet + [0]
        helix = helix + [0]
        coil = coil + [1]
fdf = pd.DataFrame(list(zip(big_gram_list, big_ss_list, helix, sheet, coil)), columns =['word', 'secondary_structure', 'helix', 'sheet', 'coil'])
fdf = fdf.iloc[(fdf.word).argsort()] # sort based on alphabetically order
fdf = fdf.groupby('word',as_index=False).sum()

But this process is long because there are ~37mil rows and takes significant time to run.

Is there any quicker way?

Haj Sai
  • 291
  • 3
  • 13

0 Answers0