I need to update a DataFrame column with strings at selected rows, whose indices I already have. So far, I have managed to do this with a list comprehension:

[data.particleIDs.values[idx[i]].append(particlenames[i]) for i in range(len(idx))]

where data.particleIDs is the DataFrame column that needs to be updated, particlenames is a list containing the strings, and idx is an array giving, for each string, the DataFrame row it should be written to. Several strings can correspond to the same row, and I need to write all of them to that row of the column.

Let's say I have a DataFrame and the list of strings that I use to update it:

import pandas as pd

data = pd.DataFrame({'particleIDs': [[] for i in range(20)]})
particlenames = ['c15001' + str(i) for i in range(10)]

I have 10 strings and I need to use them to update the rows [7 8 15 8 11 0 15 1 12 8] in my DataFrame, i.e. I need to add each string to the corresponding row.
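Put together, the loop-based approach described above could look like the sketch below. It assumes idx holds exactly the row numbers listed above, and it uses a plain for loop instead of a list comprehension, since the loop is only run for its side effects.

```python
import pandas as pd

data = pd.DataFrame({'particleIDs': [[] for i in range(20)]})
particlenames = ['c15001' + str(i) for i in range(10)]
idx = [7, 8, 15, 8, 11, 0, 15, 1, 12, 8]  # target row for each string

# Each cell holds a Python list, so append() mutates that list in place.
cells = data['particleIDs'].values
for row, name in zip(idx, particlenames):
    cells[row].append(name)

print(data.loc[8, 'particleIDs'])
```

Rows that receive no string keep their empty list.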

The loop is terribly slow, because the actual particlenames list is long and I need to repeat this process several times.

Is there anything I can do to speed this up?

Thank you!

fede_

1 Answer

I solved my problem by creating another dataframe for the strings and the corresponding indices:

df_strings = pd.DataFrame({'strings':particlenames,'rows':[7, 8, 15, 8, 11, 0, 15, 1, 12, 8]})

and then by using the groupby method on rows to append the strings with apply(list):

df_strings=df_strings.groupby('rows')['strings'].apply(list).reset_index()   
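As a self-contained sketch of this grouping step, using the example strings and row numbers from the question (the name grouped is introduced here only for illustration):

```python
import pandas as pd

particlenames = ['c15001' + str(i) for i in range(10)]
rows = [7, 8, 15, 8, 11, 0, 15, 1, 12, 8]

df_strings = pd.DataFrame({'strings': particlenames, 'rows': rows})

# Collect all strings that share a row number into one list per row.
grouped = df_strings.groupby('rows')['strings'].apply(list).reset_index()

print(grouped)
```

Only row numbers that actually occur in rows appear in the result, so here it has 7 rows rather than 20.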

Finally, I join this new DataFrame with the one (data) that needs to be updated with the strings:

data=data.join(df_strings.set_index('rows'))

data=

    particleIDs  strings
0   []           [c150015]
1   []           [c150017]
2   []           NaN
3   []           NaN
4   []           NaN
5   []           NaN
6   []           NaN
7   []           [c150010]
8   []           [c150011, c150013, c150019]
9   []           NaN
10  []           NaN
11  []           [c150014]
12  []           [c150018]
13  []           NaN
14  []           NaN
15  []           [c150012, c150016]
16  []           NaN
17  []           NaN
18  []           NaN
19  []           NaN

This means I no longer need to create the particleIDs column when building the data DataFrame (which, in my real case, has other columns), since the joined strings column contains the information I need.
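A minimal end-to-end sketch of this variant, without the particleIDs column, might look as follows. The column other is a hypothetical stand-in for the real columns, and the last step, which replaces NaN with empty lists, is an optional addition that is only needed if every row should hold a list.

```python
import pandas as pd

particlenames = ['c15001' + str(i) for i in range(10)]
rows = [7, 8, 15, 8, 11, 0, 15, 1, 12, 8]

# 'other' is a hypothetical placeholder for the real columns of data.
data = pd.DataFrame({'other': range(20)})

df_strings = pd.DataFrame({'strings': particlenames, 'rows': rows})
df_strings = df_strings.groupby('rows')['strings'].apply(list).reset_index()

# Align the grouped lists with data on the row index.
data = data.join(df_strings.set_index('rows'))

# Optional: rows without any string hold NaN; turn those into empty lists.
data['strings'] = data['strings'].apply(
    lambda v: v if isinstance(v, list) else []
)

print(data)
```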

fede_