I have several fairly large DataFrames (>1 million rows each). One of the columns holds strings of varying lengths. I would like to split these strings into individual characters, with each character placed into its own column.
I can do this using pd.Series.apply() -- see below -- but it's far too slow to use in practice (and it also has a tendency to crash the kernel).
import pandas as pd
df = pd.DataFrame(['AAVFD', 'TYU?W_Z', 'SomeOtherString', 'ETC.'], columns=['One'])
print(df)
               One
0            AAVFD
1          TYU?W_Z
2  SomeOtherString
3             ETC.
Convert strings to lists of varying lengths:
S1 = df.One.apply(list)
print(S1)
0                                  [A, A, V, F, D]
1                            [T, Y, U, ?, W, _, Z]
2    [S, o, m, e, O, t, h, e, r, S, t, r, i, n, g]
3                                     [E, T, C, .]
Name: One, dtype: object
Put each individual character into a column:
df2 = pd.DataFrame(S1.values.tolist())
print(df2)
   0  1  2  3     4     5     6     7     8     9    10    11    12    13  \
0  A  A  V  F     D  None  None  None  None  None  None  None  None  None
1  T  Y  U  ?     W     _     Z  None  None  None  None  None  None  None
2  S  o  m  e     O     t     h     e     r     S     t     r     i     n
3  E  T  C  .  None  None  None  None  None  None  None  None  None  None

     14
0  None
1  None
2     g
3  None
Unfortunately, this is quite slow. It seems like I should be able to vectorize it somehow by working directly with the NumPy array underlying the df.One column. However, when I've tried that, it seems to run into trouble with the fact that the strings vary in length.
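For reference, here is a rough sketch of the kind of NumPy approach I have in mind (continuing from the df defined above; untested at scale, and I'm not sure the padding behavior is acceptable). Casting the column to a fixed-width unicode dtype pads the shorter strings with nulls, and that padded buffer can then be reinterpreted as single characters with view():

# astype(str) yields a fixed-width unicode array (here '<U15'),
# null-padding the shorter strings so every element has the same width
arr = df.One.values.astype(str)
# reinterpret the fixed-width strings as single characters ('<U1'),
# then reshape so each original string becomes one row of characters
chars = arr.view('U1').reshape(len(arr), -1)
df2 = pd.DataFrame(chars)
# note: the padding shows up as empty strings rather than None

This avoids the per-row Python loop, but I haven't verified it's correct for all inputs. Is something along these lines the right way to go, or is there a better vectorized approach?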