
I have several fairly large dataframes (>1 million rows). In one column are strings of varying lengths. I would like to split these strings into individual characters, with each individual character placed into a column.

I can do this using pd.DataFrame.apply() -- see below -- however it's far too slow to use practically (and it also has a tendency to crash the kernel).

import pandas as pd

df = pd.DataFrame(['AAVFD','TYU?W_Z', 'SomeOtherString', 'ETC.'], columns = ['One'])

print df
               One
0            AAVFD
1          TYU?W_Z
2  SomeOtherString
3             ETC.

Convert strings to lists of varying lengths:

S1 = df.One.apply(list)
print S1
0                                  [A, A, V, F, D]
1                            [T, Y, U, ?, W, _, Z]
2    [S, o, m, e, O, t, h, e, r, S, t, r, i, n, g]
3                                     [E, T, C, .]
Name: One, dtype: object

Put each individual character into a column:

df2 = pd.DataFrame(S1.values.tolist())
print df2
   0  1  2  3     4     5     6     7     8     9    10    11    12    13  \
0  A  A  V  F     D  None  None  None  None  None  None  None  None  None   
1  T  Y  U  ?     W     _     Z  None  None  None  None  None  None  None   
2  S  o  m  e     O     t     h     e     r     S     t     r     i     n   
3  E  T  C  .  None  None  None  None  None  None  None  None  None  None   

     14  
0  None  
1  None  
2     g  
3  None  

Unfortunately, this is quite slow. It seems like I should be able to vectorize this somehow by dealing directly with the numpy array underlying the df.One column. However, when I've tried that, it seems to have difficulty with the fact that the strings vary in length.

Ben

3 Answers


I hardly know pandas but the numpy side of the operation can be done like so (on Python 3; use 'S1' in place of 'U1' on Python 2):

npchrs = df.values.astype(str).view('U1')
# array([['A', 'A', 'V', 'F', 'D', '', '', '', '', '', '', '', '', '', ''],
#        ['T', 'Y', 'U', '?', 'W', '_', 'Z', '', '', '', '', '', '', '', ''],
#        ['S', 'o', 'm', 'e', 'O', 't', 'h', 'e', 'r', 'S', 't', 'r', 'i', 'n', 'g'],
#        ['E', 'T', 'C', '.', '', '', '', '', '', '', '', '', '', '', '']],
#       dtype='<U1')

If you are OK with empty strings instead of Nones, or if replacing them in pandas is easy, you can convert this back to a DataFrame and be done.
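If NaN padding is acceptable instead of None, the conversion back to a DataFrame can be sketched in one step (my condensed variant of the above, assuming Python 3, where strings view as 'U1'):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['AAVFD', 'TYU?W_Z', 'SomeOtherString', 'ETC.'],
                  columns=['One'])

# View the fixed-width unicode array one character at a time, then let
# pandas swap the empty-string padding for NaN.
npchrs = df.One.values.astype(str).view('U1').reshape(len(df), -1)
df2 = pd.DataFrame(npchrs).replace('', np.nan)
```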

According to @COLDSPEED's timings the following step is slow, so it'd be better if you could avoid it. If not:

npobjs = npchrs.astype(object)
npobjs[npobjs==''] = None
# array([['A', 'A', 'V', 'F', 'D', None, None, None, None, None, None, None,
#         None, None, None],
#        ['T', 'Y', 'U', '?', 'W', '_', 'Z', None, None, None, None, None,
#         None, None, None],
#        ['S', 'o', 'm', 'e', 'O', 't', 'h', 'e', 'r', 'S', 't', 'r', 'i', 'n', 'g'],
#        ['E', 'T', 'C', '.', None, None, None, None, None, None, None, None,
#         None, None, None]], dtype=object)
Paul Panzer
  • Hmm, I'm having trouble getting the first bit to work, maybe because I'm using Python 2? I'm getting "ValueError: new type not compatible with array." – Ben Dec 27 '17 at 05:13
  • @Ben Yeah, there was this unicode transition when they moved to Python3. Could you please try whether using 'S1' instead of 'U1' works for you? – Paul Panzer Dec 27 '17 at 05:17
  • Indeed it does. Thanks! – Ben Dec 27 '17 at 05:19
  • So, a py2 solution would be `pd.DataFrame(df.values.astype(str).view('S1')).replace('', np.nan)` – cs95 Dec 27 '17 at 05:49
  • for the curious, it looks like @Paul Panzer's method for breaking up the strings into lists is ~10x faster than using df.apply(). Likewise, putting the new array back into a dataframe is also ~10x faster than the method I have written in my question. Definitely better! – Ben Dec 27 '17 at 05:51
  • @Ben I timed all our solutions, and this happens to be as slow as apply for large data (in fact, it is slower, how odd). – cs95 Dec 27 '17 at 05:55
  • @COLDSPEED That's less good. I timed it using my toy example, not my actual dataset... – Ben Dec 27 '17 at 05:59
  • @Ben Less good indeed. Never time solutions on small datasets. Please see my answer. Disclaimer: Timings may vary on your machine. – cs95 Dec 27 '17 at 06:08
  • @PaulPanzer Sorry, realised my "Option 2" was exactly your answer, so removed it. Not converting to object makes the solution much faster, and you should mention that. – cs95 Dec 27 '17 at 06:20
    Aight, happy holidays :-) – cs95 Dec 27 '17 at 06:29

An alternative using a list comprehension, which I think should be pretty fast -

df = pd.DataFrame([list(x) for x in df.One])
df

  0  1  2  3     4     5     6     7     8     9     10    11    12    13  \
0  A  A  V  F     D  None  None  None  None  None  None  None  None  None   
1  T  Y  U  ?     W     _     Z  None  None  None  None  None  None  None   
2  S  o  m  e     O     t     h     e     r     S     t     r     i     n   
3  E  T  C  .  None  None  None  None  None  None  None  None  None  None   

     14  
0  None  
1  None  
2     g  
3  None  
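A small addition (not part of the original answer): passing the source frame's index keeps the exploded rows aligned with the original, which helps if you later want to join the characters back on:

```python
import pandas as pd

df = pd.DataFrame(['AAVFD', 'TYU?W_Z', 'SomeOtherString', 'ETC.'],
                  columns=['One'])

# pd.DataFrame pads ragged rows with None automatically; reusing the
# original index keeps each row of characters aligned with its source row.
chars = pd.DataFrame([list(x) for x in df.One], index=df.index)
```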

Timings

df = pd.concat([df] * 10000, ignore_index=True)
# original answer
%timeit pd.DataFrame(df.One.apply(list).values.tolist())
10 loops, best of 3: 36.1 ms per loop
# Paul Panzer's answer
%%timeit
npchrs = df.values.astype(str).view('U1')
npobjs = npchrs.astype(object)
npobjs[npobjs==''] = None
pd.DataFrame(npobjs)

10 loops, best of 3: 37.5 ms per loop
# My list comp answer 
%timeit pd.DataFrame([list(x) for x in df.One.values])
10 loops, best of 3: 32.8 ms per loop
# improved version of Paul Panzer's answer
%timeit pd.DataFrame(df.values.astype(str).view('U1'))
10 loops, best of 3: 20.1 ms per loop

Disclaimer - Timings vary based on the data, python version, environment, and OS.

cs95

Here's one approach using string-join, np.fromstring and masking (idea borrowed from this post) -

def join_mask(df):
    lens = np.array([len(i) for i in df.One])
    n = lens.max()
    out = np.full((len(df),n), None)
    out[lens[:,None] > np.arange(n)] = np.fromstring(''.join(df.One), dtype='S1')
    return pd.DataFrame(out)

Sample run -

In [160]: df
Out[160]: 
               One
0            AAVFD
1          TYU?W_Z
2  SomeOtherString
3             ETC.

In [161]: join_mask(df)
Out[161]: 
  0  1  2  3     4     5     6     7     8     9     10    11    12    13    14
0  A  A  V  F     D  None  None  None  None  None  None  None  None  None  None
1  T  Y  U  ?     W     _     Z  None  None  None  None  None  None  None  None
2  S  o  m  e     O     t     h     e     r     S     t     r     i     n     g
3  E  T  C  .  None  None  None  None  None  None  None  None  None  None  None
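On Python 3 with recent NumPy, np.fromstring is deprecated, and ''.join(df.One) yields str rather than bytes; a possible adaptation (my sketch, not from the original answer, assuming ASCII data) uses np.frombuffer on the encoded string:

```python
import numpy as np
import pandas as pd

def join_mask_py3(df):
    # Per-row string lengths and the width of the widest string
    lens = np.array([len(s) for s in df.One])
    n = lens.max()
    # Start from an all-None object array, then scatter the characters
    # into the positions inside each row's length (row-major order
    # matches the order of the concatenated string).
    out = np.full((len(df), n), None)
    mask = lens[:, None] > np.arange(n)
    joined = ''.join(df.One).encode('ascii')   # assumes ASCII data
    out[mask] = np.frombuffer(joined, dtype='S1').astype('U1')
    return pd.DataFrame(out)
```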

Timings

Using @cᴏʟᴅsᴘᴇᴇᴅ's timing setup on the approaches that produce the correct None-filled output df -

In [173]: df = pd.concat([df] * 10000, ignore_index=True)

# original answer
In [175]: %timeit pd.DataFrame(df.One.apply(list).values.tolist())
10 loops, best of 3: 27.2 ms per loop

# @Paul Panzer's answer
In [176]: %%timeit
     ...: npchrs = df.values.astype(str).view('S1')
     ...: npobjs = npchrs.astype(object)
     ...: npobjs[npobjs==''] = None
     ...: pd.DataFrame(npobjs)
10 loops, best of 3: 20.3 ms per loop

# @cᴏʟᴅsᴘᴇᴇᴅ's answer 
In [177]: %timeit pd.DataFrame([list(x) for x in df.One.values])
10 loops, best of 3: 27.6 ms per loop

# Using solution in this post
In [178]: %timeit join_mask(df)
100 loops, best of 3: 13.8 ms per loop
Divakar