
Assume that we have this DataFrame in pandas:

import pandas as pd
arr = pd.DataFrame(['aabbc','aabccca','aa'])

I want to split each row into columns, one character per column. The rows may differ in length. This is the output I expect (a 3×7 matrix in this case):

  1   2   3   4   5   6   7
1 a   a   b   b   c   Na  Na
2 a   a   b   c   c   c   a
3 a   a   Na  Na  Na  Na  Na

My data has 20,000 rows, so I would prefer not to use for loops. The original data are protein sequences. I have read [1], [2], [3], etc., but they did not help me.

cs95
Hadij

1 Answer


Option 1
One simple way to do this is using a list comprehension.

pd.DataFrame([list(x) for x in arr[0]])

   0  1     2     3     4     5     6
0  a  a     b     b     c  None  None
1  a  a     b     c     c     c     a
2  a  a  None  None  None  None  None

Alternatively, use apply(list), which does the same thing.

pd.DataFrame(arr[0].apply(list).tolist())

   0  1     2     3     4     5     6
0  a  a     b     b     c  None  None
1  a  a     b     c     c     c     a
2  a  a  None  None  None  None  None
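
The expected output in the question uses 1-based row and column labels, while the result above is 0-based. A minimal sketch of shifting the labels, assuming the default integer index and columns (the name out is only for illustration):

```python
import pandas as pd

arr = pd.DataFrame(['aabbc', 'aabccca', 'aa'])

# Option 1: one list of characters per row; shorter rows are padded with missing values
out = pd.DataFrame([list(x) for x in arr[0]])

# Shift the default 0-based labels so that rows and columns start at 1
out.index = out.index + 1
out.columns = out.columns + 1
```

After this, out.loc[1, 5] is 'c' and out.loc[3, 3] is a missing value.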

Option 2
An alternative with extractall + unstack. extractall returns one row per matched character, and unstack pivots the matches into columns. You'll end up with a MultiIndex on the columns, so drop its first level.

v = arr[0].str.extractall(r'(\w)').unstack()
v.columns = v.columns.droplevel(0)

v

match  0  1     2     3     4     5     6
0      a  a     b     b     c  None  None
1      a  a     b     c     c     c     a
2      a  a  None  None  None  None  None
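
One thing to watch with this option: \w matches only word characters, so any other symbol is skipped and the characters after it shift left. A small sketch of the difference, assuming your sequences may contain symbols such as '-' or '*' (the seqs data below is made up for illustration):

```python
import pandas as pd

# Hypothetical sequences containing non-word symbols
seqs = pd.Series(['ab-c', 'a*'])

# r'(\w)' skips '-' and '*', so positions after them shift left
w = seqs.str.extractall(r'(\w)').unstack()
w.columns = w.columns.droplevel(0)

# r'(.)' matches any single character except a newline, keeping positions aligned
v = seqs.str.extractall(r'(.)').unstack()
v.columns = v.columns.droplevel(0)
```

Here w has 3 columns while v has 4, with '-' kept in column 2 of the first row. If your sequences are purely alphabetic, both patterns give the same result.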

Option 3
Reinterpret the underlying fixed-width NumPy string array as single characters using view -

v = arr[0].values.astype(str)
pd.DataFrame(v.view('U1').reshape(v.shape[0], -1))

   0  1  2  3  4  5  6
0  a  a  b  b  c      
1  a  a  b  c  c  c  a
2  a  a       

This gives you empty strings ('') instead of None in the padded cells. Use replace if you want missing values back.
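
A minimal sketch of that replace step, assuming NaN is an acceptable missing-value marker (the names chars and result are only for illustration):

```python
import numpy as np
import pandas as pd

arr = pd.DataFrame(['aabbc', 'aabccca', 'aa'])

# Fixed-width unicode array; shorter strings are padded to the longest length
v = arr[0].values.astype(str)

# View each string as single characters and reshape to one row per string
chars = pd.DataFrame(v.view('U1').reshape(v.shape[0], -1))

# Turn the empty padding cells into missing values
result = chars.replace('', np.nan)
```

Note that this also turns any genuinely empty input string into a row of missing values, which is usually what you want here.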

cs95