
I have a pandas DataFrame with a single column of strings, like this:

    commits
0   12, 12, 9, 71, 145, 326, 315
1   54, 23, 265, 160, 164, 142
2   1, 335
3   6, 3, 21, 873
...

The column's dtype is shown below:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238089 entries, 0 to 238088
Data columns (total 1 columns):
commits    238089 non-null object
dtypes: object(1)
memory usage: 1.8+ MB

I would like to split it into separate columns of integer type, like this:

    0    1    2    3     4     5    6  
0   12   12   9    71   145   326   315
1   54   23   265  160  164   142
2   1    335  
3   6    3    21   873
...
  1. That is to say, each number should now be an integer. The order of the values in each row should not change.

  2. In the original dataset, each row has a different number of values. Can the split result also keep rows of different lengths, that is, with no NaN or None filling the empty positions?

  3. If it is not possible to split without None or NaN, what is the easiest way to do it?

  4. The new dataset can be a NumPy array or a DataFrame.

How can I code this in Python? Thanks.

wangmyde
  • Possible duplicate of [How to explode a list inside a Dataframe cell into separate rows](https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows) – gold_cy Feb 12 '19 at 17:08

2 Answers


You could do:

import numpy as np
import pandas as pd

data = ['12, 12, 9, 71, 145, 326, 315',
        '54, 23, 265, 160, 164, 142',
        '1, 335',
        '6, 3, 21, 873']

df = pd.DataFrame(data=data, columns=['commits'])

result = pd.DataFrame([np.array(row) for row in df.commits.str.split(', ')]).fillna('')
print(result)

Output

    0    1    2    3    4    5    6
0  12   12    9   71  145  326  315
1  54   23  265  160  164  142     
2   1  335                         
3   6    3   21  873     

The trick is to convert each list into a NumPy array so that pd.DataFrame pads the jagged rows with missing values, then use fillna to replace those missing values with empty strings. Note that the cells still hold strings, not integers.
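If the goal is to avoid placeholders entirely (point 2 of the question), one option is to skip the rectangular layout and keep each row as its own list of integers. The following is a minimal sketch under that assumption, not part of the answer above; the name `rows` is purely illustrative.

```python
import pandas as pd

data = ['12, 12, 9, 71, 145, 326, 315',
        '54, 23, 265, 160, 164, 142',
        '1, 335',
        '6, 3, 21, 873']

df = pd.DataFrame(data=data, columns=['commits'])

# Split each string on commas and convert every token to int.
# int() ignores surrounding whitespace, so ' 12' parses fine.
# `rows` is a hypothetical name: a plain list of lists, one per row,
# each keeping its own length and original order.
rows = [[int(token) for token in text.split(',')] for text in df['commits']]

print(rows[2])
```

A ragged list of lists like this has no NaN or None, but it is no longer a rectangular table, so column-wise pandas or NumPy operations do not apply to it directly.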

Dani Mesejo

Using str.split with expand=True:

df.commits.str.split(', ', expand=True).fillna('')

    0    1    2    3    4    5    6
0  12   12    9   71  145  326  315
1  54   23  265  160  164  142
2   1  335
3   6    3   21  873

Since you have missing data, not all of your columns can have a plain NumPy integer dtype. The closest you can get with NumPy dtypes is float columns, with the missing values represented as NaN.
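As a rough sketch of the numeric conversion, the split result can be passed column by column through pd.to_numeric, which turns the missing cells into NaN. The second step is an assumption beyond the original answer: pandas 0.24 and later provide a nullable 'Int64' extension dtype that can hold integers alongside missing values. The names `wide`, `as_float`, and `as_nullable` are illustrative only.

```python
import pandas as pd

data = ['12, 12, 9, 71, 145, 326, 315',
        '54, 23, 265, 160, 164, 142',
        '1, 335',
        '6, 3, 21, 873']

df = pd.DataFrame(data=data, columns=['commits'])

# One column per position; shorter rows are padded with missing values.
wide = df['commits'].str.split(', ', expand=True)

# Convert each column to numbers; padded cells become NaN,
# so columns that contain them end up as floats.
as_float = wide.apply(pd.to_numeric)

# Assumption: pandas >= 0.24. The nullable 'Int64' dtype (capital I)
# stores integers and marks the padded cells as missing.
as_nullable = as_float.astype('Int64')

print(as_nullable)
```

Whether the float or the nullable integer form is preferable depends on what the downstream code expects; NumPy-based routines generally work more smoothly with the float version.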

user3483203