
I have a dataframe

import numpy as np
import pandas as pd

df = pd.DataFrame({'≤8': {1: '3687 55.5', 2: '838 66.5', 3: '8905 66.9'},
                   '9–13': {1: '2234 33.6', 2: '419 33.3', 3: '3362 25.2'},
                   '14–15': {1: '290 4.4', 2: np.nan, 3: '473 3.6'},
                   '16–17': {1: '194 2.9', 2: np.nan, 3: '252 1.9'},
                   '18–20': {1: '185 2.8', 2: np.nan, 3: '184 1.4'},
                   '≥21': {1: '52 0.8', 2: '0 0.0', 3: '144 1.1'}})

          ≤8       9–13    14–15    16–17    18–20      ≥21
1  3687 55.5  2234 33.6  290 4.4  194 2.9  185 2.8   52 0.8
2   838 66.5   419 33.3      NaN      NaN      NaN    0 0.0
3  8905 66.9  3362 25.2  473 3.6  252 1.9  184 1.4  144 1.1

I want to split all the columns into two columns each, so that there is an int column and a float column (note, I don't care to separate by type; the types are just coincidental). I got as far as splitting the values, but I can't figure out how to assign the resulting lists to new columns. I also want to keep it as pythonic/pandonic as possible, so I don't want to loop over each column individually.

rev = gestation_cols.apply(lambda x: pd.Series([i for i in x.str.split(' ')]))

             ≤8          9–13       14–15       16–17       18–20         ≥21
0  [3687, 55.5]  [2234, 33.6]  [290, 4.4]  [194, 2.9]  [185, 2.8]   [52, 0.8]
1   [838, 66.5]   [419, 33.3]         NaN         NaN         NaN    [0, 0.0]
2  [8905, 66.9]  [3362, 25.2]  [473, 3.6]  [252, 1.9]  [184, 1.4]  [144, 1.1]
3  [1559, 48.6]  [1075, 33.5]  [209, 6.5]  [165, 5.1]  [173, 5.4]   [26, 0.8]

edit: For clarity, I do not want to split a single column or split each column individually. I know I could create the new columns one by one, but that's simply bad practice. I want each and every column to be split into two columns.

rangeseeker
  • Check out this question: https://stackoverflow.com/questions/37333299/splitting-a-pandas-dataframe-column-by-delimiter – D.Manasreh May 18 '22 at 20:20
  • In .str.split() you can use `expand=True` – D.Manasreh May 18 '22 at 20:21
  • @D.Manasreh for some reason the expand option does not work in this context. Thanks for the link to the question, but in that question they define the columns one by one. I want to create all of the columns at once. – rangeseeker May 18 '22 at 20:46
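For context, the .str accessor is only defined on Series and Index, not on a whole DataFrame, which is why expand=True cannot simply be applied to the frame; a minimal illustration:

# df.str.split(expand=True)  # AttributeError: 'DataFrame' object has no attribute 'str'

# it does work on a single column (a Series):
df['≤8'].str.split(expand=True)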

2 Answers


For a vectorized version you can temporarily stack and use str.split:

df.stack().str.split().unstack()

Output:

             ≤8          9–13       14–15       16–17       18–20         ≥21
1  [3687, 55.5]  [2234, 33.6]  [290, 4.4]  [194, 2.9]  [185, 2.8]   [52, 0.8]
2   [838, 66.5]   [419, 33.3]         NaN         NaN         NaN    [0, 0.0]
3  [8905, 66.9]  [3362, 25.2]  [473, 3.6]  [252, 1.9]  [184, 1.4]  [144, 1.1]

To get multiple columns, use the expand=True parameter; you will end up with a MultiIndex in the columns that you can then rework as you see fit:

df.stack().str.split(expand=True).unstack()

Output:

      0                                  1                             
     ≤8  9–13 14–15 16–17 18–20  ≥21    ≤8  9–13 14–15 16–17 18–20  ≥21
1  3687  2234   290   194   185   52  55.5  33.6   4.4   2.9   2.8  0.8
2   838   419   NaN   NaN   NaN    0  66.5  33.3   NaN   NaN   NaN  0.0
3  8905  3362   473   252   184  144  66.9  25.2   3.6   1.9   1.4  1.1

Reorganizing the column order:

(df.stack()
   .str.split(expand=True)
   .unstack()
   .swaplevel(axis=1)
   [df.columns]
)

Output:

     ≤8        9–13       14–15      16–17      18–20       ≥21     
      0     1     0     1     0    1     0    1     0    1    0    1
1  3687  55.5  2234  33.6   290  4.4   194  2.9   185  2.8   52  0.8
2   838  66.5   419  33.3   NaN  NaN   NaN  NaN   NaN  NaN    0  0.0
3  8905  66.9  3362  25.2   473  3.6   252  1.9   184  1.4  144  1.1
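If you then want flat column names and numeric dtypes, one possible follow-up (a sketch, not part of the original answer; the _count/_pct suffixes are only illustrative) is to flatten the MultiIndex and convert:

out = (df.stack()
         .str.split(expand=True)
         .unstack()
         .swaplevel(axis=1)
         [df.columns]
      )
# flatten ('≤8', 0)/('≤8', 1) into '≤8_count'/'≤8_pct' and cast to numbers
out.columns = [f'{col}_{"count" if i == 0 else "pct"}' for col, i in out.columns]
out = out.apply(pd.to_numeric)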
mozway
  • Absolutely awesome, thank you. Is it 100% necessary to use stack? Like, is there another way to achieve the same outcome? It just seems a little obtuse to me to reorganize the data frame like that only to undo it. – rangeseeker May 18 '22 at 20:47
  • Unfortunately yes, this is to have a Series for `str.split`. But it's not too hard to reorganize. I'll provide an example. – mozway May 18 '22 at 20:50

It should be faster if you iterate through the columns (iterating over columns is fine, unlike looping over rows):

pd.concat({key : value.str.split(expand = True) 
           for key, value in df.items()}, 
           axis = 1)

     ≤8        9–13       14–15      16–17      18–20       ≥21
      0     1     0     1     0    1     0    1     0    1    0    1
1  3687  55.5  2234  33.6   290  4.4   194  2.9   185  2.8   52  0.8
2   838  66.5   419  33.3   NaN  NaN   NaN  NaN   NaN  NaN    0  0.0
3  8905  66.9  3362  25.2   473  3.6   252  1.9   184  1.4  144  1.1

Of course, only benchmarks can tell; it feels a bit more expensive to flip to long form (stack) and then back to wide (unstack) when you can run the process within the wide form.
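For instance, a rough comparison with the standard-library timeit module (a sketch on the sample frame; real timings will depend on the data size) could look like this:

import timeit

stack_way = lambda: (df.stack().str.split(expand=True)
                       .unstack().swaplevel(axis=1)[df.columns])
concat_way = lambda: pd.concat({key: value.str.split(expand=True)
                                for key, value in df.items()}, axis=1)

# time each approach a few hundred times
print('stack/unstack    :', timeit.timeit(stack_way, number=200))
print('per-column concat:', timeit.timeit(concat_way, number=200))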

sammywemmy