How to split a dataframe column into multiple columns

Question

After much prodding I am starting migrating my R scripts to Python. Most of my work in R involved data frames, and I am using the DataFrame object from the pandas package. In my script I need to read in a csv file and import the data into a DataFrame object. Next I need to convert the hex values into a column labelled DATA into bitwise data, and then create 16 new columns, one for each bit.

My example input data in file test.txt looks as follows,

PREFIX,TEST,ZONE,ROW,COL,DATA

6_6,READ,0, 0, 0,BFED

6_6,READ,0, 1, 0,BB7D

6_6,READ,0, 2, 0,FFF7

6_6,READ,0, 3, 0,E7FF

6_6,READ,0, 4, 0,FBF8

6_6,READ,0, 5, 0,DE75

6_6,READ,0, 6, 0,DFFE

My python script test.py is as follows,

import glob

import pandas as pd

import numpy as np

fname = 'test.txt'

df = pd.read_csv(fname, comment="#")

dfs = df[df.TEST == 'READ']

# function to convert the hexstring into a binary string

def hex2bin(hstr):

    return bin(int(hstr,16))[2:]


# convert the hexstring in column DATA to binarystring ROWDATA

dfs['BINDATA'] = dfs['DATA'].apply(hex2bin)

# get rid of the column DATA

del dfs['DATA']

When I run this script, and inspect the object dfs, I get the following,

PREFIX TEST ZONE ROW COL BINDATA

0 6_6 READ 0 0 0 1011111111101101

1 6_6 READ 0 1 0 1011101101111101

2 6_6 READ 0 2 0 1111111111110111

3 6_6 READ 0 3 0 1110011111111111

4 6_6 READ 0 4 0 1111101111111000

5 6_6 READ 0 5 0 1101111001110101

6 6_6 READ 0 6 0 1101111111111110

So now I am not sure how to split the column named BINDATA into 16 new columns (could be named B0, B0, B2, ...., B15). Any help will be appreciated.

Thanks & Regards,

Derric.

What did you come up with when trying to split the `BINDATA` column? — Don Cruickshank, Aug 03 '13 at 15:45

joris · Accepted Answer · 2013-08-03T16:32:04.743

I don't know if it can be done simpler (without the for loop), but this does the trick:

for i in range(16):
    dfs['B'+str(i)] = dfs['BINDATA'].str[i]

The str attribute of the Series gives access to some vectorized string methods which act upon each element (see docs: http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods). In this case we just index the string to acces the different characters.
This gives me:

In [20]: dfs
Out[20]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0

If you want them as ints instead of strings, you can add .astype(int) in the for loop.

EDIT: Another way to do it (a oneliner, but you have to change the column names in a second step):

In [34]: splitted = dfs['BINDATA'].apply(lambda x: pd.Series(list(x)))

In [35]: splitted.columns = ['B'+str(x) for x in splitted.columns]

In [36]: dfs.join(splitted)
Out[36]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0

pd.Series makes a series from every row/element of BINDATA. What is the logic of how 'splitted' turns out to be a dataframe? Thanks. — julieth, Aug 03 '13 at 18:33
`apply` on a series returns a DataFrame if the function wich is applied returns a Series for each element of the series, where the different elements of the returned Series become the values of the different columns of one row. So in this case first the `list` function converts the string in BINDATA to a list, which is then converted to a Series (see also the answer of @cpcloud, which does actually the same but is written a little bit different) — joris, Aug 03 '13 at 22:40

Phillip Cloud · Answer 2 · 2013-08-03T16:37:26.700

Here's how you can do this without a loop (but not really, since there's a lot of implicit looping in this code):

import pandas as pd

# read the above frame from the clipboard
df = pd.read_clipboard(converters={'BINDATA': str})
df = df.fillna(nan).replace('None', nan).dropna(axis=0, how='all')

# here are the lines that matter
bindata = df.BINDATA.apply(list).apply(Series)
bindata.columns = bindata.columns.map('B{0}'.format)
res = pd.concat([df, bindata], axis=1).convert_objects(convert_numeric=True)

How to split a dataframe column into multiple columns

2 Answers2

Linked