2

I need to break a column in a DataFrame that at present collects multiple values (someone else's excel sheet unfortunately) for a categorical data field that can have multiple values.

As you can see below the column has 15 category codes seen in the column header.

Original DataFrame

I want to split the column based on the category codes seen in the column header ['Pamphlet'] and then transform the values collected for each record in the original column to be mapped to there respective new columns as a (1) for checked and (0) for unchecked instead of the raw value [1,2,4,5].

This is the code to split based on , between values but I need to put these into the new columns I need to set up by splitting the column ['Pamphlet'] up by the values in the header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].

'''df_old['Pamphlets'].str.split(pat = ',', n = -1, expand = True)'''

Shape of desired DatFrame

If I could just get an outline of whats the best approach, if it is even possible to do this within Pandas, Thanks.

wwnde
  • 26,119
  • 6
  • 18
  • 32
Eoin Vaughan
  • 121
  • 1
  • 10
  • 3
    Welcome to StackOverflow, please read [how to ask a good question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Quang Hoang May 10 '20 at 00:08
  • 1
    Please [provide a reproducible copy of the DataFrame with `to_clipboard`](https://stackoverflow.com/questions/52413246/provide-a-reproducible-copy-of-the-dataframe-with-to-clipboard/52413247#52413247). [Stack Overflow Discourages Screenshots](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). It is likely the question will be downvoted. You are discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. – Trenton McKinney May 10 '20 at 00:17
  • @TrentonMcKinney Thanks for the advice, I'm new to using this an understand the need to provide valid details for people to help. – Eoin Vaughan May 10 '20 at 02:32

1 Answers1

1

You need to go through your columns one by one and divide the headers, then create a new dataframe for each column made up of split columns, then join all that back to the original dataframe. It's a bit messy but doable.

You need to use a function and some loops to go through the columns.

First lets define the dataframe. (It would be much appreciated if in future questions you supply a replicatable dataframe and any other data.

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)

 1) Mail\n2) Email \n3) At PAC/TPAC  1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0                                   2                                                  5
1                                   1                                                  1
2                                   3                                                  4
3                                   2                                                  4
4                                   3                                                  2
5                                   1                                                  5
6                                   3                                                  1
7                                   2                                                  4
8                                   3                                                  3
9                                   1                                                  2

We will go through the dataframe column by column using a function. For now let's build the column manually for the first column. After we'll turn this next part into a function.

First, let's grab the first column.

s_col = df_full.iloc[:, 0]
print(s_col)

0    2
1    1
2    3
3    2
4    3
5    1
6    3
7    2
8    3
9    1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64

Split the header into individual pieces.

col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']

Clean up any leading or trailing white space.

col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']

Create a new dataframe from series and column heads.

data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
  1) Mail  2) Email  3) At PAC/TPAC
0        2         2               2
1        1         1               1
2        3         3               3
3        2         2               2
4        3         3               3
5        1         1               1
6        3         3               3
7        2         2               2
8        3         3               3
9        1         1               1

Create a copy to make changes to the values.

df_res = df.copy()

Go through the column headers, get the first number, then filter and apply bool.

for col in df.columns:
    value = pd.to_numeric(col[0])
    df_res.loc[df[col] == value, col] = 1
    df_res.loc[df[col] != value, col] = 0

print(df_res)
  1) Mail  2) Email  3) At PAC/TPAC
0        0         1               0
1        1         0               0
2        0         0               1
3        0         1               0
4        0         0               1
5        1         0               0
6        0         0               1
7        0         1               0
8        0         0               1
9        1         0               0

Now we have split a column into its components and assigned a bool value.

Let's step back and make the above a function so we can use it for each column in the original dataframe.

def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")

    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]

    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)

    # Create a copy to make changes to the values.
    df_res = df.copy()

    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0

    return df_res

Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.

for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])

    # Join it with the origianl full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)

print(df_full)
   1) Mail  2) Email  3) At PAC/TPAC  1) ACC  2) IM  3) PT  4) Smoking,  5) Cessation
0        0         1               0       0      0      0            0             1
1        1         0               0       1      0      0            0             0
2        0         0               1       0      0      0            1             0
3        0         1               0       0      0      0            1             0
4        0         0               1       0      1      0            0             0
5        1         0               0       0      0      0            0             1
6        0         0               1       1      0      0            0             0
7        0         1               0       0      0      0            1             0
8        0         0               1       0      0      1            0             0
9        1         0               0       0      1      0            0             0

Here is the full code...

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)


def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")

    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]

    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)

    # Create a copy to make changes to the values.
    df_res = df.copy()

    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0

    return df_res


for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])

    # Join it with the origianl full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)

print(df_full)
run-out
  • 3,114
  • 1
  • 9
  • 25
  • Thanks for this thorough explanation, I'm going to provide original replicatable dataframe going forward. I'm new to using pandas and python so I have an okay foundation of the methods and attributes associated with the pandas objects but I'm not very good at using loops to iterate over many data structures just yet but thanks for this great example. Saved me a lot of manual work. – Eoin Vaughan May 10 '20 at 16:00
  • Just a general mindset for pandas. If you are looping, you are likely doing it wrong. Of course this is a gross generalization and there are lots of applications for looping/iterating rows. However, when you are first starting, try to find the vectorized solution whenever you can and break out of the traditional mindset from programming. Vectorization is crazy fast and efficient. Good luck. – run-out May 10 '20 at 17:47
  • Yes, I've had vectorization in the back of mind when I was trying to sort this out as this approach has been introduced with pandas. Thanks for your help, its such a great library to work with, just dealing with a lot of projects with a legacy of working with excel but are now moving to using a web application to do the primary data collection so pandas is really helping with working with this existing data. – Eoin Vaughan May 10 '20 at 19:15