1

I'm trying to use Pandas to read an excel file from a survey result sheet (on the rows there are the participants), but I got many variables split into multiple columns, like this

>>> df.columns
Index([ ... , 'Age', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16', ...], dtype='object', length=256)

where each unnamed column after 'Age' and the before the next named column contains only the values of the Age variable corresponding to a single choice from that multiple choice question. How do I get all the Age values under the same column?

Edit: example of the output of df.head(5).to_dict():

{...,
'Gender': {0: 'M', 1: 'M', 2: 'M', 3: nan, 4: nan},
 'Unnamed: 10': {0: 'F', 1: nan, 2: nan, 3: 'F', 4: 'F'},
 'Age': {0: 25.0, 1: nan, 2: 25.0, 3: nan, 4: nan},
 'Unnamed: 12': {0: 26.0, 1: nan, 2: nan, 3: 26.0, 4: nan},
 'Unnamed: 13': {0: 27.0, 1: nan, 2: nan, 3: nan, 4: nan},
 'Unnamed: 14': {0: 28.0, 1: nan, 2: nan, 3: nan, 4: 28.0},
 'Unnamed: 15': {0: 29.0, 1: nan, 2: nan, 3: nan, 4: nan},
 'Unnamed: 16': {0: 30.0, 1: nan, 2: nan, 3: nan, 4: nan},
...}
garpez
  • 13
  • 4
  • 1
    Could you include an example of the Excel file you are reading in? That would be helpful to understand the issue better. – whege Oct 02 '20 at 16:28
  • Is it all in a single line like that? If its not tabular then that would be an issue. A screenshot of your Excel worksheet would be nice if possible. – whege Oct 02 '20 at 16:34
  • Yeah, it's tabular, sorry... I'll take a screenshot – garpez Oct 02 '20 at 16:35
  • There it is: https://imgur.com/a/2mqsY1I – garpez Oct 02 '20 at 16:39
  • @garpez see that image you have, copy and paste those cells, and paste it in your question then format as code. or do `print(df.head(5).to_dict())` and paste that into your question and format as code. – Umar.H Oct 02 '20 at 17:09

2 Answers2

2

Step one, let's remove the Unnamed: columns, then forward fill the values:

df.columns = df.columns.to_series().replace('Unnamed:\s\d+',np.nan,regex=True).ffill().values

print(df)

  Gender Gender   Age   Age   Age   Age   Age   Age
0      M      F  25.0  26.0  27.0  28.0  29.0  30.0
1      M    NaN   NaN   NaN   NaN   NaN   NaN   NaN
2      M    NaN  25.0   NaN   NaN   NaN   NaN   NaN
3    NaN      F   NaN  26.0   NaN   NaN   NaN   NaN
4    NaN      F   NaN   NaN   NaN  28.0   NaN   NaN

Then we can reshape your dataframe and create a new index so we can unstack:

s = df.T.agg(list,1).explode().dropna().to_frame()

df1 = s.set_index(s.groupby(level=0).cumcount(),append=True).unstack(0)


print(df1)

  Age Gender
0  25      M
1  25      M
2  26      M
3  26      F
4  27      F
5  28      F
6  28    NaN
7  29    NaN
8  30    NaN

Another method would be to create a multi index of your columns, which is better as you keep your original indices:

df.columns = df.columns.to_series()\
               .replace('Unnamed:\s\d+',np.nan,regex=True).ffill().values
df.columns = pd.MultiIndex.from_tuples([(x,y)for x,y in 
                 zip(df.columns,df.columns.to_series().groupby(level=0).cumcount())])


print(df)

  Gender        Age                              
       0    1     0     1     2     3     4     5
0      M    F  25.0  26.0  27.0  28.0  29.0  30.0
1      M  NaN   NaN   NaN   NaN   NaN   NaN   NaN
2      M  NaN  25.0   NaN   NaN   NaN   NaN   NaN
3    NaN    F   NaN  26.0   NaN   NaN   NaN   NaN
4    NaN    F   NaN   NaN   NaN  28.0   NaN   NaN


print(df.stack(1))

      Age Gender
0 0  25.0      M
  1  26.0      F
  2  27.0    NaN
  3  28.0    NaN
  4  29.0    NaN
  5  30.0    NaN
1 0   NaN      M
2 0  25.0      M
3 1  26.0      F
4 1   NaN      F
  3  28.0    NaN
halfer
  • 19,824
  • 17
  • 99
  • 186
Umar.H
  • 22,559
  • 7
  • 39
  • 74
0

This solution is a little ugly, but it should work. Essentially, you subset the dataframe to pick out all of the columns associated with a particular question. Second, you use a function to pick out the first value in each row which is not NaN.

df = df.drop([0])  # Drop first row, contains column headings

# This function treats each row as a Series. It then gets the value
# of the first defined cell, and returns it. Or, if the row is all
# None, it returns None.
def get_first_valid_from_row(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

new_df = pd.DataFrame()

# Get gender-related columns
gender_subset_df = df[["Gender", "Unnamed: 10"]]
new_df["Gender"] = gender_subset_df.apply(get_first_valid_from_row, axis=1)

# Get age-related columns
age_subset_df = df[["Age", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14", "Unnamed: 15", "Unnamed: 16"]]
new_df["Age"] = age_subset_df.apply(get_first_valid_from_row, axis=1)
print(new_df)

I get the following result:

  Gender   Age
1      M   NaN
2      M  25.0
3      F  26.0
4      F  28.0

This has a invalid age for the first row, but looking at your original data, it appears that the first person in your dataset did not select any choice for age.

Credit to @EdChum for this approach.

Nick ODell
  • 15,465
  • 3
  • 32
  • 66
  • Thank you @Nick ODell and @EdChum! This works, although it is a bit time consuming, since I have to do this for more than 20 variables (~300 unnamed columns) – garpez Oct 02 '20 at 22:32
  • 1
    if you rename the columns and forward fill you can do `print(df.groupby(level=0,axis=1).first())` to get the same result. – Umar.H Oct 03 '20 at 00:51