1

I have a large spreadsheet which I have imported using pandas.

The first x set of columns can be considered necessary for every row. The next y set of columns are actually just 9 repeated column headers in 7 groups, but with different values for each row. I want to try and create a new dataframe, which separates out each of those repeated column headers and joins them with the necessary x columns. I have tried many ways, but I cannot get it to work.

As a simple example, lets say we have the following dataframe:

ID Name Category Year Choice Price Year Choice Price Year Choice Price
1 Test AM 2022 A 10 2022 B 11 2023 B 11.5

This should be transformed into this format:

ID Name Category Year Choice Price
1 Test AM 2022 A 10
1 Test AM 2022 B 11
1 Test AM 2023 B 11.5

I feel like this should be a whole lot easier than it has been. Please help out.

NOTE: There are actually 18 necessary headers, followed by 7 groups of 9 headers. The 7 groups all have the same header name (which is possibly why my efforts to use .drop and .iloc have not worked properly. I am a python beginner.

Latest attempt was to create a new copy of the df for each of the blocks. This produces an error

ValueError: operands could not be broadcast together with shapes (9,) (44,)

new_dfA = dfA.drop(dfA.columns[27:], axis=1).reset_index(drop=True)
new_dfB = dfB.drop(dfB.columns[18:27] + list(df.columns[37:]), axis=1).reset_index(drop=True)
new_dfC = dfC.drop(dfC.columns[18:36] + list(df.columns[46:]), axis=1).reset_index(drop=True)
new_dfD = dfD.drop(dfD.columns[18:45] + list(df.columns[55:]), axis=1).reset_index(drop=True)
new_dfE = dfE.drop(dfE.columns[18:54] + list(df.columns[64:]), axis=1).reset_index(drop=True)
new_dfF = dfF.drop(dfF.columns[18:63] + list(df.columns[72:]), axis=1).reset_index(drop=True)
new_dfG = dfG.drop(dfG.columns[18:72] + list(df.columns[80:]), axis=1).reset_index(drop=True)
Marcelo Paco
  • 2,732
  • 4
  • 9
  • 26
  • 1
    Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Mar 23 '23 at 20:22

1 Answers1

1

You can use pd.wide_to_long() to achieve this. The only caveat is that you have to rename your df columns Year, Choice and Price to the following:

   ID  Name Category  Year.0 Choice.0  Price.0  Year.1 Choice.1  Price.1  Year.2 Choice.2  Price.2
0   1  Test       AM    2022        A       10    2022        B       11    2023        B     11.5

Once that is done you can do the following:

df = pd.wide_to_long(df, stubnames=['Year', 'Choice', 'Price'], i=['ID', 'Name', 'Category'], j='dropme', sep='.')\
.reset_index()\
.drop('dropme', axis=1)\
.sort_values(["ID", "Name", "Category"])

Now df should look like:

   ID  Name Category  Year Choice  Price
0   1  Test       AM  2022      A   10.0
1   1  Test       AM  2022      B   11.0
2   1  Test       AM  2023      B   11.5
Marcelo Paco
  • 2,732
  • 4
  • 9
  • 26
  • That is superb, thank you. It took some fiddling with to get the column headers to rename correctly, but it is all done now using this code: df.columns.values[18:27] = [f'{col}.0' for col in df.columns.values[18:27]] – Jon Garnett-Smith Mar 24 '23 at 09:19