3

I have a dataframe (totaldf) such that:

           ...     Hom   ...    March Plans   March Ships   April Plans   April Ships   ...

0                  CAD   ...    12              5           4             13
1                  USA   ...    7               6           2             11
2                  CAD   ...    4               9           6             14
3                  CAD   ...    13              3           9             7
...                ...   ...    ...             ...         ...           ...

for all months of the year. I would like it to be:

           ...     Hom   ...    Month   Plans    Ships    ...

0                  CAD   ...    March    12          5             
1                  USA   ...    March    7           6             
2                  CAD   ...    March    4           9             
3                  CAD   ...    March    13          3
4                  CAD   ...    April    4           13            
5                  USA   ...    April    2           11             
6                  CAD   ...    April    6           14             
7                  CAD   ...    April    9           7
...                ...   ...    ...      ...         ...

Is there an easy way to do this without splitting string entries? I have played around with totaldf.unstack() but since there are multiple columns I'm unsure as to how to properly reindex the dataframe.

poorpractice
  • 127
  • 10
  • 2
    Provide a [MCVE](https://stackoverflow.com/help/mcve) question. – meW Feb 21 '19 at 19:08
  • @meW I thought that I did that. Could you explain why this doesn't meet the criteria so I may correct it? Edit: I tried to format my question exactly like this question: https://stackoverflow.com/questions/18259067/unpivot-pandas-data and I see no difference in the quality or quantity of information. Please advise me where I went wrong. – poorpractice Feb 21 '19 at 19:10
  • 1
    Give a code to generate the above examples, even a sample will work. Also, add what have you tried so far. The above link gives complete desc of what I'm referring to. Also, [look](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – meW Feb 21 '19 at 19:12
  • @meW the reason I chose to exclude the code used to produce the dataframe is because it is a lengthy concatenation and rearranging of multiple dataframes. It's too long to look through and no one part provides a clear enough demonstration of how the dataframe has been created, hence I opted to omit it as I didn't think that it was pertinent to my example. The column names are there and both the sample output and desired output are included so I'm not too sure what's missing. – poorpractice Feb 21 '19 at 19:21
  • @ALollz I stated in the question at the top that columns exist for every month of the year. I will make the output more sensible. – poorpractice Feb 21 '19 at 19:23
  • @poorpractice Your expected output values are hard to correlate with original data. Also, I see 2 March within Month! Better add in question how are you arriving on them. – meW Feb 21 '19 at 19:27
  • @meW I fixed the output, hopefully this makes more sense now as it matches! – poorpractice Feb 21 '19 at 19:28
  • 1
    @ALollz I changed the output to match so that it makes sense now, my apologies. – poorpractice Feb 21 '19 at 19:28

2 Answers2

4

If you convert the columns to a MultiIndex you can use stack:

In [11]: df1 = df.set_index("Hom")

In [12]: df1.columns = pd.MultiIndex.from_tuples(df1.columns.map(lambda x: tuple(x.split())))

In [13]: df1
Out[13]:
    March       April
    Plans Ships Plans Ships
Hom
CAD    12     5     4    13
USA     7     6     2    11
CAD     4     9     6    14
CAD    13     3     9     7

In [14]: df1.stack(level=0)
Out[14]:
           Plans  Ships
Hom
CAD April      4     13
    March     12      5
USA April      2     11
    March      7      6
CAD April      6     14
    March      4      9
    April      9      7
    March     13      3

In [21]: res = df1.stack(level=0)

In [22]: res.index.names = ["Hom", "Month"]

In [23]: res.reset_index()
Out[23]:
   Hom  Month  Plans  Ships
0  CAD  April      4     13
1  CAD  March     12      5
2  USA  April      2     11
3  USA  March      7      6
4  CAD  April      6     14
5  CAD  March      4      9
6  CAD  April      9      7
7  CAD  March     13      3
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Thanks for the answer! I tried this solution (I have additional columns which are between the 'Hom' column and the 'Month' column so I tried just including them in the index names. I receive an error "cannot reindex from a duplicate axis". I'm assuming that somewhere in my code I can include an "allow duplicates" parameter? Or do I need to do something else? I will update my original dataframe. – poorpractice Feb 21 '19 at 20:09
  • It's coming from line "df1.stack(level=0)" – poorpractice Feb 21 '19 at 20:19
  • If I call 'print(totaldf.columns)' I get: 'Index(['Hom', 'Origin', 'Contract', 'Customer', 'Destination', 'C/O Party', 'Summer', 'Pre-Winter', 'Winter', 'LD/Offload', 'March Plans', March Ships', 'April Plans', 'April Ships', ...], dtype='object')' – poorpractice Feb 21 '19 at 20:38
  • I did write the totaldf.set_index("Hom") and it doesn't appear to be having any issues there – poorpractice Feb 21 '19 at 20:41
  • I used index_series = pandas.Series(df1.index.values) and then print(index_series.value_counts()). It would appear that the duplicates are in the "Hom" column by ROWS – poorpractice Feb 21 '19 at 20:50
  • @poorpractice but that's also true in the original example / my code above. – Andy Hayden Feb 21 '19 at 20:51
  • you're absolutely correct, but for some reason that's what's causing the error? Edit: I found that code from this post, perhaps this will help. https://stackoverflow.com/questions/30986989/reindex-a-dataframe-with-duplicate-index-values – poorpractice Feb 21 '19 at 20:52
  • @poorpractice i don't think so. Are there duplicate columns? – Andy Hayden Feb 21 '19 at 20:53
  • No duplicate columns whatsoever – poorpractice Feb 21 '19 at 20:54
  • Try dropping the columns which don't match "Month Thing" – Andy Hayden Feb 21 '19 at 21:00
  • I dropped all columns between Loc and the first "March" entry, and still received the same erorr. – poorpractice Feb 21 '19 at 21:12
2

You can use pd.wide_to_long, with a little extra work in order to have the right stubnames, given that as mentioned in the docs:

The stub name(s). The wide format variables are assumed to start with the stub names.

So it will be necessary to slightly modify the column names so that the stubnames are at the beginning of each column name:

m = df.columns.str.contains('Plans|Ships')
cols = df.columns[m].str.split(' ')
df.columns.values[m] = [w+month for month, w in cols]

print(df)
   Hom  PlansMarch  ShipsMarch  PlansApril  ShipsApril
0  CAD          12           5           4          13
1  USA           7           6           2          11
2  CAD           4           9           6          14
3  CAD          13           3           9           7

Now you can use pd.wide_to_long using ['Ships', 'Plans'] as stubnames in order to obtain the output you want:

((pd.wide_to_long(df.reset_index(), stubnames=['Ships', 'Plans'], i = 'index', 
                j = 'Month', suffix='\w+')).reset_index(drop=True, level=0)
                .reset_index())

x  Month  Hom  Ships  Plans
0  March  CAD      5     12
1  March  USA      6      7
2  March  CAD      9      4
3  March  CAD      3     13
4  April  CAD     13      4
5  April  USA     11      2
6  April  CAD     14      6
7  April  CAD      7      9
yatu
  • 86,083
  • 12
  • 84
  • 139
  • Thanks for the reply @yatu. I updated my original example to include the fact that there are several additional columns in between the "Hom" column and the desired "Month" column. Will this affect the implementation of your solution at al? – poorpractice Feb 21 '19 at 20:15
  • I've added a small change to take this into consideration. Should work now @poorpractice :-) – yatu Feb 21 '19 at 20:17
  • I receive "arrays used as indices must be of integer (or boolean) type" in line"totaldf.columns[m].str.split(' ') – poorpractice Feb 21 '19 at 20:33
  • I added another update, did u try with the latest version? @poorpractice – yatu Feb 21 '19 at 21:15
  • now I get, in line: totaldf.columns.values[m] = [w+month for month, w in cols] a ValueError: not enough values to unpack (expected 2, got 1) – poorpractice Feb 21 '19 at 21:19
  • `df.columns[m].str.split(' ')` is assuming the columns like `March Plans` have a space between them. Make sure thats the case – yatu Feb 21 '19 at 21:24
  • I'm dumb, I don't know how I could have missed that – poorpractice Feb 21 '19 at 21:27
  • Finally managed to get it working, thank you! Last question - if I'd like to keep the "month" column in the current position, i.e. after all columns but before Ships/Plans, what would my best course of action be? – poorpractice Feb 21 '19 at 21:52
  • For that do `final_df[['Hom','Month','Ships','Plans']]`. Don't forget to accept if it solved it for you :-) – yatu Feb 21 '19 at 21:57