1

I have the dataframe of following type-

Year MN DT HR01 HR02 HR03 HR04 PRMEAN PRSMAX PRSMIN
2011 1  1    34   35   67   78     12    235    256
2011 1  2    12   56   78   56     23    256    267
2011 2  1    12   2    32   12     74    221    435

where MN:Months, DT:Date,HR:Hour .Here I want the HR data to be aligned column wise to its date in the following way.

Year MN DT HR PRMEAN PRSMAX PRSMIN
2011 1  1  34     12    235    256
2011 1  1  35     12    235    256
2011 1  1  67     12    235    256
2011 1  1  78     12    235    256
2011 1  2  12     23    256    267
2011 1  2  56     23    256    267
2011 1  2  78     23    256    267
2011 1  2  56     23    256    267
2011 2  1  12     74    221    435
2011 2  1  2     74     221     435
2011 2  1  32     74    221    435
2011 2  1  12     74    221    435

I am really not been able to figure out , how to do it?

Bing
  • 631
  • 1
  • 8
  • 23
  • I am getting `Key Error:Year` by using this command – Bing Feb 01 '19 at 07:47
  • Still the same error..( – Bing Feb 01 '19 at 07:50
  • You can also add `.sort_values(['PRSMIN'])` to the end of the above expression in order to have the rows appear in the order depicted in your desired output. – James Dellinger Feb 01 '19 at 07:51
  • This is `Index(['index', 'YEAR ', 'MN ', 'DT ', '..HR01', '..HR02', '..HR03', '..HR04', '..HR05', '..HR06', '..HR07', '..HR08', '..HR09', '..HR10', '..HR11', '..HR12', '..HR13', '..HR14', '..HR15', '..HR16', '..HR17', '..HR18', '..HR19', '..HR20', '..HR21', '..HR22', '..HR23', '..HR24', 'PRMEAN ', 'PRSMAX ', 'PRSMIN'], dtype='object')` – Bing Feb 01 '19 at 07:52
  • 1
    Now I am getting `key Error :MN` , what is happening here? – Bing Feb 01 '19 at 08:05

1 Answers1

2

Data:

print (df)
      MN   DT   HR01  HR02  HR03  HR04  PRMEAN  PRSMAX  PRSMIN
2011    1    1    34    35    67    78      12     235     256
2011    1    2    12    56    78    56      23     256     267
2011    2    1    12     2    32    12      74     221     435

If check columns thre is no column Year (becuse index) and columns MN and DT have traling whitespaces:

print (df.columns)
Index(['MN ', 'DT ', 'HR01', 'HR02', 'HR03', 'HR04', 'PRMEAN', 'PRSMAX',
       'PRSMIN'],
      dtype='object')

Solution is remove them by str.strip, convert index to column and rename:

df.columns = df.columns.str.strip()
df = df.reset_index().rename(columns={'index':'Year'})

print (df)
   Year  MN  DT  HR01  HR02  HR03  HR04  PRMEAN  PRSMAX  PRSMIN
0  2011   1   1    34    35    67    78      12     235     256
1  2011   1   2    12    56    78    56      23     256     267
2  2011   2   1    12     2    32    12      74     221     435

print (df.columns)
Index(['Year', 'MN', 'DT', 'HR01', 'HR02', 'HR03', 'HR04', 'PRMEAN', 'PRSMAX',
       'PRSMIN'],
      dtype='object')

Then use melt with sort_values and remove variable column:

df1 = (df.melt(id_vars=['Year','MN','DT','PRMEAN','PRSMAX','PRSMIN'],value_name='HR')
         .sort_values('variable')
         .drop('variable', axis=1))

Last change ordering of columns:

df1 = df1[['Year','MN','DT','HR','PRMEAN','PRSMAX','PRSMIN']]
print (df1)
    Year  MN  DT  HR  PRMEAN  PRSMAX  PRSMIN
0   2011   1   1  34      12     235     256
3   2011   1   1  35      12     235     256
6   2011   1   1  67      12     235     256
9   2011   1   1  78      12     235     256
1   2011   1   2  12      23     256     267
4   2011   1   2  56      23     256     267
7   2011   1   2  78      23     256     267
10  2011   1   2  56      23     256     267
2   2011   2   1  12      74     221     435
5   2011   2   1   2      74     221     435
8   2011   2   1  32      74     221     435
11  2011   2   1  12      74     221     435

EDIT:

So there is column YEAR, so need:

df = pd.read_csv('Pres.csv')

df.columns = df.columns.str.strip()

df1 = (df.melt(id_vars=['YEAR','MN','DT','PRMEAN','PRSMAX','PRSMIN'],
               value_name='HR',
               var_name='HOUR')
         .assign(HOUR = lambda x: x.HOUR.str.extract('(\d+)'))
         .sort_values(['YEAR','MN','DT','HOUR'])
         )

print (df1.head(30))
       YEAR  MN  DT PRMEAN PRSMAX PRSMIN HOUR     HR
0      2001   1   1  949.5  951.8  947.8   01  949.3
3227   2001   1   1  949.5  951.8  947.8   02  949.1
6454   2001   1   1  949.5  951.8  947.8   03  948.5
9681   2001   1   1  949.5  951.8  947.8   04  948.5
12908  2001   1   1  949.5  951.8  947.8   05  948.3
16135  2001   1   1  949.5  951.8  947.8   06  948.6
19362  2001   1   1  949.5  951.8  947.8   07  949.5
22589  2001   1   1  949.5  951.8  947.8   08  950.3
25816  2001   1   1  949.5  951.8  947.8   09  951.6
29043  2001   1   1  949.5  951.8  947.8   10  951.8
32270  2001   1   1  949.5  951.8  947.8   11  951.8
35497  2001   1   1  949.5  951.8  947.8   12  950.8
38724  2001   1   1  949.5  951.8  947.8   13  949.7
41951  2001   1   1  949.5  951.8  947.8   14  948.8
45178  2001   1   1  949.5  951.8  947.8   15  947.8
48405  2001   1   1  949.5  951.8  947.8   16  947.8
51632  2001   1   1  949.5  951.8  947.8   17  947.8
54859  2001   1   1  949.5  951.8  947.8   18  947.8
58086  2001   1   1  949.5  951.8  947.8   19  948.8
61313  2001   1   1  949.5  951.8  947.8   20  949.7
64540  2001   1   1  949.5  951.8  947.8   21  949.9
67767  2001   1   1  949.5  951.8  947.8   22  950.7
70994  2001   1   1  949.5  951.8  947.8   23  950.6
74221  2001   1   1  949.5  951.8  947.8   24    950
1      2001   1   2  951.2  953.4  948.8   01  949.8
3228   2001   1   2  951.2  953.4  948.8   02  949.5
6455   2001   1   2  951.2  953.4  948.8   03    949
9682   2001   1   2  951.2  953.4  948.8   04  948.8
12909  2001   1   2  951.2  953.4  948.8   05  948.9
16136  2001   1   2  951.2  953.4  948.8   06  949.7
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • There are 2 Year columns coming ,and I have no idea why-`Index(['Year', 'YEAR', 'MN', 'DT', '..HR01', '..HR02', '..HR03', '..HR04', '..HR05', '..HR06', '..HR07', '..HR08', '..HR09', '..HR10', '..HR11', '..HR12', '..HR13', '..HR14', '..HR15', '..HR16', '..HR17', '..HR18', '..HR19', '..HR20', '..HR21', '..HR22', '..HR23', '..HR24', 'PRMEAN', 'PRSMAX', 'PRSMIN'], dtype='object')` – Bing Feb 01 '19 at 08:22
  • opps, one bug - nes sorting by `variable` column, not by `PRSMIN`. – jezrael Feb 01 '19 at 08:35
  • Now just getting an empty data frame – Bing Feb 01 '19 at 08:40
  • So if use only `df.columns = df.columns.str.strip()` `df1 = (df.melt(id_vars=['YEAR','MN','DT','PRMEAN','PRSMAX','PRSMIN'],value_name='HR') .sort_values('variable') .drop('variable', axis=1))` it not working? Are data confidental? – jezrael Feb 01 '19 at 08:43
  • Yes, It's not working and I have no clue why its not working. No, the data is not confidential, let me share a google drive link. – Bing Feb 01 '19 at 08:44
  • @Bing - For me it working nice, added solution to answer. Maybe some typo I guess or similar. – jezrael Feb 01 '19 at 08:51
  • Oh, it worked , must be taking some garbage values from the previous inputs.,but how to sort the Year, MT,DT values correspondingly – Bing Feb 01 '19 at 08:57
  • 1
    So need change `.sort_values('variable')` to `.sort_values(['variable','Year','MT','DT'])` or `.sort_values(['Year','MT','DT'])` ? – jezrael Feb 01 '19 at 08:59
  • No, wait a sec , the output is not correct , I need all the hour values corresponding to every date in the dataframe but it is giving only the first value – Bing Feb 01 '19 at 09:05
  • So basically it will be 24 values(corresponding to each hour) to every day – Bing Feb 01 '19 at 09:05
  • 1
    @Bing - I change solution with not removing HOURS and sorting, can you check now? – jezrael Feb 01 '19 at 09:13
  • I need to ask you a question because I am just awestruck by what happened here- How to become this much good like you? – Bing Feb 01 '19 at 09:23
  • 1
    @Bing - Many hours coding, many years coding... Not easy way, unfortunately. – jezrael Feb 01 '19 at 09:25
  • 1
    Yep, There's never been an easy way.Thanks a lot – Bing Feb 01 '19 at 09:26