
First, apologies: I am unable to reproduce this error with code, so I will describe it as best I can using screenshots of the data and errors.

I've got a large dataframe indexed by 'Year' and 'Season', with columns for latitude, longitude, rainfall and a few others. [screenshot of the dataframe]

It is organised to respect the annual sequence 'Winter', 'Spring', 'Summer', 'Autumn' (numbered 1 to 4 in the Season column), and I need to keep this sequence after conversion to an xarray Dataset. But if I try to convert straight to a Dataset:

future = future.to_xarray()

I get the following error (`cannot handle non-unique multi-index`): [screenshot of the error]

So it is clear I need to reindex by unique identifiers. I tried using just lat and lon, but this gives the same error (as there are duplicates). Resetting the index and then reindexing by lat, lon and time, like so:
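A minimal sketch of how this kind of failure can arise, using made-up values rather than the real data. The column name `rr` and all numbers are assumptions for illustration; the point is only that several rows share the same ('Year', 'Season') pair, so the index is not unique and the conversion is refused.

```python
import pandas as pd

# Toy frame (hypothetical values): three rows share Year=1951, Season=1.
df = pd.DataFrame({
    "Year": [1951, 1951, 1951],
    "Season": [1, 1, 1],
    "latitude": [51.0, 51.0, 51.0],
    "longitude": [10.6, 10.8, 10.6],
    "rr": [0.1, 0.2, 0.3],
}).set_index(["Year", "Season"])

print(df.index.is_unique)  # the (Year, Season) pairs repeat

# to_xarray() needs every index entry to identify exactly one row.
try:
    df.to_xarray()
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```

The exact wording of the message depends on the xarray version, which is why it is printed rather than quoted here.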

future = future.reset_index()
future.head()

[screenshot: dataframe after reset_index]

future.set_index(['latitude', 'longitude', 'time'], inplace=True)
future.head()

[screenshot: dataframe indexed by latitude, longitude and time]

allows for the

future = future.to_xarray()

code to work:

[screenshot: the resulting Dataset]

The problem is that this has now lost its annual sequencing. You can see from the Season variable in the Dataset that it starts at '1', '1', '1' for the first three months of the year but then jumps to '3', '3', '3', meaning we go from winter to summer and skip spring.

This is only the case after re-indexing the dataframe, but I can't convert it to a Dataset without re-indexing, and I can't seem to re-index without disrupting the annual sequence. Is there some way to fix this?

I hope this is clear and the error is illustrated enough for someone to be able to help!

EDIT: I think the issue is that when the data is indexed by date, the dates are automatically ordered chronologically (e.g. 1952 follows 1951). I don't want this. I want to keep the sequence of the initial dataframe, which is organised seasonally but could have a spring from 1955 followed by a summer from 2000 followed by an autumn from 1976. I need to retain this sequence.
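The reordering described in the EDIT can be illustrated with a small sketch. The values and the `rr` column are made up; the assumption being demonstrated is that the levels of a pandas MultiIndex built with `set_index` are stored sorted, and `to_xarray` uses those levels as coordinates, so the row order of the dataframe does not carry over.

```python
import pandas as pd

# Toy frame (hypothetical values), deliberately out of chronological order:
# a summer 2000 row comes before a winter 1951 row.
df = pd.DataFrame({
    "time": pd.to_datetime(["2000-06-30", "1951-01-31"]),
    "latitude": [51.0, 51.0],
    "longitude": [10.6, 10.6],
    "rr": [1.0, 2.0],
})

ds = df.set_index(["latitude", "longitude", "time"]).to_xarray()

# The time coordinate comes back sorted, not in dataframe row order.
times = ds["time"].values
print(times)
print(ds["rr"].values)
```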

EDIT 2:

So the Dataset looks like this [screenshot not reproduced] when I set 'Year' as the index or just keep the generic index, but I need the tg variable to have lat/lon associated with it, so that the Dataset looks like this:

<xarray.Dataset>
Dimensions:    (Year: 190080)
Coordinates:
  * Year       (Year) int64 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
Data variables:
    Season     (Year) object '1' '1' '2' '2' '2' '3' '3' '3' '4' '4' '4' '1' ...
    latitude   (Year) float64 51.12 51.12 51.12 51.12 51.12 51.12 51.12 ...
    longitude  (Year) float64 -10.88 -10.88 -10.88 -10.88 -10.88 -10.88 ...
    seasdif    (Year) float32 -0.79192877 -0.79192877 -0.55932236 ...
    tg         (Year, latitude, longitude) float32 nan nan nan nan nan nan nan nan nan nan nan ...
    time       (Year) datetime64[ns] 1970-01-31 1970-02-28 1970-03-31 ...
Pad
  • I am not familiar with `to_xarray`, doesn't it sort your data according to your index since it uses it as coordinates? If so, that could explain that the first printed data won't be sorted the way you expect them to be. – ysearka Aug 20 '18 at 12:03
  • Yes, that is correct. But I can't index the data by just latitude and longitude or I hit an error (same as above), so I am not sure how to overcome this. – Pad Aug 20 '18 at 13:47
  • Is there a problem in using both temporal and geographical information as coordinates? `future.set_index(['Year','season','latitude', 'longitude', 'time'])` – ysearka Aug 20 '18 at 14:21
  • When I try this it works and creates a dataset which I converted to NetCDF, but when I try and open it in panoply it says `NcException: Could not initiate dimension variable Season` – Pad Aug 20 '18 at 15:34
  • Your season is an object. What happens when you convert it to a number? df['Season'] = pandas.to_numeric(df['Season']) Regarding your indexing problem I guess that is rather a logic mistake and how xarray works. I think @ysearka points in the right direction. Is xarray a hard requirement as an export format? Otherwise you could try a more suitable one for this case. – Viktor Aug 22 '18 at 20:12
  • I tried converting to a number and then re-indexing using Season, but it still gives the `cannot handle non-unique multi-index` error, so I then have to index by 'time', which again organises the data chronologically. Xarray isn't a hard requirement as an export format. I need to convert the dataframe to netcdf. Is there another way of doing this in Python? Apologies, I am not an expert in this at all. – Pad Aug 23 '18 at 07:53
  • Here is something about other way to export to netcdf - https://stackoverflow.com/questions/14035148/import-netcdf-file-to-pandas-dataframe – Anna Iliukovich-Strakovskaia Aug 23 '18 at 09:01
  • @anna this is about importing not exporting? – Pad Aug 23 '18 at 09:17
  • @Pad I'm sorry, you are right. What about using Year, Season and time as index? Are they unique? – Anna Iliukovich-Strakovskaia Aug 23 '18 at 09:37
  • Yes this writes the dataframe but it loses the time sequence I have given it, and orders it chronologically. Once I order by time (the only unique variable) it sequences the data chronologically, which I need to avoid doing. – Pad Aug 23 '18 at 09:39
  • What I tried was adding a generic index and converting it directly to xarray. It converted without an error for my minimal example. But I am unable to drop the generic index column from the xarray. This did not change my order. – Interested_Programmer Aug 24 '18 at 14:02
  • What generic index did you use? I tried using a separate date index but it did not work as it ran on for too long (the dataframe is thousands of lines long due to multiple lat/lons). I need a generic index that will allow for the sequencing to be preserved :/ – Pad Aug 24 '18 at 14:25
  • What if you don't use multi-index, do you still need the index to be unique? Like, use .set_index('Year') and try converting to xarray? I didn't get any error messages doing a similar trial – Alexandr Kapshuk Aug 24 '18 at 14:41
  • I just used reset_index to set a numeric index. Basically I did not proceed to this step : (future.set_index(['latitude', 'longitude', 'time'], inplace=True)) And I just converted to xarray – Interested_Programmer Aug 24 '18 at 14:48
  • @alexandr yes - this creates the file! But the variables no longer have coordinates associated with them, I will see if my model runs using this data. Thank you! – Pad Aug 24 '18 at 15:24
  • @interested this also works, but with the same coordinate issue. I will try and assign the coordinates and see if this changes things. Thank you! – Pad Aug 24 '18 at 15:27
  • After creating xarray use df.set_coords(['Year','Season']) – Interested_Programmer Aug 24 '18 at 18:10
  • @Int sorry for delay, when I do this it generates the file, in the following format (https://pasteboard.co/HB6nAo4.png) - the problem is that the variables are indexed by (Year: 190080), and not latitude and longitude, and when I assign lat/lon as coordinates this doesn't apply them to individual variables, so I can't actually plot these variables as maps as they don't have associated lat/lon. Ideally I need to look at the `tg` variable with associated time, lat and lon coordinates. But I can't assign these (just stays as year) as the dataframe was only indexed by 'Year' or the generic index – Pad Aug 27 '18 at 11:22
  • I reset coords to lat long. This is how my df looks. (https://pasteboard.co/HBa8JW2.jpg). Could you update the question with your code for plot and how you ideally want it to look. – Interested_Programmer Aug 27 '18 at 20:55
  • @int I have edited the code to illustrate what I mean (see the tg variable) – Pad Aug 28 '18 at 09:48
  • @Pad What is this tg variable? Is this generated or an existing variable? – Interested_Programmer Aug 28 '18 at 13:05
  • @int it's an existing temperature at ground variable! Equivalent to rr in upper screenshots (sorry!) – Pad Aug 28 '18 at 13:33

1 Answer

Tell me if this works for you. I have added an extra index column and use it to sort at the end.

import pandas as pd
import xarray as xr
import numpy as np

df = pd.DataFrame({'Year': [1951, 1951, 1951, 1951],
                   'Season': [1, 1, 1, 3],
                   'lat': [51, 51, 51, 51],
                   'long': [10.8, 10.8, 10.6, 10.6],
                   'time': ['1950-12-31', '1951-01-31', '1951-02-28', '1950-12-31']})

I turned the original row index into a separate column, 'Order', and then included it in set_index. This is because I could only sort by an index or a 1-D column, and we have three coordinates.

df.reset_index(level=0, inplace=True)
df = df.rename(columns={'index': 'Order'})
df['time'] = pd.to_datetime(df['time'])
df.set_index(['lat', 'long', 'time','Order'], inplace=True)
df.head()
df = df.to_xarray()

This should preserve the order and have lat, lon and time associated with tg (I don't have tg in my df, though).

# sortby returns a new Dataset, so keep the result
df2 = df.sortby('Order')

You could also drop the 'Order' column, though I am not sure if it will alter your order. (It does not alter mine.)

# drop also returns a new Dataset rather than modifying df2 in place
df2 = df2.drop('Order')

df
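A possible variation, offered as a sketch under assumptions rather than a tested fix for the real data. Because `to_xarray` builds a dense array over every combination of index levels, adding 'Order' as a fourth dimension can make the result much larger. One alternative is to number each distinct time stamp by its first appearance and use that number as the sequence dimension, keeping `time` and `Season` as ordinary coordinates along it. The name `step`, the toy values and the `tg` column below are hypothetical, and the sketch assumes each (time, latitude, longitude) combination occurs only once.

```python
import pandas as pd

# Toy data (hypothetical values) in a non-chronological sequence:
# summer 2000 comes before winter 1951.
df = pd.DataFrame({
    "time": pd.to_datetime(["2000-06-30", "2000-06-30",
                            "1951-01-31", "1951-01-31"]),
    "Season": [3, 3, 1, 1],
    "latitude": [51.0, 51.0, 51.0, 51.0],
    "longitude": [10.6, 10.8, 10.6, 10.8],
    "tg": [15.0, 16.0, 2.0, 3.0],
})

# Number each distinct time stamp in order of first appearance (0, 1, 2, ...).
df["step"] = pd.factorize(df["time"])[0]

# One row per step, used later to attach time and Season as coordinates.
per_step = (df.drop_duplicates("step")
              .set_index("step")[["time", "Season"]]
              .sort_index())

# 'step' is increasing in the desired sequence, so sorting it keeps the order.
ds = df.set_index(["step", "latitude", "longitude"])[["tg"]].to_xarray()
ds = ds.assign_coords(
    time=("step", per_step["time"].to_numpy()),
    Season=("step", per_step["Season"].to_numpy()),
)

print(ds)
```

Here `tg` is indexed by (step, latitude, longitude), so it can still be selected per grid cell, while `ds["time"]` records which date each step refers to without forcing a chronological order.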

  • Thank you! When I try and convert to xarray I'm hitting a memory error, strange as it worked very quickly before... I will try and make it quicker somehow... – Pad Aug 29 '18 at 09:35
  • Sorry, I am still hitting memory errors. I am trying to split it into smaller pieces! – Pad Aug 30 '18 at 09:24