0

I have a huge .csv file(2.3G) which I have to read into pandas dataframe.

start_date,wind_90.0_0.0,wind_90.0_5.0,wind_87.5_2.5
1948-01-01,15030.64,15040.64,16526.35
1948-01-02,15050.14,15049.28,16526.28
1948-01-03,15076.71,15075.0,16525.28

I want to process above data into below structure:

    start_date    lat    lon    wind
0   1948-01-01    90.0    0.0   15030.64
1   1948-01-01    90.0    5.0   15040.64
2   1948-01-01    87.5    2.5   16526.35
3   1948-01-02    90.0    0.0   15050.14
4   1948-01-02    90.0    5.0   15049.28
5   1948-01-02    87.5    2.5   16526.28
6   1948-01-03    90.0    0.0   15076.71
7   1948-01-03    90.0    5.0   15075.0
8   1948-01-03    87.5    2.5   16525.28

Code I have so far which does what I want but is too slow and takes up a lot of memory.

def load_data_as_pandas(fileName, featureName):
    df = pd.read_csv(fileName)
    df = pd.melt(df, id_vars = df.columns[0])
    df['lat'] = df['variable'].str.split('_').str[-2]
    df['lon'] = df['variable'].str.split('_').str[-1]
    df = df.drop('variable', axis=1)
    df.columns = ['start_date', featureName,'lat','lon']
    df = df.groupby(['start_date','lat','lon']).first()
    df = df.reset_index()
    df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d', errors='coerce')
    return df

1 Answers1

0

This should spead up your code:

We can use melt to unpivot your data from wide to long. Then we use str.split on the column name (values) and use expand=True to get a new column for each split. Finally we join these newly created columns back to our original dataframe:

melt = df.melt(id_vars='start_date').sort_values('start_date').reset_index(drop=True)

newcols = melt['variable'].str.split('_', expand=True).iloc[:, 1:].rename(columns={1:'lat', 2:'lon'})

final = melt.drop(columns='variable').join(newcols)

Output

   start_date     value   lat  lon
0  1948-01-01  15030.64  90.0  0.0
1  1948-01-01  15040.64  90.0  5.0
2  1948-01-01  16526.35  87.5  2.5
3  1948-01-02  15050.14  90.0  0.0
4  1948-01-02  15049.28  90.0  5.0
5  1948-01-02  16526.28  87.5  2.5
6  1948-01-03  15076.71  90.0  0.0
7  1948-01-03  15075.00  90.0  5.0
8  1948-01-03  16525.28  87.5  2.5

Timeit test on 800k rows:

3.55 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Erfan
  • 40,971
  • 8
  • 66
  • 78