Grouping non-sorted year column values to a single column using pandas

Question

I have a tsv file with the following values :

Jan year  Feb year  March year
23  1992  34  1991   43   1993
25  1990  36  1993   45   1990
21  1993  38  1992   47   1991
27  1991  30  1990   49   1992

How can I merge all the year columns to a single column so that I can have something like this :

year  Jan Feb March
1990  25   30  45
1991  27   34  47
1992  23   38  49
1993  21   36  43

I'm using pandas, and since the years are shuffled, hence I'm unable to consolidate them. Please let me know on what needs to be done to achieve the same.

Here is the data source, and in a nutshell, I want a single year column instead of multiple year columns.

Could you provide the data? Here you can see how to do it easily: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples/20159305#20159305 — Marco_CH, Jan 19 '22 at 13:20

tlgs · Answer 1 · 2022-01-19T18:58:51.240

The code below fetches the data directly from the URL you provided and should generate the output you expect.

import pandas as pd

df = pd.read_csv(
    "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Tmin/ranked/England_SE_and_Central_S.txt",
    sep="\s+",
    skiprows=5,
    storage_options={"User-Agent": "Mozilla/5.0"}
)

months = []
for i in range(df.shape[1] // 2):
    tmp = pd.DataFrame(df.iloc[:, 2*i]).set_index(df.iloc[:, 2*i + 1])
    months.append(tmp)

final_df = pd.concat(months, axis=1)

Output:

>>> final_df
      jan  feb  mar  apr  may   jun   jul   aug   sep  oct  nov  dec    win   spr    sum   aut   ann
1884  3.5  2.6  2.5  2.5  6.3   9.0  11.6  11.5  10.5  4.8  2.1  2.1    ---  3.77  10.75  5.79  5.76
1885 -0.0  3.2  0.4  3.2  4.8   9.2  10.8   9.4   8.3  3.9  3.2  0.4   1.69  2.78   9.82  5.11  4.72
1886 -1.1 -1.8  0.2  3.4  6.1   8.6  11.0  11.1   9.9  7.8  3.1 -1.0  -0.80  3.23  10.27  6.91  4.81
1887 -1.6 -0.1 -0.1  1.2  5.3   9.2  11.4  10.0   7.6  2.7  1.7 -0.1  -0.92  2.13  10.23  3.98  3.96
1888 -0.3 -1.6 -0.1  2.0  5.3   8.9  10.1  10.1   8.5  2.8  5.6  2.0  -0.66  2.39   9.73  5.58  4.45
...   ...  ...  ...  ...  ...   ...   ...   ...   ...  ...  ...  ...    ...   ...    ...   ...   ...
2017  0.2  3.6  5.5  4.2  8.7  11.9  13.2  11.8   9.8  9.1  3.3  2.0   2.07  6.14  12.29  7.44  6.96
2018  2.9 -0.5  2.1  6.7  8.0  11.1  13.7  12.5   9.3  6.9  5.3  4.4   1.55  5.60  12.47  7.16  6.92
2019  0.7  2.2  4.8  4.4  6.4  10.5  12.8  12.4  10.2  7.5  3.6  3.1   2.43  5.23  11.95  7.12  6.59
2020  3.8  3.8  3.2  5.1  7.1  10.9  11.7  14.1   9.9  7.8  5.9  2.8   3.55  5.12  12.24  7.87  7.18
2021  0.7  2.3  3.1  1.2  6.0  11.5  13.3  12.1  11.6  8.8  3.9  4.6   1.92  3.44  12.28  8.10  6.61

When I import the tsv file to a df, year columns get renamed to ```year```, ```year.1```, ```year.2``` and so on. Hence this code doesn't work, and throws error ```KeyError: "None of ['year'] are in the columns"``` — Aditya Singh, Jan 19 '22 at 13:45
I updated my answer now that you provided a data source. Let me know if that works. — tlgs, Jan 19 '22 at 13:50

Corralien · Accepted Answer · 2022-01-19T13:41:54.310

Group your columns by virtual groups:

out = df.groupby(np.repeat(np.arange(len(df.columns) // 2), 2), axis=1) \
        .apply(lambda x: x.set_index('year')).droplevel(0, axis=1).reset_index()
print(out)

# Output
   year  Jan  Feb  March
0  1990   25   30     45
1  1991   27   34     47
2  1992   23   38     49
3  1993   21   36     43

With your data, use this code before:

df = pd.read_csv('England_SE_and_Central_S.txt', skiprows=5, sep='\s+')
df.columns = df.columns.str.split('.').str[0]

out = df.groupby(np.repeat(np.arange(len(df.columns) // 2), 2), axis=1) \
        .apply(lambda x: x.set_index('year')).droplevel(0, axis=1).reset_index()

Output:

>>> out
     year  jan  feb  mar  apr  may   jun   jul   aug   sep  oct  nov  dec    win   spr    sum   aut   ann
0    1884  3.5  2.6  2.5  2.5  6.3   9.0  11.6  11.5  10.5  4.8  2.1  2.1    ---  3.77  10.75  5.79  5.76
1    1885 -0.0  3.2  0.4  3.2  4.8   9.2  10.8   9.4   8.3  3.9  3.2  0.4   1.69  2.78   9.82  5.11  4.72
2    1886 -1.1 -1.8  0.2  3.4  6.1   8.6  11.0  11.1   9.9  7.8  3.1 -1.0  -0.80  3.23  10.27  6.91  4.81
3    1887 -1.6 -0.1 -0.1  1.2  5.3   9.2  11.4  10.0   7.6  2.7  1.7 -0.1  -0.92  2.13  10.23  3.98  3.96
4    1888 -0.3 -1.6 -0.1  2.0  5.3   8.9  10.1  10.1   8.5  2.8  5.6  2.0  -0.66  2.39   9.73  5.58  4.45
..    ...  ...  ...  ...  ...  ...   ...   ...   ...   ...  ...  ...  ...    ...   ...    ...   ...   ...
133  2017  0.2  3.6  5.5  4.2  8.7  11.9  13.2  11.8   9.8  9.1  3.3  2.0   2.07  6.14  12.29  7.44  6.96
134  2018  2.9 -0.5  2.1  6.7  8.0  11.1  13.7  12.5   9.3  6.9  5.3  4.4   1.55  5.60  12.47  7.16  6.92
135  2019  0.7  2.2  4.8  4.4  6.4  10.5  12.8  12.4  10.2  7.5  3.6  3.1   2.43  5.23  11.95  7.12  6.59
136  2020  3.8  3.8  3.2  5.1  7.1  10.9  11.7  14.1   9.9  7.8  5.9  2.8   3.55  5.12  12.24  7.87  7.18
137  2021  0.7  2.3  3.1  1.2  6.0  11.5  13.3  12.1  11.6  8.8  3.9  4.6   1.92  3.44  12.28  8.10  6.61

[138 rows x 18 columns]

This is super elegant. arrange-->repeat-->groupby-->reset index — Pradeep, Jan 19 '22 at 14:41

Grouping non-sorted year column values to a single column using pandas

2 Answers2