This question is similar to How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?
I have a text file in this format
year jan feb mar apr may jun jul aug sep oct nov dec win spr sum aut ann
2017 0.2 3.6 5.0 4.2 8.8 12.2 12.9 11.7 9.7 9.2 3.5 1.8 2.01 6.01 12.27 7.48 6.92
2018 2.4 -0.5 1.9 6.6 7.9 10.8 13.5 12.8 9.6 7.2 5.2 3.8 1.32 5.43 12.36 7.33 6.80
2019 0.9 1.8 4.4 3.6 6.5 10.8 13.3 12.6 10.0 7.2 3.6 2.9 2.22 4.85 12.25 6.90 6.49
2020 3.8 3.3 2.8 4.8 6.9 3.31 4.81
The text file has an irregular number of spaces between columns [3-4] and I do not need the columns ['win','spr','sum','aut','ann']
Firstly, to handle the irregular spaces I used this:
parse_column = ['year']
weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s+',parse_dates=parse_column, engine='python')
However, this collapsed the values for 'win' and 'spr' into 'jun' and 'jul'
Next I tried
parse_column = ['year']
weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s[0-4]',parse_dates=parse_column, engine='python')
But this results in
ValueError: 'year' is not in list
Finally I tried to remove the unnecessary columns as part of the import like this:
parse_column = ['year']
weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s+',parse_dates=parse_column, engine='python',usecols=['year','jan','feb','mar','apr','may','jun','jul','aug','sep','oct', 'nov','dec'])
This however produces the same result as the first attempt.
I'm hoping there's a relatively simple regex that I'm missing, but variations on r'\s[01-5]'
either exclude the 'year' column or return error messages such as x columns expected, y found
I'm trying to avoid having to remove these incorrectly parsed values after loading as there are so many variations of erring data as we move through the year.