1

This question is similar to How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

I have a text file in this format

year    jan    feb    mar    apr    may    jun    jul    aug    sep    oct    nov    dec     win     spr     sum     aut     ann
2017    0.2    3.6    5.0    4.2    8.8   12.2   12.9   11.7    9.7    9.2    3.5    1.8    2.01    6.01   12.27    7.48    6.92
2018    2.4   -0.5    1.9    6.6    7.9   10.8   13.5   12.8    9.6    7.2    5.2    3.8    1.32    5.43   12.36    7.33    6.80
2019    0.9    1.8    4.4    3.6    6.5   10.8   13.3   12.6   10.0    7.2    3.6    2.9    2.22    4.85   12.25    6.90    6.49
2020    3.8    3.3    2.8    4.8    6.9                                                     3.31    4.81                        

The text file has an irregular number of spaces between columns [3-4] and I do not need the columns ['win','spr','sum','aut','ann']

Firstly, to handle the irregular spaces I used this:

    parse_column = ['year']
    weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s+',parse_dates=parse_column, engine='python')

However, this collapsed the values for 'win' and 'spr' into 'jun' and 'jul'

Next I tried

    parse_column = ['year']
    weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s[0-4]',parse_dates=parse_column, engine='python')

But this results in

ValueError: 'year' is not in list

Finally I tried to remove the unnecessary columns as part of the import like this:

parse_column = ['year']
weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s+',parse_dates=parse_column, engine='python',usecols=['year','jan','feb','mar','apr','may','jun','jul','aug','sep','oct', 'nov','dec'])

This however produces the same result as the first attempt.

I'm hoping there's a relatively simple regex that I'm missing, but variations on r'\s[01-5]' either exclude the 'year' column or return error messages such as x columns expected, y found

I'm trying to avoid having to remove these incorrectly parsed values after loading as there are so many variations of erring data as we move through the year.

Greg Williams
  • 169
  • 1
  • 14
  • 2
    This seems to work, ```df = pd.read_csv(StringIO(text), sep="\s{1,}", parse_dates=['year'], engine="python")``` – sushanth Jun 24 '20 at 13:25
  • Thanks for this. ```sep="\s{1,}"``` appears to be functionally equivalent to ```sep=\s+``` However, that gave me the clue I needed. What meets my needs is ```sep="\s{1,5}"``` – Greg Williams Jun 24 '20 at 13:34

0 Answers0