
I am trying to format some weather data from the meteo.gr website.

All good so far:

  1. I can filter out the headers
  2. and remove empty files

The problem, as shown below, is that when a data file is missing all of the data except the date (leftmost column) in its first rows, pandas reads the file as a single-column DataFrame.

Before you ask: I am reading the files with `pd.read_csv('...', delim_whitespace=True)`.

Also, I would prefer not to skip any rows. If that is not possible, I'll manage.

How can this be solved?

 1  
 2  
 3  
 4  
 5  18.5  22.5   3:30p  13.9   5:40a   1.1   1.2   0.0   7.1  35.4   4:00p     W
 6  20.6  22.1  12:40p  16.7  12:40a   0.1   2.4   0.0  19.0  54.7   5:50p     S
 7  20.9  22.2   1:40p  20.1   7:00a   0.0   2.6   0.0  22.9  53.1  10:50a     S
 8  19.7  21.7   7:00a  16.8  10:10a   0.1   1.4  16.2  11.1  56.3   4:10a     S
 9  18.6  22.2   1:00p  14.6   7:00a   0.8   1.1   0.0  12.1  56.3   3:30p     W
10  20.8  23.2  10:50a  15.7  12:30a   0.2   2.7   0.0  25.7  69.2  10:10a     S
11  20.2  22.2  12:40a  17.7   7:30a   0.0   1.9   0.0  11.6  54.7  12:30a     W
12  17.9  20.1   1:20p  14.6  11:00p   0.8   0.3   1.6   6.9  38.6   2:20p    NW
13  16.9  19.7  12:10p  13.8   2:50a   1.7   0.2   0.0   9.0  30.6   2:10p   WNW
14  16.8  18.4   1:30p  15.8   4:30a   1.6   0.0   0.0  14.5  48.3   3:50p    NW
15  16.8  19.3   2:20p  14.6  11:50p   1.7   0.1   0.0   6.0  30.6  12:10a   NNW
16  18.6  20.8  12:20p  14.7  12:10a   0.3   0.6   0.0  15.1  45.1   2:20p    NW
17  18.6  21.8   2:30p  16.6   3:50a   0.6   0.8   0.0   9.2  29.0  12:30p    NW
18  18.9  21.6  11:40a  16.9   1:30a   0.3   0.9   0.0  13.8  38.6  10:50a    NW
19  18.2  19.4  11:10a  17.3  11:30p   0.3   0.2   0.0  14.5  45.1   3:10p   NNW
20  18.9  21.3   2:10p  17.4  12:30a   0.2   0.8   0.0  12.7  51.5   5:10a    NW
21  18.9  21.4   2:00p  17.2  12:00m   0.2   0.8   0.0  10.5  37.0   2:50p   NNW
22  17.9  20.6   3:20p  14.3  12:00m   0.9   0.5   0.0   8.4  25.7  12:30a   WNW
23  15.7  18.4   2:10p  12.6   7:00a   2.7   0.0   0.0   6.3  20.9   5:20a     W
24  16.2  18.8   1:20p  13.3   7:50a   2.2   0.1   0.0   6.8  19.3   3:10a     W
25  16.7  18.8  10:10a  13.6   6:50a   1.7   0.1   0.4   8.7  25.7   1:50p   WNW
26  16.9  20.2   1:10p  14.1  10:50p   1.6   0.2   0.0   6.9  29.0   2:20p    NW
27  15.8  19.1   2:30p  12.4   7:10a   2.6   0.1   0.0   7.2  22.5   5:00a     W
28  16.8  20.5  12:40p  13.3   6:40a   1.9   0.4   0.0   6.0  19.3   5:10a     W
29  17.8  21.4  11:20a  14.1   5:50a   1.3   0.7   0.0   5.5  20.9   6:40p     W
30  17.2  19.6  10:50a  14.6  11:50p   1.4   0.3   0.0   5.3  17.7   2:50p     W
George Pamfilis
  • What's wrong with skipping rows? They seem to be of little value – EdChum Nov 19 '15 at 12:00
  • The problem is that, across 1000 or so files, the missing data starts from a different row in each. How do I make sure that every file is read properly? – George Pamfilis Nov 19 '15 at 12:02
  • You could iteratively read each line until you get more than 1 column parsed and then treat the remaining rows as valid data – EdChum Nov 19 '15 at 12:04
  • Sometimes there are 2 columns instead of 1. In this case there should be 13, but 13 is not always the number of columns; it varies from station to station. – George Pamfilis Nov 19 '15 at 12:08
  • It's dirty data: the source does not fill the missing values with NaNs – George Pamfilis Nov 19 '15 at 12:09
  • You could do something like `for i in range(10): print('row:' , i, 'num columns:' ,pd.read_csv(io.StringIO(t), delim_whitespace=True, skiprows=i, nrows=1).shape[1])`, substituting 10 with however many lines you have. You then take the max value from the above and use it to work out how many rows to skip – EdChum Nov 19 '15 at 12:12
  • Could you open one file and try replacing `' '` with `|`? If you get `1 ||||||||` for the empty lines, the file is space-delimited there; if not, the empty lines are not delimited at all – WoodChopper Nov 19 '15 at 13:47
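
One possible way to keep every row, building on the column-counting idea in the comments, is to split each line on whitespace, find the widest row, and pad the shorter rows with NaN. The sketch below is only an illustration under assumptions: the helper name `read_padded_rows` and the shortened sample text are hypothetical, and it assumes the missing values are always the trailing ones (a row with gaps in the middle would end up misaligned).

```python
import io


def read_padded_rows(handle, fill=float("nan")):
    """Split non-blank lines on whitespace and pad short rows to the widest row.

    Hypothetical helper: assumes missing values are always at the end of a row.
    """
    rows = [line.split() for line in handle if line.strip()]
    # The widest row decides the column count, so it may differ per station.
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Shortened sample in the same layout as the data in the question.
sample = io.StringIO(
    " 1  \n"
    " 2  \n"
    " 5  18.5  22.5   3:30p  13.9   5:40a   1.1   1.2   0.0   7.1  35.4   4:00p     W\n"
)

rows = read_padded_rows(sample)
for row in rows:
    print(len(row), row[:3])
```

The padded rows can then be handed to `pd.DataFrame(rows)`, and the numeric columns converted with `pd.to_numeric(..., errors="coerce")`. If the files are strictly column-aligned, `pd.read_fwf` might be worth trying instead, since it relies on character positions rather than on whitespace splitting.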

0 Answers