Strange column jump when I use pandas to read specific columns in a .csv

Question

1. Background

The .csv file I uploaded here is an example file to explain my problem.

This file contains the air quality information for all cities in China (represented by Code) on a specific day.

For example, the column 1001A represents one city, and the values in this column are the air pollutant concentrations corresponding to the type column.

2. My problem

If I want to get the AQI value for the city 1014A at 20160205-00:00,
I just need to use:

 import pandas as pd

 df = pd.read_csv("./this file")  # placeholder path to the example file
 aqi = df["1014A"].iloc[0]

The result is 42. But when I open the same file in LibreOffice, it looks like this:

It seems that pandas misreads column 1013A.

So I want to figure out what happened in column 1013A:

Pandas reads this column (which has finite values in it) as an all-NaN column. This happens many times in this file. It bothers me for the following reasons:

Some columns that contain data are treated as NaN columns in the pandas DataFrame.
The other columns are also indirectly affected by these erroneous NaN columns.

The column positions will be full of mistakes if this problem isn't solved.

Any advice would be appreciated!

1013A is empty in my LibreOffice – open source guy May 26 '16 at 04:58

score 2 · Answer 1 · answered May 26 '16 at 05:00

Your csv has two commas in that position:

...19,20,24,19,22,24,29,,42,39...

This empty field gets read as NaN by pandas.

It looks like your version of LibreOffice skips it and (incorrectly) uses the subsequent value.

In [11]: s = open("china_sites_20160205.csv").readlines()

In [12]: s[0].split(",")[13:18]
Out[12]: ['1011A', '1012A', '1013A', '1014A', '1015A']

In [13]: s[1].split(",")[13:18]
Out[13]: ['24', '29', '', '42', '39']
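
The same behaviour can be reproduced without the full file. The following is a minimal sketch, assuming a hypothetical five-column miniature built from the header and values in the slice above; the `sample` string and the `empty_cols` name are illustrative and not part of the original data set.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: header and values copied from the
# slice above, with an empty field (two adjacent commas) under 1013A.
sample = "1011A,1012A,1013A,1014A,1015A\n24,29,,42,39\n"

df = pd.read_csv(io.StringIO(sample))

# The empty field is parsed as a missing value, not skipped.
print(df["1013A"].iloc[0])

# The following column keeps its own value, so nothing is shifted.
print(df["1014A"].iloc[0])

# Columns whose values are all missing in this sample.
empty_cols = df.columns[df.isna().all()].tolist()
print(empty_cols)
```

On the real file, the same `isna().all()` check would list every column that has no data at all, which may help when asking the data provider what the empty fields mean.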

Andy Hayden


Thanks for your reply. So, my problem happened because of the _double comma_. Can I use `s.replace(",,", "",)` to solve it? – Han Zhengzu May 26 '16 at 08:52
The problem you're going to have is that the columns don't add up then (you'll get an error like `ValueError: Expected 1500 fields in line 2, saw 1445`). In LibreOffice you'll see that this data doesn't line up on the right-hand side (because it skips). I suspect this data is "missing" and NaN is what you want, BUT you should check with the vendor/person who produced the csv: ask what the ,, means. – Andy Hayden May 26 '16 at 16:46
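
To illustrate why removing the double comma misaligns the columns, here is a small sketch on the same five fields shown in the answer. It assumes the intent was to collapse `,,` into a single comma; the variable names are illustrative only.

```python
header = "1011A,1012A,1013A,1014A,1015A"
row = "24,29,,42,39"

# Assumption: collapse the double comma into one, dropping the empty field.
collapsed = row.replace(",,", ",")

n_header = len(header.split(","))
n_row = len(row.split(","))
n_collapsed = len(collapsed.split(","))
print(n_header, n_row, n_collapsed)  # 5 5 4

# With one field fewer, every later value moves one column to the left:
# 42 now sits under 1013A instead of 1014A.
shifted = dict(zip(header.split(","), collapsed.split(",")))
print(shifted["1013A"])  # 42
```

Keeping the empty field, and letting pandas read it as NaN, preserves the match between header names and values.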
