
I have a number of .csv files that I download into a directory.

Each .csv is supposed to have 3 columns of information. The head of one of these files looks like:

17/07/2014,637580,10.755
18/07/2014,61996,10.8497
21/07/2014,126758,10.8208
22/07/2014,520926,10.8201
23/07/2014,370843,9.2883

The code that I am using to read the .csv into a dataframe (df) is:

df = pd.read_csv(adj_directory + '\\' + filename, error_bad_lines=False, names=['DATE', 'PX', 'RAW'])

Where I name the three columns (DATE, PX and RAW).

This works fine when the file is formatted correctly. However, I have noticed that sometimes the .csv has a slightly different format and can look like this, for example:

09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607

where a column value is missing and an extra comma appears in its place. This means that the file fails to load into the dataframe (df ends up empty).

Is there a way to read the data into a dataframe despite the extra comma, treating the missing value as NaN, so that df would look like:

09/07/2014,26268315,NaN
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,NaN
15/07/2014,205019,10.8607
Stacey
  • Is this what you're looking for? https://stackoverflow.com/questions/10867028/get-pandas-read-csv-to-read-empty-values-as-empty-string-instead-of-nan – user1394 Oct 29 '18 at 19:43
  • If it only happens in the top row, it can be worked around by not specifying names when reading, then keeping the first three columns with `df = df.iloc[:,:3]; df.columns = ['DATE', 'PX', 'RAW']` (see the sketch after these comments); if not, it is a bit more tedious. However, on my machine the whole file reads in exactly as you specify if I set `error_bad_lines=False` and the bad line is not the first – Uvar Oct 29 '18 at 19:48
  • Thanks, the error is not always on the top row – Stacey Oct 29 '18 at 19:50
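
A minimal sketch of the workaround from the comment above, assuming the four-field row happens to come first so pandas infers four columns ('data.csv' is a placeholder path):

import pandas as pd

# Read without names; if the first row carries the extra comma, pandas
# infers four columns and pads the shorter rows with NaN
df = pd.read_csv('data.csv', header=None)

# Keep the first three columns and label them
df = df.iloc[:, :3]
df.columns = ['DATE', 'PX', 'RAW']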

2 Answers


Probably best to fix the file upstream so that missing values aren't written with an extra comma. But if necessary you can correct the file in Python by replacing `,,` with just `,`, line by line. Taking your bad file as test.csv:

import re
import csv

# Collapse the doubled comma left by the missing value into a single comma
patt = re.compile(r",,")

# The with block closes both files automatically; no explicit close() needed
with open('test.csv') as f, open('corrected.csv', 'w') as f2:
    for row in csv.reader(patt.sub(',', line) for line in f):
        f2.write(','.join(row))
        f2.write('\n')

Output: corrected.csv

09/07/2014,26268315,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,
15/07/2014,205019,10.8607

Then you should be able to read in this file without issue:

import pandas as pd

df = pd.read_csv('corrected.csv', names=['DATE', 'PX', 'RAW'])

         DATE        PX      RAW
0  09/07/2014  26268315      NaN
1  10/07/2014   6601181  16.3857
2  11/07/2014    916651  12.5879
3  14/07/2014    213357      NaN
4  15/07/2014    205019  10.8607
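
If you'd rather not write corrected.csv to disk, here is a sketch of the same idea done in memory, feeding the cleaned text to pandas through a StringIO buffer:

import io
import re
import pandas as pd

patt = re.compile(r",,")

# Apply the same ,, -> , substitution line by line, then hand the
# cleaned text straight to read_csv instead of a temporary file
with open('test.csv') as f:
    cleaned = ''.join(patt.sub(',', line) for line in f)

df = pd.read_csv(io.StringIO(cleaned), names=['DATE', 'PX', 'RAW'])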
ALollz

Had this problem yesterday. Have you tried:

pd.read_csv(adj_directory + '\\' + filename,
            error_bad_lines=False, names=['DATE', 'PX', 'RAW'],
            keep_default_na=False,
            na_values=[''])
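
As a side note, error_bad_lines was deprecated in pandas 1.3 in favour of on_bad_lines; on a newer pandas the equivalent call would be something like:

import pandas as pd

# pandas >= 1.3: on_bad_lines='skip' replaces error_bad_lines=False
df = pd.read_csv(adj_directory + '\\' + filename,
                 on_bad_lines='skip',
                 names=['DATE', 'PX', 'RAW'],
                 keep_default_na=False,
                 na_values=[''])
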
  • thanks, but when I print the df I still get an empty dataframe: `Columns: [DATE, PX, RAW] Index: []` – Stacey Oct 29 '18 at 21:23