Context
I have a csv file with a semicolon ';' as separator:
(...)
Processing file; 67/100; id; 27209; 107; words written. Time; 4.55; s
Processing file; 68/100; id; 17103; 14; words written. Time; 4.42; s
Processing file; 69/100; id; 17090; ; ('Error, no record fetch from query for file 17090.',). Time; 0.01; s
Processing file; 70/100; id; 10947; 40; words written. Time; 4.63; s
Processing file; 71/100; id; 17937; 50; words written. Time; 4.47; s
(...)
There are also some tab characters in there.
I'd like to load this csv into a Pandas DataFrame. To do that, I'm using the pandas .read_csv() method:
import pandas as pd

file = '/path/to/file.csv'
# semicolon-separated file; skip the first line and let pandas number the columns
df = pd.read_csv(file, sep=';', skiprows=1, header=None)
# drop the purely textual columns
df.drop([0, 2, 5, 7], axis=1, inplace=True)
I drop the columns containing only textual information because I don't need them in order to further process the data.
When the middle line of the csv file (the one containing the error) is not present, everything works fine and the DataFrame looks like:
1 3 4 6
66 67/100 27209 107 4.55
67 68/100 17103 14 4.42
69 70/100 10947 40 4.63
70 71/100 17937 50 4.47
With following data types:
In [4]: df2.dtypes
Out[4]:
1 object
3 int64
4 int64
6 float64
dtype: object
Prior to the column drop, a typical line is as follows:
In [2]: df.iloc[0]
Out[2]:
0 Processing file
1 1/100
2 id
3 17889
4 17
5 words written.\tTime
6 3.1
7 s
Name: 0, dtype: object
But when this line is present I have an issue: it seems the tab characters are kept in the columns when, on one single row, there is no numeric value (i.e. the row with the Error):
1 3 4 6
66 67/100 27209 \t 107 4.55
67 68/100 17103 \t 14 4.42
68 69/100 17090 \t 0.01
69 70/100 10947 \t 40 4.63
70 71/100 17937 \t 50 4.47
With following data types:
In [8]: df.dtypes
Out[8]:
1 object
3 int64
4 object
6 float64
dtype: object
Again, prior to the column drop, a typical line is as follows:
In [2]: df.iloc[0]
Out[2]:
0 Processing file
1 1/100
2 id
3 17346
4 \t 6
5 words written.\tTime
6 7.42
7 s
This extra tab is obviously not welcome in my computations, as it changes the type of my 4th column. Of course, I can get rid of it by processing the DataFrame like so:
df[4] = df[4].replace('\t', '', regex=True)  # remove the tab characters
df[4] = df[4].str.strip()                    # strip leftover whitespace
df[4] = pd.to_numeric(df[4])                 # convert back to a numeric dtype
But then the type of column 4 is float64 and not int64, because float seems to be the only dtype that can deal with empty values or NaN, as answered here and explained here: https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na.
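For what it's worth, the nullable integer dtype described on that page looks like it would keep integers despite the missing value, though I haven't verified it on my setup and I'd still prefer a fix at read time. A rough sketch of what I mean (assuming the 'Int64' extension dtype, available since pandas 0.24):

df[4] = df[4].replace('\t', '', regex=True).str.strip()
df[4] = pd.to_numeric(df[4])   # float64 because of the NaN in the error row
df[4] = df[4].astype('Int64')  # nullable integer dtype: ints plus <NA>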
I would personally prefer getting rid of the tab character while reading the csv file, i.e. while using the .read_csv() method.
I investigated the na_values option a little, e.g. by setting na_values='\t' while reading the file, but it obviously doesn't work as the tab character is present in every row.
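For reference, this is roughly what that attempt looked like (same file and options as above):

df = pd.read_csv(file, sep=';', skiprows=1, header=None, na_values='\t')  # tabs still end up in column 4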
I also tried to define the separator as sep=';\t' and to add engine='python' while reading the file, but then the semicolon is still there even though I specified it in the sep option.
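Again for reference, a sketch of that attempt (same file as above):

df = pd.read_csv(file, sep=';\t', engine='python', skiprows=1, header=None)  # semicolons remain in the parsed fields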
Question
Why can Pandas correctly read my csv file with a bunch of tabs when the middle row is not there, but gets fooled when it is?
And how can I fix that?
Environment
Ubuntu 18.04
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
Pandas 1.0.3