1

Context

I have a csv file with a semicolon ';' as separator:

(...)
Processing file;  67/100; id;    27209;   107; words written.   Time;   4.55; s
Processing file;  68/100; id;    17103;    14; words written.   Time;   4.42; s
Processing file;  69/100; id;    17090;      ; ('Error, no record fetch from query for file 17090.',).  Time;   0.01; s
Processing file;  70/100; id;    10947;    40; words written.   Time;   4.63; s
Processing file;  71/100; id;    17937;    50; words written.   Time;   4.47; s
(...)

There are also some tab characters in there.

I'd like to load this csv in a Pandas DataFrame.

To do that, I'm using pandas .read_csv() method;

import pandas as pd
file = '/path/to/file.csv'
df = pd.read_csv(file, sep=';', skiprows=1, header=None)
df.drop([0,2,5,7], axis=1, inplace=True)    

I drop the columns containing only textual information because I don't need them in order to further process the data.

When the middle line of the csv file (the one containing the error) is not present, everything is working fine and the DataFrame looks like:

           1      3    4     6
66    67/100  27209  107  4.55
67    68/100  17103   14  4.42
69    70/100  10947   40  4.63
70    71/100  17937   50  4.47

With following data types:

In [4]: df2.dtypes
Out[4]: 
1     object
3      int64
4      int64
6    float64
dtype: object

Prior to the column drop, a typical line is as follow:

In [2]: df.iloc[0]
Out[2]: 
0                 Processing file
1                           1/100
2                              id
3                           17889
4                              17
5            words written.\tTime 
6                             3.1
7                               s
Name: 0, dtype: object

But when this line is present, I've got an issue as it seems the tab characters in kept in the columns when on one single row there is no numeric value (i.e. the row with the Error);

           1      3        4     6
66    67/100  27209  \t  107  4.55
67    68/100  17103  \t   14  4.42
68    69/100  17090  \t       0.01
69    70/100  10947  \t   40  4.63
70    71/100  17937  \t   50  4.47

With following data types:

In [8]:df.dtypes
Out[8]: 
1     object
3      int64
4     object
6    float64
dtype: object

Again, prior to the column drop, a typical line is as follow:

In [2]: df.iloc[0]
Out[2]: 
0              Processing file
1                        1/100
2                           id
3                        17346
4                      \t    6
5         words written.\tTime 
6                         7.42
7                            s

This extra tab is obviously not welcome in my computations as it changes the type of my 4-th column. Of course, I can get rid of it by processing the DataFrame such as;

df[4] = df[4].replace('\t','', regex=True)
df[4] = df[4].str.strip()
df[4] = pd.to_numeric(df[4])

But then the type of the column 4 if float64 and not int64 because float seems to be the only one to deal with empty value or NaN as answered here and explained here; https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na.

I would personally prefer getting rid of the tab character while reading the csv file, i.e. while using .read_csv() method.
I investigated a little the na_values option, e.g. by setting na_values='\t' while reading the file but it obviously doesn't work as the tab character is present in every row.
I also tried to define the separator as sep=';\t' and adding engine=python while reading the file, but then the semicolon is still there even if I specified it in the sep option.

Question

Why can Pandas correctly read my csv file with a bunch of tabs when the middle rows is not there, and it get fooled when it's there?
And how to fix that?

Environment

Ubuntu 18.04
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
Pandas '1.0.3'

swiss_knight
  • 5,787
  • 8
  • 50
  • 92
  • 1
    add the tab to ur separator as ```'s``` : ```df = pd.read_csv(file, sep= r'[;\s{4}]', skiprows=1, header=None, engine = 'python') ``` and see if works. i possibly bungled the regex, but the idea is to combine the tab and semicolon... tab is four spaces. – sammywemmy Apr 22 '20 at 08:42
  • Hmm, I've tried it, not bad indeeed. But it does insert an extra column `[4]` with only `NaN` (in place of the tab I guess), hence increasing the number of columns in the DataFrame from 8 to 9. Is there a way to, let's say, concatenate the semicolon+tab here so that I end up with the same number of columns at the end? – swiss_knight Apr 22 '20 at 09:07
  • combining the tab with semicolon could work they appear at every point. the trick is somewhere in that separator regex. is the tab a literal tab (```t```) or actually four spaces? – sammywemmy Apr 22 '20 at 09:19
  • 1
    It's a literal tab that was added by another script as `'\t'` (as part of a string). I actually replaced your `\s{4}` by `\t` to fit my situation. – swiss_knight Apr 22 '20 at 09:25
  • kk. cool. so, does ```;``` and ```'\t'``` appear at the same place on every row? combined are they the delimiter? – sammywemmy Apr 22 '20 at 09:28
  • No, only `;` is truly a delimiter. The tab `\t` may occur several times in some single columns as part of strings and it should normally not be considered as a delimiter. – swiss_knight Apr 22 '20 at 09:39
  • that makes it somewhat complicated. I have run out of ideas. – sammywemmy Apr 22 '20 at 09:45

0 Answers0