How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

Question

I need to create a data frame by reading in data from a file, using read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), other are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem if I use Python. I use:

for line in file(file_name):
   fld = line.split()

And it works perfect. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?

score 176 · Accepted Answer · edited May 03 '20 at 00:25

176

From the documentation, you can use either a regex or delim_whitespace:

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print repr(line)
...     
'a\t  b\tc 1 2\n'
'd\t  e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4

edited May 03 '20 at 00:25

AMC

2,642
7
13
35

answered Feb 22 '13 at 14:51

DSM

342,061
65
592
494

6

You can also use `skipinitialspace` to skip initial space – jarondl Dec 04 '14 at 14:28
how does it work, when i have a `txt.file` and the Format that I have a number followed by at least two whitespaces? the formula `delimiter=r"\d\d\s\s+`does not work – PV8 May 23 '19 at 07:49
@PV8: in that case your delimiter is still only `\s+` or `\s{2,}` . Don't include an expression for the field `\d\d` in the delimiter! Delimiter is only the delimiter! Field is everything between delimiters (or quotes). – smci Feb 11 '20 at 00:35
How to access the single element from the list? If i do like this df = pd.read_csv("whitespace.csv", header=None, delim_whitespace=True), print(df[0]) only prints the first element 0 but it doesnt print next element if i do print(df[1]) isnt it a list? – Sniper Jan 06 '21 at 04:54
'r' is redundant in delimiter=r"\s+", you can use delimiter="\s+" – Shevach Riabtsev Oct 25 '22 at 15:57

score 21 · Answer 2 · answered Dec 25 '16 at 11:32

21

>>> pd.read_csv("whitespace.csv", header = None, sep = "\s+|\t+|\s+\t+|\t+\s+")

would use any combination of any number of spaces and tabs as the separator.

answered Dec 25 '16 at 11:32

Peaceful

4,920
15
54
79

Gerben · Answer 3 · 2017-11-04T14:53:41.603

4

Pandas has two csv readers, only is flexible regarding redundant leading white space:

pd.read_csv("whitespace.csv", skipinitialspace=True)

while one is not

pd.DataFrame.from_csv("whitespace.csv")

Neither is out-of-the-box flexible regarding trailing white space, see the answers with regular expressions. Avoid delim_whitespace, as it also allows just spaces (without , or \t) as separators.

edited Nov 04 '17 at 14:53

answered Nov 04 '17 at 14:37

Gerben

93
8

score 2 · Answer 4 · answered Oct 27 '17 at 04:26

2

We may consider this to take care of all the combination and zero or more occurrences.

pd.read_csv("whitespace.csv", header = None, sep = "[ \t]*,[ \t]*")

answered Oct 27 '17 at 04:26

yoonghm

4,198
1
32
48

How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

4 Answers4

Linked

Related