pandas read_csv() for multiple delimiters

Question

I have a file which has data as follows

1000000 183:0.6673;2:0.3535;359:0.304;363:0.1835
1000001 92:1.0
1000002 112:1.0
1000003 154435:0.746;30:0.3902;220:0.2803;238:0.2781;232:0.2717
1000004 118:1.0
1000005 157:0.484;25:0.4383;198:0.3033
1000006 277:0.7815;1980:0.4825;146:0.175
1000007 4069:0.6678;2557:0.6104;137:0.4261
1000009 2:1.0

I want to read the file to a pandas dataframe seperated by the multiple delimeters \t, :, ;

I tried

df_user_key_word_org = pd.read_csv(filepath+"user_key_word.txt", sep='\t|:|;', header=None, engine='python')

It gives me the following error.

pandas.errors.ParserError: Error could be due to quotes being ignored when a multi-char delimiter is used.

Why am I getting this error?

So I thought I'll try to use the regex string. But I am not sure how to write a split regex. r'\t|:|;' doesn't work.

What is the best way to read a file to a pandas data frame with multiple delimiters?

@jezrael When using a capture group, the pipes are not needed (the pipe in fact becomes a delimiter). — cs95, Jan 02 '18 at 15:36
@cᴏʟᴅsᴘᴇᴇᴅ - btw, error means there is some problem in data I guess, so not this dupe, because dupe answer is same as OP use in this question :( — jezrael, Jan 02 '18 at 15:38
@cᴏʟᴅsᴘᴇᴇᴅ do you mean to say the regex string should be r'\t:;' ? I tried it and it did not seperate by data at all. `df_user_key_word_org = pd.read_csv(filepath+"user_key_word.txt", delimiter=r'\t:;', header=None, engine='python')` — user77005, Jan 02 '18 at 15:42
@cᴏʟᴅsᴘᴇᴇᴅ Thanks. However, I am still getting the `pandas.errors.ParserError: Expected 29 fields in line 11, saw 45. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.` error. I checked the data column, and there are no trailing commas or anything. Do you have any suggestion to debug it other than setting `error_bad_lines=False` because I have it appearing in too many data points? — user77005, Jan 02 '18 at 15:48
Yes, it was set to the default value. I did not change it. Still, I see the error.@jezrael I added more data to the question. Do you see any errors in the data that could cause this ParserError? — user77005, Jan 02 '18 at 15:57
"expected 29 fields in line 11, saw 45" is (I believe) not related to delimiter but rather that you have rows with variable number of columns — JohnE, Jan 02 '18 at 17:36
Possible duplicate of [Handling Variable Number of Columns with Pandas - Python](https://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python) — Tai, Jan 02 '18 at 17:42
@coldspeed in playing around with this, it seems you do need some separator like `sep="\s+|;|:"` or else you need to put in brackets like `sep="[ ;:]"`. But just plain `sep=" ;:"` did not work for me. — JohnE, Jan 02 '18 at 18:04

Tai · Accepted Answer · 2020-07-26T16:47:19.520

19

From this question, Handling Variable Number of Columns with Pandas - Python, one workaround to pandas.errors.ParserError: Expected 29 fields in line 11, saw 45. is let read_csv know about how many columns in advance.

my_cols = [str(i) for i in range(45)] # create some col names
df_user_key_word_org = pd.read_csv(filepath+"user_key_word.txt",
                                   sep="\s+|;|:",
                                   names=my_cols, 
                                   header=None, 
                                   engine="python")
# I tested with s = StringIO(text_from_OP) on my computer

Hope this works.

edited Jul 26 '20 at 16:47

answered Jan 02 '18 at 17:40

Tai

7,684
3
29
49

1

I just did something similar which I won't post b/c this seems to work fine. ;-) – JohnE Jan 02 '18 at 18:01
@JohnE I saw your comment right after I posted it :D – Tai Jan 02 '18 at 18:08
let read_csv know about how many rows in advance - but my_cols?? a little confusing, rows or cols? – SKSKSKSK Jul 26 '20 at 11:24
@SunKe thanks for catching that, it should be my_cols :) – Tai Jul 26 '20 at 16:46
I am a bit confused about why ‘my_cols’ has been defined. I think it is not being used in pd.read_csv() command. – user408108 Jul 21 '21 at 07:21
1

@user408108 there is a `names` parameter in the function. – Tai Sep 29 '21 at 20:19

pandas read_csv() for multiple delimiters

1 Answers1

Linked