
I'm trying to read multiple files whose names start with 'site_%'. For example, file names like site_1 and site_a. Each file has data like:

Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php

I need two columns in my pandas df: Login_id and Web.

I am facing an error when I try to read records like record 2.

df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)

I am facing the following error: ValueError: Columns must be same length as key.

Please let me know where I am making a mistake, and any good approach to solve the problem. Thanks

usr_lal123
  • You may want to take a look at: https://stackoverflow.com/questions/52428968/valueerror-columns-must-be-same-length-as-key – Pedro Lobito Jul 22 '19 at 15:14
  • Thanks Pedro. I checked. It is different problem. I am trying to read multiple files with more commas in second column. – usr_lal123 Jul 22 '19 at 15:16
  • 1
    Why do you use `|` as separator? – rafaelc Jul 22 '19 at 15:17
  • Hi rafaelc, Thought of reading it as single column and then split it into two columns. – usr_lal123 Jul 22 '19 at 15:18
  • Hm I see, so you have an uneven csv file. What do you believe the data frame should look like when you have more than one website in a row? – rafaelc Jul 22 '19 at 15:19
  • Do you create the CSV with excel? and do your files only have two columns? If so, you can use `pd.read_csv(yourfilename, usecols=range(2), sep=',', index_col=False)` that tells pandas to ignore the rest of the columns. Use `index_col=False` in case you don't want the first column to be the index, otherwise use `index_col=0`. – jottbe Jul 22 '19 at 15:19
  • @rafaelc No. For sure. I have a single website. But, the single website has commas in its url. For example, https://m.economictimes.com/wealth/ifsccode/bank-idbi-bank,state-odisha,district-angul,branch-khemala,ifsccode-IBKL0002042.cms – usr_lal123 Jul 22 '19 at 15:20
  • @jottbe. I might lose data in my second column if I use your code. – usr_lal123 Jul 22 '19 at 15:22
  • Check this: https://stackoverflow.com/a/55129746/797495 – Pedro Lobito Jul 22 '19 at 15:24
  • Yes, just read your answer. But then you definitely shouldn't use split, but something customized. – jottbe Jul 22 '19 at 15:28
  • @jottbe Yes. I suppose I have to use some regex. – usr_lal123 Jul 22 '19 at 15:29
  • @PedroLobito Please read my question. Thanks. – usr_lal123 Jul 22 '19 at 15:29
  • So if you know that the first column is always correct, and the first comma is the only one you want to split on, then you could try: `df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.partition(',')[[0,2]]` – PyPingu Jul 22 '19 at 15:31
  • Thanks PyPingu. I have huge files with millions of records in each. Any optimal way instead ? – usr_lal123 Jul 22 '19 at 15:33
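The `str.partition` idea from the comment above can be sketched like this; `df_0` is assumed to already hold the single raw column, built here from the question's sample rows. `partition` splits on the first separator only, which is exactly what uneven rows need:

```python
import pandas as pd

# Small frame matching the question's sample data (one raw column).
df_0 = pd.DataFrame({'Login_id, Web': ['1,http://www.x1.com',
                                       '2,http://www.x1.com,as.php']})

# str.partition splits on the FIRST ',' only and returns three columns:
# [before, separator, after]. Columns 0 and 2 are the pieces we want.
parts = df_0['Login_id, Web'].str.partition(',')
df_0[['Login_id', 'Web']] = parts[[0, 2]]
```

Selecting exactly two columns on the right-hand side is what avoids the "Columns must be same length as key" error from the question.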

1 Answer


Solution 1: use `split` with the arguments `n=1` and `expand=True`.

result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
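The concat step mentioned above might look like this; the `Extra` column is invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Login_id, Web': ['1,http://www.x1.com',
                                     '2,http://www.x1.com,as.php'],
                   'Extra': ['a', 'b']})  # hypothetical extra column

# Split the raw column on the first comma only.
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

# Glue the new columns onto the rest of the original frame.
df = pd.concat([df.drop(columns='Login_id, Web'), result], axis=1)
```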

EDIT Solution 2: there is a nicer regex-based solution that uses a pandas function:

result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)

This splits the field and uses the names of the matching groups to create columns with their content. The output is:

  Login_id                       URL
0        1         http://www.x1.com
1        2  http://www.x1.com,as.php

Solution 3: a conventional version with a regex. You could do something customized, e.g. with a compiled pattern:

import re

# Match everything up to the first comma, then the rest of the line.
sp_re = re.compile(r'([^,]*),(.*)')

aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]

The result on your example data is:

                Login_id, Web Login_id                       URL
0         1,http://www.x1.com        1         http://www.x1.com
1  2,http://www.x1.com,as.php        2  http://www.x1.com,as.php

Now you could drop the column 'Login_id, Web'.
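Since the question also asks about reading many `site_*` files, here is one hedged way to combine Solution 1 with multi-file reading via `glob`; the two sample files are created inline for illustration and the file names are assumptions:

```python
import glob
import pandas as pd

# Create two small sample files matching the question's data (for illustration).
for name, rows in [('site_1', ['1,http://www.x1.com']),
                   ('site_a', ['2,http://www.x1.com,as.php'])]:
    with open(name, 'w') as f:
        f.write('Login_id, Web\n' + '\n'.join(rows) + '\n')

frames = []
for path in glob.glob('site_*'):                  # every file starting with 'site_'
    raw = pd.read_csv(path, sep='|')              # '|' keeps each row in one column
    part = raw['Login_id, Web'].str.split(',', n=1, expand=True)
    part.columns = ['Login_id', 'Web']
    frames.append(part)

df = pd.concat(frames, ignore_index=True)
```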

jottbe
  • 1
    Are you sure of that regex? Shouldn't the first `*` be inside the group? Otherwise it would match only the last character of the first field... `>>> re.match(r'([^,])*,(.*)', 'abc,123').groups() ('c', '123')` while having `*` inside the first group gives the expected `('abc', '123')`. – Giacomo Alzetta Jul 22 '19 at 15:45
  • Thank you @jottbe. I am trying to read multiple files with millions of records in each. Is there any optimized solution, like having a customized function in read_csv itself, so that it can be computationally efficient? – usr_lal123 Jul 22 '19 at 23:48
  • I am not aware of another method. Have you tried this method? I guess because it is implemented in a pandas function it might be implemented in C or C++. If it is not fast enough maybe you could change the process that writes the file, so it uses `;` or `|` as separator, so the separator does not interfere with the commas in the URL column? If you can't change the writing process, you could still try to preprocess the file using commands like `sed` (if you work on a unix-like system). e.g. `sed -e's/^\([^,]*\),/\1;/g' yourfile > newfile` (version for just two columns). `sed` is quite fast. – jottbe Jul 23 '19 at 06:23
  • Oh, I just found another method :-) At first I got it wrong what `split` does when `n` is passed (I thought it worked from the end of the string, but it actually works from the beginning, as we want); I'll update my answer. So now you have a lot of methods, from which you can choose the fastest. – jottbe Jul 23 '19 at 06:35