Reading text file using read_csv with regex dilimiter

Question

I've been trying to read a text file university_towns.txt using read_csv, but as shown in the screenshot, on parsing the file using the regex delimiter as shown in the code below , I get an error of :

ParserError: Expected 2 fields in line 89, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

Is there any way to fix this, it seems only the double quotes are at fault at one place , also kindly explain why this is happening? I also tried to use the quotechar argument, but didn't understand how to use it.

My code for reading the file is as below:

university_towns = pd.read_csv('university_towns.txt', sep= "\s\(", engine='python', header=None)

university_towns.txt file image

Annville (Lebanon Valley College)[2]
Bethlehem (Lehigh University, Moravian College)
Bloomsburg (Bloomsburg University of Pennsylvania)[2]
Bradford (University of Pittsburgh at Bradford)
California (California University of Pennsylvania)[2]
Carlisle (Dickinson College)
Cecil B. Moore, Philadelphia, also known as "Templetown" (Temple University)
Clarion (Clarion University of Pennsylvania)[2]
Collegeville (Ursinus College)
Cresson (Mount Aloysius College)[2]
East Stroudsburg (East Stroudsburg University of Pennsylvania)[2]
Edinboro (Edinboro University of Pennsylvania)[2]
Erie (Gannon University, Mercyhurst College, Penn State Erie)
Gettysburg (Gettysburg College)[2]
Greensburg (Seton Hill University, University of Pittsburgh at Greensburg)
Grove City (Grove City College)[2]
Huntingdon (Juniata College)[2]
Indiana (Indiana University of Pennsylvania)[2]
Johnstown (University of Pittsburgh at Johnstown)
Kutztown (Kutztown University of Pennsylvania)[2]
Lancaster (Franklin & Marshall)
Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]

Above I've pasted some of the lines for the text file. Also, the last line is line 89.

Thanks - but using this text I can not verify your error. Which one is line 89? — Patrick Artner, Jun 17 '18 at 08:44
See answer, add 2 params to your readcall to get output about bad lines and fix them — Patrick Artner, Jun 17 '18 at 08:52
your last line has two `(` in it, thats why pd wants to make it 3 columns instead of 2 like all the other lines. - it looks as if that line should be 2 seperate ones, one university for West Georgia nd one for North G. & State — Patrick Artner, Jun 17 '18 at 08:53

score 0 · Answer 1 · answered Jun 17 '18 at 08:42

0

Show column 83, some others can be seen here - I think there are two ( - \s\( there. At least that's what this error message means. Other possible problem is that there is strange character there, and the parser is simply lost. I don't think it is possible with university names... Anyway - look into that line. If that is not obvious - share it with us.

answered Jun 17 '18 at 08:42

Michał Zaborowski

3,911
2
19
39

I've shared line 83 , it's the last line in the text ( I just added it now) – aspiring1 Jun 17 '18 at 08:53
Issue is that two lines are combined together. Can it be due to manual edit? – Michał Zaborowski Jun 17 '18 at 08:57

Patrick Artner · Accepted Answer · 2018-06-18T07:45:18.383

0

Your textsampe does not reproduce the error. You can get hints what causes the error by adding some params to the call:

pd.read_csv('university_towns.txt', sep= "\s\(", 
            engine='python', header=None , 
            error_bad_lines= False, warn_bad_lines = True)

See pandas.read_csv()

Using

# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

import pandas as pd

with open( 'university_towns.txt', "w") as f:
    f.write("""Annville (Lebanon Valley College)[2]
Bethlehem (Lehigh University, Moravian College)
Bloomsburg (Bloomsburg University of Pennsylvania)[2]
Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]""")

university_towns = pd.read_csv('university_towns.txt', sep= "\s\(", engine='python', header=None)

print(university_towns)

does reproduce your error. Reason being that the last line (which looks as if it should have been seperates line) includes 2 matches for your regex, hence wants to be split into 3 columns where all other rows only got 2 columns. => Error.

To fix it, split the offending line into 2:

Annville (Lebanon Valley College)[2]
Bethlehem (Lehigh University, Moravian College)
Bloomsburg (Bloomsburg University of Pennsylvania)[2]
Carrollton (University of West Georgia)[2]
Dahlonega (North Georgia College & State University)[2]

and it will work.

edited Jun 18 '18 at 07:45

answered Jun 17 '18 at 08:46

Patrick Artner

50,409
9
43
69

Can I stop this greedy splitting by any way ? So, that I can have only Carrollton in that row, and the rest eliminated, since there are other errors , such as : Springfield (American International College), (Springfield College), and (Western New England College) and I only want to grab Springfield in such a case. – aspiring1 Jun 18 '18 at 07:25
@aspiring1 No way I know of. You can specify `error_bad_lines= False` and ignore the whole line - but then you are missing 2 universitys. You could specify `comment='*'` and only loose the `Dahlonega (No..` part _if_ this is the only place where you have a `'*'` in your file - that wont fix your other errors though. You can write a preprocess script that splits the line for you before importing it to pandas... – Patrick Artner Jun 18 '18 at 07:33
@aspiring1 You can call pandas like this: `pd.read_csv('university_towns.txt', sep= "\s\(", engine='python', header=None, names=["1","2","3","4","5"]) ` ... this would fill `NaN` into your `df` so any row is filled up to 5 values, you would have to discard the ones you do not like afterwards - this wont fix the "2 universities in 1 row" case though – Patrick Artner Jun 18 '18 at 07:43
Yeah, this method looks good, I can remove the other columns after importing the data into the dataframe. Also, I' m curious as to why columns 1 to 4 take None value and column 5 takes NaN as value? – aspiring1 Jun 18 '18 at 07:54
I got clarification on Carrollton Line 89 : https://www.coursera.org/learn/python-data-analysis/discussions/weeks/4/threads/izO_OHLIEei99gpJiLeZ7g and thanks for all the answers the last method can solve my issue, another way to do it in the course discussion forum was to read using separator , sep = "\n" and then extract using str.split("(")[0]. – aspiring1 Jun 18 '18 at 08:02
@aspiring1 pd autodiscovers valuetypes - your data probably has at maximum four `' ('` in it, so all those colums get datatype string/object - you probably have no column with 5 `' ('` in it, so the column gets default numeric type [how to check dtype of column](https://stackoverflow.com/questions/22697773/how-to-check-the-dtype-of-a-column-in-python-pandas/22697903) - `None` is the "null" type fo string/object, `NaN` for numeric - feel free to [accept as answer](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) then ;) – Patrick Artner Jun 18 '18 at 08:03

Reading text file using read_csv with regex dilimiter

2 Answers2