I am dealing with a CSV file that contains the text of novels.

book_id, title, content
1, book title 1, All Passion Spent is written in three parts, primarily from the view of an intimate observer. 
2, Book Title 2,  In particular Mr FitzGeorge, a forgotten acquaintance from India who has ever since been in love with her, introduces himself and they form a quiet but playful and understanding friendship. It cost 3,4234 to travel. 

The text in the content column contains commas, so pandas.read_csv fails with pandas.errors.ParserError: Error tokenizing data. C error:

There are some solutions to this problem on SO, but none of them worked. I also tried reading it as a regular file and then passing the result to a DataFrame, which failed too. SO - Solution

add-semi-colons
  • Are there ever commas in the `id` or `title`? – user3483203 May 03 '18 at 15:36
  • You are getting the error because there is an extra comma in `been in love with her, introduces h` – Rakesh May 03 '18 at 15:38
  • Can you replace the first 2 commas with an unused delimiter like `@` and change the default delimiter in the csv parser? `pandas.read_csv(filename, sep='@')` and `line.replace(',', '@', 2)`. If there is a comma in the title, you'll need a regex replace to match the title. – TwistedSim May 03 '18 at 15:39
  • @chrisz there can be separators in the title – add-semi-colons May 03 '18 at 15:41
  • @Rakesh so basically an index mismatch, right? More columns than there are in the header. – add-semi-colons May 03 '18 at 15:43
  • As is, with commas in both the title and the content, you're never going to read it correctly. You'll have to recreate it with a delimiter other than a comma, such as a pipe (|), or have the values quoted to protect the embedded commas. – floydn May 03 '18 at 15:54
  • That's not a valid `csv` file. Embedded column markers should be escaped so you don't have this problem. So, you'll have to hack. – tdelaney May 03 '18 at 15:56
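
The quoting idea from the comments can be sketched as follows. This is only an illustration under assumptions that go beyond the question: the file has exactly three columns, only the last one (content) contains commas, and the helper name `requote` and the inline sample text are made up for the example.

```python
import csv
import io

# Sample text shaped like the file in the question (shortened).
raw = (
    "book_id, title, content\n"
    "1, book title 1, All Passion Spent is written in three parts, primarily from the view of an intimate observer.\n"
    "2, Book Title 2, It cost 3,4234 to travel.\n"
)


def requote(text, n_columns=3):
    """Rewrite comma-separated text so that every field is quoted.

    Assumes only the last column contains embedded commas.
    """
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first n_columns - 1 commas only.
        fields = [field.strip() for field in line.split(",", n_columns - 1)]
        writer.writerow(fields)
    return out.getvalue()


quoted = requote(raw)
# Read the quoted text back with the standard csv module to check it.
rows = list(csv.reader(io.StringIO(quoted)))
print(rows[2])
```

Once the fields are quoted, the text is valid CSV, so it can also be handed to pandas.read_csv, for example through io.StringIO(quoted) or after writing it to a new file.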

1 Answer

You can try reading your file, splitting each line with str.split(",", 2) so that only the first two commas act as separators, and then converting the result to a DataFrame.

Ex:

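import pandas as pd

filename = "books.csv"  # hypothetical path; replace with your own file

with open(filename, "r") as infile:
    # The first line holds the column names.
    header = infile.readline().strip().split(",")
    # Split each row on the first two commas only, so commas inside
    # the content column are left untouched. Blank lines are skipped.
    content = [line.strip().split(",", 2) for line in infile if line.strip()]

df = pd.DataFrame(content, columns=header)
print(df)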

Output:

  book_id          title                                            content
0       1   book title 1   All Passion Spent is written in three parts, ...
1       2   Book Title 2    In particular Mr FitzGeorge, a forgotten acq...
Rakesh
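
To see what the maxsplit argument in the answer does, and where it stops helping, here is a small sketch. The helper name `split_row` and the second sample row are made up for illustration; the first row is shortened from the question.

```python
def split_row(line, n_columns=3):
    """Split a line on the first n_columns - 1 commas and trim each field."""
    return [field.strip() for field in line.rstrip("\n").split(",", n_columns - 1)]


# Commas in the last column survive, because splitting stops after two commas.
row = "2, Book Title 2,  In particular Mr FitzGeorge, a forgotten acquaintance. It cost 3,4234 to travel."
fields = split_row(row)
print(fields[2])

# A comma inside the title is still treated as a separator, so part of the
# title ends up in the content column.
bad_row = "3, Title, With Comma, Some content"
bad_fields = split_row(bad_row)
print(bad_fields)
```

So this approach is only reliable when the id and title never contain commas. As the comments point out, if the title can contain them too, the file has to be regenerated with quoting or a different delimiter.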