4

I currently have the following data.csv which has a comma delimiter:

name,day
Chicken Sandwich,Wednesday
Pesto Pasta,Thursday
Lettuce, Tomato & Onion Sandwich,Friday
Lettuce, Tomato & Onion Pita,Friday
Soup,Saturday

The parser script is:

import pandas as pd


df = pd.read_csv('data.csv', delimiter=',', error_bad_lines=False, index_col=False)
print(df.head(5))

The output is:

Skipping line 4: expected 2 fields, saw 3
Skipping line 5: expected 2 fields, saw 3

               name        day
0  Chicken Sandwich  Wednesday
1       Pesto Pasta   Thursday
2              Soup   Saturday

How do I handle the case Lettuce, Tomato & Onion Sandwich. Each item should be separated by , but it's possible that an item has a comma in it followed by a space. The desired output is:

                               name        day
0                  Chicken Sandwich  Wednesday
1                       Pesto Pasta   Thursday
2  Lettuce, Tomato & Onion Sandwich     Friday
3      Lettuce, Tomato & Onion Pita     Friday
4                              Soup   Saturday
Thierry Lam
  • 45,304
  • 42
  • 117
  • 144
  • That's a tough one. How do you suggest pandas should know whether the comma is within a field or outside a field? – jpp Feb 27 '18 at 16:59
  • You may need to do some csv pre-processing, see https://stackoverflow.com/questions/14550441/problems-reading-csv-file-with-commas-and-characters-in-pandas – jpp Feb 27 '18 at 17:10

2 Answers2

1

An alternative that works in other situations too. OK, it's ugly.

import pandas as pd
from io import StringIO

for_pd = StringIO()
with open('theirry.csv') as input:
    for line in input:
        line = line.rstrip().replace(', ', '|||').replace(',', '```').replace('|||', ', ').replace('```', '|')
        print (line, file=for_pd)
for_pd.seek(0)

df = pd.read_csv(for_pd, sep='|')

print (df)

Result:

                               name        day
0                  Chicken Sandwich  Wednesday
1                       Pesto Pasta   Thursday
2  Lettuce, Tomato & Onion Sandwich     Friday
3      Lettuce, Tomato & Onion Pita     Friday
4                              Soup   Saturday
Bill Bell
  • 21,021
  • 5
  • 43
  • 58
0

This might help.

import pandas as pd
p = "PATH_TO.csv"
df = pd.read_csv(p, delimiter='(,(?=\S)|:)')
#print(df.head(5))
print "-----"
print df["name"]
print "-----"
print df["day"]

Output:

-----
0                    Chicken Sandwich
1                         Pesto Pasta
2    Lettuce, Tomato & Onion Sandwich
3        Lettuce, Tomato & Onion Pita
4                                Soup
Name: name, dtype: object
-----
0    Wednesday
1     Thursday
2       Friday
3       Friday
4     Saturday
Name: day, dtype: object
Rakesh
  • 81,458
  • 17
  • 76
  • 113