1

I have a csv file that contains client number, address and premium. It's exported from another system, so the fields have quotes. While importing in R, I face an issue due to unwanted quotes and commas inside string. See example (Client 1 and 2 are correct, client 3 has the issue)

Client number Address                               Premium
"1"            "Building5, Street 30,NY"              1000
"2"           "Building7, Street 10,NY"               1000
"3"           "Building 7\", Street 10,NY"            1000

Because of this , R reads it as a new 4th column which I don't want. How do I get rid of it programmatically. I am ok with an R or Python based solution that cleans up the csv . If correcting the problem is difficult, even deleting client 3 is an acceptable solution

Tried doing this in python but doesn't help

import csv

def remove_special_prob(s):
return ''.join(c for c in s if c not in ('\"'))


with open("Client.csv","rb") as infile, open("Client_new.csv","wb") as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
for line in reader:
    writer.writerow([remove_special_prob(elem) for elem in line])
sourav
  • 179
  • 1
  • 1
  • 14

3 Answers3

1

If it's in the actual csv, then use return s.replace('\"', '').

Your code isn't working because you are iterating over each letter, but then comparing it so 2 letters, viz., \" together. Basically, your c will become \ first and then " over the iteration, but would never equal \".

Edit

For replacing -

def remove_special_prob(s):
    return s.replace('\"', '')

For skipping special problem lines

with open("Client.csv","rb") as infile, open("Client_new.csv","wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for line in reader:
         if line.count('"') > 6:
             continue
         writer.writerow(line)
shad0w_wa1k3r
  • 12,955
  • 8
  • 67
  • 90
  • Doesn't seem to solve the problem. There are few rows that are extremely troublesome. Is there a way to do this - I know the number of columns (3) . Therefore expected number of quotes should be 6 in a valid record. How can I remove all lines from the csv that have >6 quotes (") ? – sourav Mar 21 '17 at 13:22
  • Skip the line with a `continue` when `line.count('"') > 6`. – shad0w_wa1k3r Mar 21 '17 at 13:26
  • `with open("Client.csv","rb") as infile, open("Client_new.csv","wb") as outfile: filtered_csv = (line for line in infile if not line.count('"') > 42) reader = csv.reader(filtered_csv) writer = csv.writer(outfile, quoting=csv.QUOTE_ALL) for line in reader: writer.writerow()` tried this but does seem to work (sorry beginner in python) – sourav Mar 21 '17 at 13:46
  • Updated my answer for that, kindly check. – shad0w_wa1k3r Mar 21 '17 at 13:57
1

In Python you can use pandas's read_csv which is quite versatile

import pandas as pd
from io import StringIO

text = StringIO("""Client number\tAddress\tPremium
"1"\t"Building5, Street 30,NY"\t1000
"2"\t"Building7, Street 10,NY"\t1000
"3"\t"Building 7\", Street 10,NY"\t1000""")

df = pd.read_csv(text, sep='\t')
df['Address'] = df['Address'].str.replace('"', '')
# df.to_clipboard()

results in

    Client number   Address Premium
0   1   Building5, Street 30,NY 1000
1   2   Building7, Street 10,NY 1000
2   3   Building 7, Street 10,NY    1000
Maarten Fabré
  • 6,938
  • 1
  • 17
  • 36
  • The challenge is I need to import this csv to R to run some pre-written code. If I import to pandas, I'll have to write back to csv and import back in R . Now my address field already has multiple commas (that are valid). Exporting to csv again will mess it up due to these commas – sourav Mar 21 '17 at 13:07
  • you can have python export it as csv with `;` as seperator – Maarten Fabré Mar 22 '17 at 08:44
0

In R, you can read the file using readLines, remove the string \", and convert it to a data frame:

txt <- readLines("client.csv")
df <- read.csv(text=gsub('\\\\\"', "", txt))
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • Tried readlines but the second step takes an enormous amount of time My code was read.table(textConnection(txt)) – sourav Mar 21 '17 at 10:58
  • It takes the same amount of time as it would to read the file directly. If you want faster alternatives to `read.csv`, ask another question. – Hong Ooi Mar 21 '17 at 10:59