
I am trying to read a CSV file into an RDD (Spark) using Python. The problem arises when I use the split function with a comma as the delimiter. This works fine as long as no column value contains a comma. If a value does contain commas, it is split into multiple columns.

e.g.

empid, emp title, emp desc, college
123, developer, the role of developer is to develop softwares using languages such as C, C++ etc, college1

data = sc.textFile("files.csv")
empid, emp_title, emp_desc, college = line.strip().split(",")

In the example above, part of emp desc spills over into college. How can I handle commas inside a column while reading the dataset?

soya666
Sonali

1 Answer


It's not really possible to know which commas are supposed to be delimiters and which are not without additional information. Your best bet would probably be to just change the delimiter or to make sure that all non-delimiter commas are "escaped" in some way upon entry.
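As a rough illustration of the "change the delimiter" option, here is a minimal sketch using Python's standard csv module. It assumes you control how the file is written and that the pipe character never appears in the data; the helper names write_pipe_delimited and read_pipe_delimited are made up for this example.

```python
import csv
import io

rows = [
    ["empid", "emp title", "emp desc", "college"],
    ["123", "developer",
     "the role of developer is to develop softwares using languages such as C, C++ etc",
     "college1"],
]

def write_pipe_delimited(rows):
    # Hypothetical helper: serialise rows with "|" instead of "," as the delimiter.
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

def read_pipe_delimited(text):
    # Hypothetical helper: parse the text back, splitting on "|" only.
    return list(csv.reader(io.StringIO(text), delimiter="|"))

text = write_pipe_delimited(rows)
parsed = read_pipe_delimited(text)
```

Since commas are no longer delimiters, the commas inside emp desc survive the round trip untouched.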

Solution using an escape:

Provided that all non-delimiter commas are prefixed with an escape character, for example "\,", you can split on commas and then rejoin any entry that ends with the escape character \ with the entry that follows it.

line = '123, developer, the role of developer is to develop softwares using languages such as C\\, C++ etc, college1'

temp = line.strip().split(',')

# Merge every piece that ends with the escape character with the piece after it.
i = 0
while i < len(temp):
    if temp[i].endswith('\\') and i + 1 < len(temp):
        # Drop the trailing backslash and restore the literal comma.
        temp[i:i+2] = [temp[i][:-1] + ',' + temp[i+1]]
    else:
        i += 1

empid, emp_title, emp_desc, college = temp
print('empid: '+empid+'\nemp_title: '+emp_title+'\nemp_desc: '+emp_desc+'\ncollege: '+college)

output:

empid: 123
emp_title:  developer
emp_desc:  the role of developer is to develop softwares using languages such as C, C++ etc
college:  college1
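The same escape convention can probably also be handled by Python's standard csv module rather than a hand-written loop. The following is only a sketch under the assumption that the backslash is the escape character and that no quoting is used in the file; parse_escaped_line is a name invented for this example. Unlike the loop above, skipinitialspace=True also drops the space that follows each delimiter.

```python
import csv
import io

def parse_escaped_line(line):
    # Hypothetical helper: parse one line where "\," means a literal comma.
    reader = csv.reader(
        io.StringIO(line),
        delimiter=",",
        escapechar="\\",         # a backslash makes the next character literal
        quoting=csv.QUOTE_NONE,  # assumption: fields are never quoted
        skipinitialspace=True,   # ignore the space right after a delimiter
    )
    return next(reader)

line = '123, developer, the role of developer is to develop softwares using languages such as C\\, C++ etc, college1'
empid, emp_title, emp_desc, college = parse_escaped_line(line)
```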

Solution using additional information:

If, on the other hand, you can't escape the non-delimiter commas, your next best choice is to rely on additional information about the data. For example, if you are reasonably confident that only the emp_desc field will contain non-delimiter commas, you could do something like this:

temp = line.strip().split(",")
empid = temp[0]
emp_title = temp[1]
# Everything between the second field and the last one belongs to emp_desc.
emp_desc = ",".join(temp[2:-1])
college = temp[-1]
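The same idea can be written a little more compactly with the maxsplit argument of str.split and str.rsplit. This is a sketch that assumes exactly four columns and that only emp_desc may contain commas; parse_line is a hypothetical helper name, and stripping the whitespace around each field is an added choice.

```python
def parse_line(line):
    # Take the first two fields from the left; the remainder stays in one piece.
    empid, emp_title, rest = line.strip().split(",", 2)
    # Take the last field from the right; whatever is left over is emp_desc.
    emp_desc, college = rest.rsplit(",", 1)
    return tuple(part.strip() for part in (empid, emp_title, emp_desc, college))

line = '123, developer, the role of developer is to develop softwares using languages such as C, C++ etc, college1'
record = parse_line(line)

# Assumption: with the RDD from the question, this could be applied per line,
# e.g. records = data.map(parse_line)
```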
Isoloid