I have a CSV file and I'm loading it as follows:

sc.textFile("market.csv").take(3)

The output is this:

['"ID","Area","Postcode","Amount"',
'"1234/232","City","8479","20000"',
'"5987/215","Metro","1111","25000"']

Splitting each line with a map operation:

sc.textFile("market.csv").map(lambda line: line.split(",")).take(3)

Gives me:

[['"ID"','"Area"','"Postcode"','"Amount"'],
['"1234/232"','"City"','"8479"','"20000"'],
['"5987/215"','"Metro"','"1111"','"25000"']]

Every value is still wrapped in double quotes inside the Python string, and everything is a string, which makes it hard to analyze the results.

I want to have an output like this:

[["ID","Area","Postcode","Amount"],
["1234/232","City",8479,20000],
["5987/215","Metro",1111,25000]]

Here the text values should be strings and the numbers should be int/double.

How can I do that? Thanks.

mah65
  • Why not use the data frame API for this? – ernest_k Aug 23 '20 at 01:20
  • Thanks mate. I have to use RDD in this case. – mah65 Aug 23 '20 at 01:28
  • If you need to use the RDD API, you have to parse the information yourself. Try `sc.textFile('market.csv').filter(lambda l: l.find('ID')<0).map(lambda l: l.replace('"', '').split(',')).map(lambda l: [l[0], l[1], int(l[2]), int(l[3])])` – ernest_k Aug 23 '20 at 01:36
  • Does this answer your question? [Spark - load CSV file as DataFrame?](https://stackoverflow.com/questions/29704333/spark-load-csv-file-as-dataframe) – Felix Aug 23 '20 at 01:40

1 Answer

With the RDD API you have to parse the lines manually. Here is one way:

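rdd = sc.textFile("market.csv")
# Strip the double quotes, then split each line into its fields
rdd = rdd.map(lambda line: line.replace('"', '').split(','))

# The header is the row whose first field is the column name 'ID'
def isHeader(row): return row[0] == 'ID'

rdd1 = rdd.filter(isHeader)
# Cast Postcode and Amount to int on the data rows only
rdd2 = rdd.filter(lambda x: not isHeader(x)).map(lambda line: [line[0], line[1], int(line[2]), int(line[3])])

rdd1.union(rdd2).collect()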


[['ID', 'Area', 'Postcode', 'Amount'],
 ['1234/232', 'City', 8479, 20000],
 ['5987/215', 'Metro', 1111, 25000]]
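
Note that removing quotes and splitting on ',' will break if a quoted field ever contains a comma. As a rough sketch of an alternative, Python's standard `csv` module can parse each line instead. The helper names `parse_line` and `cast_row` below are made up for illustration, and the cast assumes the numeric columns hold non-negative integers only:

```python
import csv

def parse_line(line):
    """Parse one CSV line into a list of fields, honoring double quotes."""
    return next(csv.reader([line]))

def cast_row(fields):
    """Convert fields made only of decimal digits to int; keep the rest as str."""
    return [int(f) if f.isdecimal() else f for f in fields]

# Sample lines copied from the question, so the sketch runs without Spark
lines = [
    '"ID","Area","Postcode","Amount"',
    '"1234/232","City","8479","20000"',
    '"5987/215","Metro","1111","25000"',
]

rows = [cast_row(parse_line(l)) for l in lines]
print(rows)
# [['ID', 'Area', 'Postcode', 'Amount'], ['1234/232', 'City', 8479, 20000], ['5987/215', 'Metro', 1111, 25000]]
```

With an RDD the same helpers could be applied per line, for example `sc.textFile("market.csv").map(parse_line).map(cast_row)`, assuming `sc` is an existing SparkContext and no field contains an embedded newline. The header row passes through unchanged because none of its fields are numeric.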
Lamanus