
The dataset is in CSV format. Each line contains a record whose first element is the name of a plant and whose remaining elements are the states in which the plant is found.

Example:

abelia,fl,nc
abelia x grandiflora,fl,nc
abelmoschus,ct,dc,fl,hi,il,ky,la,md,mi,ms,nc,sc,va,pr,vi

How do I parse this and create a DataFrame with two columns, plant and states, where states contains everything after the plant name as a list?

KAY_YAK
  • What have you tried? Where are you getting stuck? Can you share some code with us? What does your desired output look like? [How to make good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Mar 15 '18 at 17:53
  • I am stuck at parsing. Map is useless here since it will produce output of different lengths. I loaded it as rdd then tried parsing using map. – KAY_YAK Mar 15 '18 at 18:03
  • What is the desired output? Do you want a dataframe with 51 columns (plant name plus 1 for each state) or 2 columns? In both cases, especially if it's the latter, map is not useless. Please also share the code you've tried. – pault Mar 15 '18 at 18:13
  • Just want 2 columns. One for plant and other for states – KAY_YAK Mar 15 '18 at 19:18

1 Answer


You can do this either with rdd.map() or with DataFrames and a udf():

RDD

First create a sample dataset:

text = """abelia,fl,nc
abelia x grandiflora,fl,nc
abelmoschus,ct,dc,fl,hi,il,ky,la,md,mi,ms,nc,sc,va,pr,vi"""

rdd = sc.parallelize([(x,) for x in text.split("\n")])
rdd.toDF(["rawText"]).show(truncate=False)
#+--------------------------------------------------------+
#|rawText                                                 |
#+--------------------------------------------------------+
#|abelia,fl,nc                                            |
#|abelia x grandiflora,fl,nc                              |
#|abelmoschus,ct,dc,fl,hi,il,ky,la,md,mi,ms,nc,sc,va,pr,vi|
#+--------------------------------------------------------+
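As an aside, if your dataset is already in a file (the path below is hypothetical), you can build the RDD directly with sc.textFile(), which yields one plain string per line, so there is no need to wrap anything in 1-tuples:

# Hypothetical path; each element is a string, not a 1-tuple,
# so the first map() splits the line itself
rdd_from_file = sc.textFile("plants.csv")
rdd_from_file.map(lambda line: line.split(','))\
    .map(lambda x: (x[0], x[1:]))\
    .toDF(["plant", "states"])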

Now use map() twice: the first call splits each record into a list on the comma; the second maps that list into a tuple of the form (x[0], x[1:]):

rdd.map(lambda x: x[0].split(','))\
    .map(lambda x: (x[0], x[1:]))\
    .toDF(["plant", "states"])\
    .show(truncate=False)
#+--------------------+------------------------------------------------------------+
#|plant               |states                                                      |
#+--------------------+------------------------------------------------------------+
#|abelia              |[fl, nc]                                                    |
#|abelia x grandiflora|[fl, nc]                                                    |
#|abelmoschus         |[ct, dc, fl, hi, il, ky, la, md, mi, ms, nc, sc, va, pr, vi]|
#+--------------------+------------------------------------------------------------+

You could also have done this in one call to map(), but I split it in two for readability.
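A single-map sketch of the same logic would look like this (note the split gets computed twice):

rdd.map(lambda x: (x[0].split(',')[0], x[0].split(',')[1:]))\
    .toDF(["plant", "states"])\
    .show(truncate=False)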

DataFrame

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

df = sqlCtx.createDataFrame([(x,) for x in text.split("\n")], ["rawText"])

# define udf to split a string on comma and return all
# of the elements except the first one
get_states = f.udf(lambda x: x.split(',')[1:], ArrayType(StringType()))

df.withColumn('plant', f.split('rawText', ',')[0])\
    .withColumn('states', get_states('rawText'))\
    .select('plant', 'states')\
    .show(truncate=False)
#+--------------------+------------------------------------------------------------+
#|plant               |states                                                      |
#+--------------------+------------------------------------------------------------+
#|abelia              |[fl, nc]                                                    |
#|abelia x grandiflora|[fl, nc]                                                    |
#|abelmoschus         |[ct, dc, fl, hi, il, ky, la, md, mi, ms, nc, sc, va, pr, vi]|
#+--------------------+------------------------------------------------------------+
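If you are on Spark 2.4 or later, you can also avoid the udf entirely with the built-in slice() SQL function (a sketch; the parts column is just an intermediate alias):

df.withColumn('parts', f.split('rawText', ','))\
    .select(
        f.col('parts')[0].alias('plant'),
        f.expr('slice(parts, 2, size(parts) - 1)').alias('states')
    )\
    .show(truncate=False)

Native functions avoid Python serialization overhead, so this will generally outperform the udf version.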
pault