I have a local text file kv_pair.log
formatted such as that key value pairs are comma delimited and records begin and terminate with a new line:
"A"="foo","B"="bar","C"="baz"
"A"="oof","B"="rab","C"="zab"
"A"="aaa","B"="bbb","C"="zzz"
I am trying to read this to a Pair RDD using pySpark as follows:
from pyspark import SparkContext
sc=sparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))
print type(pairs)
print pairs.take(2)
I feel I am close! The output of above is:
[[u'A=foo', u'B=bar', u'C=baz'], [u'A=oof', u'B=rab', u'C=zab']]
So it looks like pairs
is a list of records, which contains a list of the kv pairs as strings.
How can I use pySpark to transform this into a Pair RDD such as that the keys and values are properly separated?
Ultimate goal is to transform this Pair RDD into a DataFrame to perform SQL operations - but one step at a time, please help transforming this into a Pair RDD.