I have a question about reading and creating a dataset. I have a text file which contains:

Sunny,Hot,High,Weak,No
Sunny,Hot,High,Strong,No

and I implemented the following code:

from pyspark import SparkConf, SparkContext
import operator
import math

conf = SparkConf().setMaster("local[*]").setAppName("Lab 6")
sc = SparkContext(conf=conf)
rawData = sc.textFile("txtfile.data")
data = rawData.flatMap(lambda line: line.split(","))

Instead of getting a result like this:

[('Sunny', 'Hot', 'High', 'Weak', 'No'), ('Sunny', 'Hot', 'High', 'Strong', 'No')]

it gave me this:

['Sunny', 'Hot', 'High', 'Weak', 'No', 'Sunny', 'Hot', 'High', 'Strong', 'No']

Can anyone show me how to fix this?


2 Answers


Use map instead of flatMap.

data = rawData.map(lambda line: line.split(","))
# [['Sunny', 'Hot', 'High', 'Weak', 'No'], ['Sunny', 'Hot', 'High', 'Strong', 'No']]

# To get a list of tuples instead:
data = rawData.map(lambda line: tuple(line.split(",")))
# [('Sunny', 'Hot', 'High', 'Weak', 'No'), ('Sunny', 'Hot', 'High', 'Strong', 'No')]
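
To check the result on this small sample, you can collect the RDD back to the driver (a quick sketch; avoid collect on large datasets, since it pulls everything into driver memory):

print(data.collect())
# [('Sunny', 'Hot', 'High', 'Weak', 'No'), ('Sunny', 'Hot', 'High', 'Strong', 'No')]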
It said that I haven't reached 15 reputation to make this upvote public yet, sorry mate :( I had upvoted the moment I got your reply :'( – GunFire May 20 '20 at 13:50

flatMap is the combination of map (a transformation) and flatten: after mapping, every element of each resulting sub-array becomes its own row.

You want the map method, which produces one list of strings per input line.
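
As a minimal sketch of the difference, run both transformations on the rawData RDD from the question:

# map: one output element per input line
rawData.map(lambda line: line.split(",")).collect()
# [['Sunny', 'Hot', 'High', 'Weak', 'No'], ['Sunny', 'Hot', 'High', 'Strong', 'No']]

# flatMap: the per-line lists are flattened into a single sequence of fields
rawData.flatMap(lambda line: line.split(",")).collect()
# ['Sunny', 'Hot', 'High', 'Weak', 'No', 'Sunny', 'Hot', 'High', 'Strong', 'No']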
