I have a question about reading and creating a dataset. I have a text file which contains:

Sunny,Hot,High,Weak,No
Sunny,Hot,High,Strong,No

and I implemented the following code:

from pyspark import SparkConf, SparkContext
import operator
import math

conf = SparkConf().setMaster("local[*]").setAppName("Lab 6")
sc = SparkContext(conf=conf)
rawData = sc.textFile("txtfile.data")
data = rawData.flatMap(lambda line: line.split(","))

Instead of getting a result like this:

[('Sunny', 'Hot', 'High', 'Weak', 'No'), ('Sunny', 'Hot', 'High', 'Strong', 'No')]

it gave me this:

['Sunny', 'Hot', 'High', 'Weak', 'No', 'Sunny', 'Hot', 'High', 'Strong', 'No']

Can anyone show me how to fix this?


2 Answers


Use map instead of flatMap.

data = rawData.map(lambda line: line.split(","))
# [['Sunny', 'Hot', 'High', 'Weak', 'No'], ['Sunny', 'Hot', 'High', 'Strong', 'No']]

# To get a list of tuples instead:
data = rawData.map(lambda line: tuple(line.split(",")))
# [('Sunny', 'Hot', 'High', 'Weak', 'No'), ('Sunny', 'Hot', 'High', 'Strong', 'No')]
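
To check the result on this small sample, you can collect the RDD back to the driver (a quick sketch; avoid collect on large datasets, since it pulls everything into driver memory):

print(data.collect())
# [('Sunny', 'Hot', 'High', 'Weak', 'No'), ('Sunny', 'Hot', 'High', 'Strong', 'No')]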
It said that I haven't reached 15 reputation to make this upvote public yet, sorry mate :( I had upvoted the moment I got your reply :'( – GunFire May 20 '20 at 13:50

flatMap is the combination of map (a transformation) and flatten: after mapping, every element of each resulting sub-array becomes its own row.

You want the map method, which produces one list of strings per input line.
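
As a minimal sketch of the difference, run both transformations on the rawData RDD from the question:

# map: one output element per input line
rawData.map(lambda line: line.split(",")).collect()
# [['Sunny', 'Hot', 'High', 'Weak', 'No'], ['Sunny', 'Hot', 'High', 'Strong', 'No']]

# flatMap: the per-line lists are flattened into a single sequence of fields
rawData.flatMap(lambda line: line.split(",")).collect()
# ['Sunny', 'Hot', 'High', 'Weak', 'No', 'Sunny', 'Hot', 'High', 'Strong', 'No']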
