
I need to read a file line by line, split each line into words, and perform operations on the words.

How do I do that?

I wrote the code below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

logFile = "/home/hadoop/spark-2.3.1-bin-hadoop2.7/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp1").getOrCreate()
logData = spark.read.text(logFile).cache()
logData.printSchema()
logDataLines = logData.collect()

# The line variable below seems to be of type Row. How do I perform similar operations on a Row, or how do I convert a Row to a string?

for line in logDataLines:
    words = line.select(explode(split(line,"\s+")))
    for word in words:
        print(word)
    print("----------------------------------")
adev
  • By using `collect()` you will collect all the data on the driver node, i.e. if you do it like that there is no need to use Spark at all. This question shows how to split a column in a dataframe and explode it: https://stackoverflow.com/questions/38210507/explode-in-pyspark – Shaido Aug 28 '18 at 01:52
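
As the comment's linked question shows, the loop over collected rows can be replaced by a split-and-explode on the DataFrame itself, so the work stays distributed instead of happening on the driver. A minimal sketch, assuming the logData DataFrame from the question above (spark.read.text produces a single string column named "value"):

from pyspark.sql.functions import explode, split

# Split each line on whitespace, then explode the resulting array so
# that every word becomes its own row
words = logData.select(explode(split(logData["value"], r"\s+")).alias("word"))
words.show()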

1 Answer


I think you should apply a map function to your rows. You can do anything you like to each row inside the function you define:

data = spark.read.text("/home/spark/test_it.txt").cache()

def someFunction(row):
    # Each row is a Row with a single column; row[0] is the line's text
    wordlist = row[0].split(" ")
    result = list()
    for word in wordlist:
        result.append(word.upper())
    return result

# Map every line (Row) of the underlying RDD to its list of uppercased words
data.rdd.map(someFunction).collect()

Output:

[[u'THIS', u'IS', u'JUST', u'A', u'TEST'], [u'TO', u'UNDERSTAND'], [u'THE', u'PROCESSING']]
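
If you want a single flat list of words instead of one list per line, flatMap does the flattening for you. A sketch under the same assumptions as above:

# flatMap flattens the per-line word lists into one stream of words
data.rdd.flatMap(someFunction).collect()
# e.g. [u'THIS', u'IS', u'JUST', u'A', u'TEST', u'TO', u'UNDERSTAND', ...]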
gaw