Convert multiple RDD rows into one Row in pyspark

Question

I have an unstructured CSV file whose lines have unequal lengths. I want to structure the file by combining lines based on marker words that appear in them. For example: if "A" is in a line, it marks the beginning of a record, and if "B" is in a line, it marks the end of the record. I have been searching for two days for a way to do this in PySpark, but I have found no solution. Any help would be great.

The content of the csv file looks like this:

STARTING |1|TH|TGG|132|8|T|Fall|
EVENT 1|56|HT|JUP||||||||
EVENT 2|BHT|987|231|||||||||||||||||
STOP|HFR|0.5|90|
STARTING |8|TH|TGG|12|8|T|Fall|
EVENT 1|6|HT|UP||||||||
EVENT 2|BT|987|31|||||||||||||||||
STOP|FR|0.5|90|

I want to have this: [image]

I have created the following function in PySpark, but it does not do the job :)

rdd_1 = sc.textFile("filename/Path")  # sc is assumed to be an existing SparkContext

def transform(line):
    # map() calls this once per input line, so the flags below are reset every time
    lines = line.replace("\n", "").replace("\r", "")
    START = False
    END = False
    New_line = []
    AGG = []

    for line in lines:
        if "STARTING" in line:
            START = True
        if "STOP" in line:
            END = True
        if START and END:
            New_line.append(line)
    AGG = ' '.join(New_line)
    return AGG

rdd_2 = rdd_1.map(transform)
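
For illustration, the grouping described in the question (collect every line from a STARTING line through the next STOP line into one row) can be sketched in plain Python as below. This is only a sketch under assumptions: the names `group_records`, `START_MARKER` and `STOP_MARKER` are made up for this example, the markers are assumed to appear at the start of a line, and the lines of a record are simply joined with a space, since the desired output format is not shown in the text.

```python
from typing import Iterable, Iterator, List

START_MARKER = "STARTING"  # assumed: a record begins on a line starting with this
STOP_MARKER = "STOP"       # assumed: a record ends on a line starting with this


def group_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one joined string per STARTING..STOP block (hypothetical helper)."""
    current: List[str] = []
    inside = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(START_MARKER):
            # A new record begins; any unfinished previous block is dropped.
            current = [line]
            inside = True
        elif inside:
            current.append(line)
            if line.startswith(STOP_MARKER):
                # Record complete: emit all of its lines as one row.
                yield " ".join(current)
                current = []
                inside = False


sample = [
    "STARTING |1|TH|TGG|132|8|T|Fall|",
    "EVENT 1|56|HT|JUP||||||||",
    "EVENT 2|BHT|987|231|||||||||||||||||",
    "STOP|HFR|0.5|90|",
    "STARTING |8|TH|TGG|12|8|T|Fall|",
    "EVENT 1|6|HT|UP||||||||",
    "EVENT 2|BT|987|31|||||||||||||||||",
    "STOP|FR|0.5|90|",
]
rows = list(group_records(sample))
for row in rows:
    print(row)

# With Spark, the same generator could be applied per partition, e.g.
#   rdd_2 = sc.textFile("filename/Path").mapPartitions(group_records)
# where sc is assumed to be an existing SparkContext. A record that straddles
# a partition boundary would be lost, so this only suits inputs where every
# STARTING..STOP block sits inside a single partition.
```

The state (the `inside` flag and the `current` buffer) lives across lines here, which is what a per-line `map` cannot provide; that is why the sketch works on an iterator of lines rather than on one line at a time.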

Possible duplicate of [creating spark data structure from multiline record](https://stackoverflow.com/questions/31227363/creating-spark-data-structure-from-multiline-record) - Alper t. Turker, Jun 14 '18 at 20:54
No, I tried the solution from "creating spark data structure from multiline record - user8371915", but it did not work for me. - Mame Silmang Diouf, Jun 15 '18 at 12:44


0 Answers