I have an unstructured CSV file whose lines are of unequal length. I want to restructure the file by combining lines based on marker words that appear in them. For example: if "A" is in a line, that marks the beginning of a record, and if "B" is in a line, that marks the end of the record. I have been searching for two days for a way to do this in PySpark, but I have not found a solution. Any help would be great.
The content of the csv file looks like this:
STARTING |1|TH|TGG|132|8|T|Fall|
EVENT 1|56|HT|JUP||||||||
EVENT 2|BHT|987|231|||||||||||||||||
STOP|HFR|0.5|90|
STARTING |8|TH|TGG|12|8|T|Fall|
EVENT 1|6|HT|UP||||||||
EVENT 2|BT|987|31|||||||||||||||||
STOP|FR|0.5|90|
I want to get this: [image: desired output]
I have written the following function in PySpark, but it does not do the job :)
# assumption: sc is an existing SparkContext
rdd_1 = sc.textFile("filename/Path")

def transform(line):
    # strip line-break characters
    lines = line.replace("\n", "").replace("\r", "")
    START = False
    END = False
    New_line = []
    AGG = []
    for line in lines:
        if "STARTING" in line:
            START = True
        if "STOP" in line:
            END = True
        if START and END:
            New_line.append(line)
    # join everything collected so far into one string
    AGG = ' '.join(New_line)
    return AGG

rdd_2 = rdd_1.map(transform)
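
To make the intended grouping concrete, here is a minimal sketch in plain Python, so it can be checked without a cluster. The helper name combine_blocks, the sep parameter, and matching the markers with startswith (rather than "in") are assumptions for illustration, not part of the attempt above. The exact separator in the desired output is not known from the text, so a single space is used, as in the attempt.

```python
def combine_blocks(lines, start="STARTING", stop="STOP", sep=" "):
    """Yield one joined string per start..stop block of lines.

    Hypothetical helper: names and defaults are assumptions.
    """
    block = None  # None means we are currently outside a block
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(start):
            block = [line]            # open a new block
        elif block is not None:
            block.append(line)        # line belongs to the open block
        if block is not None and line.startswith(stop):
            yield sep.join(block)     # close the block and emit it
            block = None


# Sample rows copied from the file content shown in the question
sample = [
    "STARTING |1|TH|TGG|132|8|T|Fall|",
    "EVENT 1|56|HT|JUP||||||||",
    "EVENT 2|BHT|987|231|||||||||||||||||",
    "STOP|HFR|0.5|90|",
    "STARTING |8|TH|TGG|12|8|T|Fall|",
    "EVENT 1|6|HT|UP||||||||",
    "EVENT 2|BT|987|31|||||||||||||||||",
    "STOP|FR|0.5|90|",
]

combined = list(combine_blocks(sample))
for row in combined:
    print(row)
```

A plain rdd.map sees one line at a time, so it cannot join neighbouring lines by itself. One possible way to apply the helper in PySpark, assuming each file fits in memory on a single executor, would be sc.wholeTextFiles(path).flatMap(lambda kv: combine_blocks(kv[1].splitlines())), where path is a placeholder for the input location.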