I read in a CSV file whose tail looks like the sample below. It's a wide table with about 50 columns, so I didn't list them all. The line starting with H is the header, and the last line, starting with T, is the footer. I am trying to get the number from the last line (2 in this example), which is the row count.
After checking the row count, I want to delete the last line (the header line was already removed when the DataFrame was read).
Is there a way to get the number and remove the last line without converting the DataFrame back to an RDD? I saw the question "How to select last row and also how to access PySpark dataframe by index?", but I wonder whether it can be done without monotonically_increasing_id.
Many thanks for your help.
Edited: zipWithIndex is for RDDs, not DataFrames, right? I would rather not have to convert to an RDD and then back again.
H~headerString~201908~stringE
D~stringA~stringB~stringC
D~stringAA~stringBB~stringCC
T~2~stringD~footerString