How to get last row data from Pyspark dataframe and then delete it

Asked Jul 16 '19 at 07:27

Active Jul 16 '19 at 09:56

Viewed 1,483 times

I read in a CSV file whose contents look like the sample below. It's a wide table with about 50 columns, so I haven't listed them all. The line starting with H is the header, and the last line, starting with T, is the footer. I am trying to get the number from the last line (2 in this example), which is the row count. After checking the row count I want to delete the last line (the header line was already removed when the dataframe was read).

Is there a way of getting the number and removing the last line without converting the dataframe back to an RDD? I saw the question "How to select last row and also how to access PySpark dataframe by index?" but wonder whether it can be done without monotonically_increasing_id. Many thanks for your help.

Edited: zipWithIndex is for RDDs, not dataframes, right? I hope to avoid converting to an RDD and back again.

H~headerString~201908~stringE
D~stringA~stringB~stringC
D~stringAA~stringBB~stringCC
T~2~stringD~footerString

edited Jul 16 '19 at 09:56

asked Jul 16 '19 at 07:27

user4046073

Everything that can be done is described in that answer. Like zero323 says, you can use zipWithIndex... – eliasah Jul 16 '19 at 08:47
This still doesn't change the answer. – eliasah Jul 16 '19 at 11:53
To get the row count `2`, use `df.where('c1 == "T"').first().c2`, where c1 and c2 are the column names of the 1st and 2nd fields in your sample data. To filter out the footer, just use `df.where('c1 != "T"')`. – jxc Jul 16 '19 at 12:34

0 Answers