
I have a DataFrame created by running sqlContext.read on a Parquet file.

The DataFrame consists of 300 million rows. I need to use these rows as input to another function, but I want to do it in smaller batches to prevent an OOM error.

Currently, I am using df.head(1000000) to read the first 1M rows, but I cannot find a way to read the subsequent rows. I tried df.collect(), but it gives me a Java OOM error.

I want to iterate over this DataFrame. I tried adding a column with the withColumn() API to generate a unique set of values to iterate over, but none of the existing columns in the DataFrame contains solely unique values.

For example, I tried val df = df1.withColumn("newColumn", df1("col") + 1) as well as val df = df1.withColumn("newColumn", lit(i += 1)), neither of which returns a sequential set of values.

Is there any other way to get the first n rows of a DataFrame and then the next n rows, something that works like the range function of SqlContext?

Ram Ghadiyaram

1 Answer


You can simply use the limit and except APIs of Dataset or DataFrame, as follows:

long count = df.count();
int limit = 50;
while (count > 0) {
    Dataset<Row> df1 = df.limit(limit);  // take the first `limit` rows
    df1.show();                          // prints 50 rows, then the next 50, etc.
    df = df.except(df1);                 // drop the rows already shown
    count = count - limit;
}
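The control flow of this loop can be sketched outside Spark with plain Python lists (this is an illustration of the idea only, not Spark code): limit corresponds to taking the first n elements, and except corresponds to removing them from the remainder. Note that Spark's except is a set difference, so unlike this list sketch it does not guarantee row order and also deduplicates rows, as the comment below points out.

```python
def batches(rows, batch_size):
    """Yield successive batches, mimicking the df.limit() + df.except() loop."""
    remaining = list(rows)
    while remaining:
        batch = remaining[:batch_size]       # like df.limit(batch_size)
        yield batch
        remaining = remaining[batch_size:]   # like df = df.except(batch)

# Seven "rows" in batches of three:
result = list(batches(range(7), 3))
# result == [[0, 1, 2], [3, 4, 5], [6]]
```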
Sandeep Purohit
  • Thanks a lot! Made small changes to get this to work: var count: Long = df.count(); val limit: Int = 50; while(count > 0){ df1 = df.limit(limit); df1.show(); //will print 50, next 50, etc rows df = df.except(df1); count = count - limit; } – newbie_learner Sep 02 '16 at 19:32
  • No, it will not; if you want to remove duplicate rows, just create a new DataFrame by using distinct on the current DataFrame. – Sandeep Purohit Jul 23 '18 at 16:24