DataFrame-ified zipWithIndex

Question

I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RDD.zipWithIndex. On the other hand, the following works more or less the way I want it to:

val origDF = sqlContext.load(...)    

val seqDF= sqlContext.createDataFrame(
    origDF.rdd.zipWithIndex.map(ln => Row.fromSeq(Seq(ln._2) ++ ln._1.toSeq)),
    StructType(Array(StructField("seq", LongType, false)) ++ origDF.schema.fields)
)

In my actual application, origDF won't be loaded directly out of a file -- it is going to be created by joining 2-3 other DataFrames together and will contain upwards of 100 million rows.

Is there a better way to do this? What can I do to optimize it?

score 39 · Answer 1 · edited Nov 11 '16 at 08:04

The following was posted on behalf of the David Griffin (edited out of question).

The all-singing, all-dancing dfZipWithIndex method. You can set the starting offset (which defaults to 1), the index column name (defaults to "id"), and place the column in the front or the back:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.Row


def dfZipWithIndex(
  df: DataFrame,
  offset: Int = 1,
  colName: String = "id",
  inFront: Boolean = true
) : DataFrame = {
  df.sqlContext.createDataFrame(
    df.rdd.zipWithIndex.map(ln =>
      Row.fromSeq(
        (if (inFront) Seq(ln._2 + offset) else Seq())
          ++ ln._1.toSeq ++
        (if (inFront) Seq() else Seq(ln._2 + offset))
      )
    ),
    StructType(
      (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]()) 
        ++ df.schema.fields ++ 
      (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
    )
  ) 
}

@eliasah -- I found a `Window` expression way to do this. It is much slower, however, but figured you might want to take a look. See answer below. — David Griffin, May 07 '16 at 11:36
That's awesome. Any references to a PySpark version? Thanks for sharing. — Tagar, Jan 15 '18 at 03:48
This looks very smooth. Can someone help me to write this in JAVA — John Humanyun, Mar 19 '19 at 10:45

score 14 · Accepted Answer · edited Jun 20 '20 at 09:12

Since Spark 1.6 there is a function called monotonically_increasing_id()
It generates a new column with unique 64-bit monotonic index for each row
But it isn't consequential, each partition starts a new range, so we must calculate each partition offset before using it.
Trying to provide an "rdd-free" solution, I ended up with some collect(), but it only collects offsets, one value per partition, so it will not cause OOM

def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
    val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", monotonically_increasing_id())

    val partitionOffsets = dfWithPartitionId
        .groupBy("partition_id")
        .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
        .orderBy("partition_id")
        .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt" )
        .collect()
        .map(_.getLong(0))
        .toArray
        
     dfWithPartitionId
        .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
        .withColumn(indexName, col("partition_offset") + col("inc_id"))
        .drop("partition_id", "partition_offset", "inc_id")
}

This solution doesn't repack the original rows and doesn't repartition the original huge dataframe, so it is quite fast in real world: 200GB of CSV data (43 million rows with 150 columns) read, indexed and packed to parquet in 2 minutes on 240 cores
After testing my solution, I have run Kirk Broadhurst's solution and it was 20 seconds slower
You may want or not want to use dfWithPartitionId.cache(), depends on task

First of all, there can be an error, if some partition gets no rows of `df`. Please, check answer by @fylb with explanation of how to fix it. Second, the last `as "cnt"` is quite confusing there and, since it is not used after, can and worth to be removed. Third, thanks for very useful, interesting and quite elegant answer! — Aleksei Alefirov, Feb 01 '19 at 15:55
Let's say I'm creating indexes for column "product_id" with `monotonically_increasing_id()`, my question is does this function guarantee that the same product_id will have the same index across partitions/nodes? — SarahData, Feb 13 '19 at 14:13
@SarahData no, `monotonically_increasing_id()` is absolutely unique across all the rows, it doesn't depend on any other columns, so different rows with the same `product_id` will always have different `monotonically_increasing_id()`'s — Evgenii Glotov, Feb 21 '19 at 23:06

score 9 · Answer 3 · edited Jul 22 '16 at 08:11

9

Starting in Spark 1.5, Window expressions were added to Spark. Instead of having to convert the DataFrame to an RDD, you can now use org.apache.spark.sql.expressions.row_number. Note that I found performance for the the above dfZipWithIndex to be significantly faster than the below algorithm. But I am posting it because:

Someone else is going to be tempted to try this
Maybe someone can optimize the expressions below

At any rate, here's what works for me:

import org.apache.spark.sql.expressions._

df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))

Note that I use lit(1) for both the partitioning and the ordering -- this makes everything be in the same partition, and seems to preserve the original ordering of the DataFrame, but I suppose it is what slows it way down.

I tested it on a 4-column DataFrame with 7,000,000 rows and the speed difference is significant between this and the above dfZipWithIndex (like I said, the RDD functions is much, much faster).

edited Jul 22 '16 at 08:11

schot

10,958
2
46
71

answered May 07 '16 at 11:35

David Griffin

13,677
5
47
65

3

Wouldn't that cause OOM error if the dataset doesn't fit in a single worker's memory? – Daniel de Paula May 09 '16 at 14:12
1

I have no idea -- I just know that it is much, much slower than the `RDD` based `zipWithIndex`, and that was more than enough for me to stop thinking about it. I posted the above so that other people wouldn't be tempted to go too far down this path; the original `dfZipWithIndex` still seems to be the best approach. – David Griffin May 09 '16 at 14:19
Thanks for sharing this, I thought the way of not converting DF to RDD will be faster at first, and won't go too far on that way now. – Robin Wang Dec 23 '16 at 08:48
two concerns with this approach: 1) window functions need order by clause - so data set will be sorted - which is why I guess you found that dfZipWithIndex is faster; 2) sometimes you don't have a column to order-by on; for example, you would need a native order of lines as it was in a file, so then dfZipWithIndex might be the only choice.. – Tagar Jan 15 '18 at 03:51
with spark 2.3.0, does the worker not spill to disk if the data exceeds it's memory capacity? – s.r May 16 '19 at 21:22
I was using row_number.over(Window.orderBy(lit(1))), It doesn't have partitionBy, and found it doesn't keep the original record Line Number. However, the requirement for me is to pickup the 1st record in the file, I was usign spark. so I refer to this :https://stackoverflow.com/questions/40898344/would-zipwithindex-after-sc-textfile-give-the-correct-line-numbers – Emma Y Dec 15 '22 at 02:58

score 7 · Answer 4 · answered Jan 15 '18 at 06:58

PySpark version:

from pyspark.sql.types import LongType, StructField, StructType

def dfZipWithIndex (df, offset=1, colName="rowId"):
    '''
        Enumerates dataframe rows is native order, like rdd.ZipWithIndex(), but on a dataframe 
        and preserves a schema

        :param df: source dataframe
        :param offset: adjustment to zipWithIndex()'s index
        :param colName: name of the index column
    '''

    new_schema = StructType(
                    [StructField(colName,LongType(),True)]        # new added field in front
                    + df.schema.fields                            # previous schema
                )

    zipped_rdd = df.rdd.zipWithIndex()

    new_rdd = zipped_rdd.map(lambda (row,rowId): ([rowId +offset] + list(row)))

    return spark.createDataFrame(new_rdd, new_schema)

Also created a jira to add this functionality in Spark natively: https://issues.apache.org/jira/browse/SPARK-23074

score 5 · Answer 5 · answered Mar 21 '18 at 13:30

@Evgeny , your solution is interesting. Notice that there is a bug when you have empty partitions (the array is missing these partition indexes, at least this is happening to me with spark 1.6), so I converted the array into a Map(partitionId -> offsets).

Additionnally, I took out the sources of monotonically_increasing_id to have "inc_id" starting from 0 in each partition.

Here is an updated version:

import org.apache.spark.sql.catalyst.expressions.LeafExpression
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.catalyst.expressions.Nondeterministic
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode
import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window

case class PartitionMonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {

  /**
   * From org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
   *
   * Record ID within each partition. By being transient, count's value is reset to 0 every time
   * we serialize and deserialize and initialize it.
   */
  @transient private[this] var count: Long = _

  override protected def initInternal(): Unit = {
    count = 1L // notice this starts at 1, not 0 as in org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
  }

  override def nullable: Boolean = false

  override def dataType: DataType = LongType

  override protected def evalInternal(input: InternalRow): Long = {
    val currentCount = count
    count += 1
    currentCount
  }

  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
    val countTerm = ctx.freshName("count")
    ctx.addMutableState(ctx.JAVA_LONG, countTerm, s"$countTerm = 1L;")
    ev.isNull = "false"
    s"""
      final ${ctx.javaType(dataType)} ${ev.value} = $countTerm;
      $countTerm++;
    """
  }
}

object DataframeUtils {
  def zipWithIndex(df: DataFrame, offset: Long = 0, indexName: String = "index") = {
    // from https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex)
    val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", new Column(PartitionMonotonicallyIncreasingID()))

    // collect each partition size, create the offset pages
    val partitionOffsets: Map[Int, Long] = dfWithPartitionId
      .groupBy("partition_id")
      .agg(max("inc_id") as "cnt") // in each partition, count(inc_id) is equal to max(inc_id) (I don't know which one would be faster)
      .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") + lit(offset) as "cnt")
      .collect()
      .map(r => (r.getInt(0) -> r.getLong(1)))
      .toMap

    def partition_offset(partitionId: Int): Long = partitionOffsets(partitionId)
    val partition_offset_udf = udf(partition_offset _)
    // and re-number the index
    dfWithPartitionId
      .withColumn("partition_offset", partition_offset_udf(col("partition_id")))
      .withColumn(indexName, col("partition_offset") + col("inc_id"))
      .drop("partition_id")
      .drop("partition_offset")
      .drop("inc_id")
  }
}

score 3 · Answer 6 · answered Jan 28 '20 at 06:20

I have modified @Tagar's version to run on Python 3.7, wanted to share:

def dfZipWithIndex (df, offset=1, colName="rowId"):
'''
    Enumerates dataframe rows is native order, like rdd.ZipWithIndex(), but on a dataframe
    and preserves a schema

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
'''

new_schema = StructType(
                [StructField(colName,LongType(),True)]        # new added field in front
                + df.schema.fields                            # previous schema
            )

zipped_rdd = df.rdd.zipWithIndex()

new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))      # use this for python 3+, tuple gets passed as single argument so using args and [] notation to read elements within args
return spark.createDataFrame(new_rdd, new_schema)

canberker · Answer 7 · 2019-04-15T09:44:07.677

Spark Java API version:

I have implemented @Evgeny's solution for performing zipWithIndex on DataFrames in Java and wanted to share the code.

It also contains the improvements offered by @fylb in his solution. I can confirm for Spark 2.4 that the execution fails when the entries returned by spark_partition_id() do not start with 0 or do not increase sequentially. As this function is documented to be non-deterministic, it is very likely that one of the above cases will occur. One example is triggered by increasing the partition count.

The java implementation is given below:

public static Dataset<Row> zipWithIndex(Dataset<Row> df, Long offset, String indexName) {
        Dataset<Row> dfWithPartitionId = df
                .withColumn("partition_id", spark_partition_id())
                .withColumn("inc_id", monotonically_increasing_id());

        Object partitionOffsetsObject = dfWithPartitionId
                .groupBy("partition_id")
                .agg(count(lit(1)).alias("cnt"), first("inc_id").alias("inc_id"))
                .orderBy("partition_id")
                .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")).minus(col("cnt")).minus(col("inc_id")).plus(lit(offset).alias("cnt")))
                .collect();
        Row[] partitionOffsetsArray = ((Row[]) partitionOffsetsObject);
        Map<Integer, Long> partitionOffsets = new HashMap<>();
        for (int i = 0; i < partitionOffsetsArray.length; i++) {
            partitionOffsets.put(partitionOffsetsArray[i].getInt(0), partitionOffsetsArray[i].getLong(1));
        }

        UserDefinedFunction getPartitionOffset = udf(
                (partitionId) -> partitionOffsets.get((Integer) partitionId), DataTypes.LongType
        );

        return dfWithPartitionId
                .withColumn("partition_offset", getPartitionOffset.apply(col("partition_id")))
                .withColumn(indexName, col("partition_offset").plus(col("inc_id")))
                .drop("partition_id", "partition_offset", "inc_id");
    }

bonnal-enzo · Answer 8 · 2020-05-06T18:51:52.840

Here is my proposal, the advantages of which are:

It does not involve any serialization/deserialization^[1] of our DataFrame's InternalRows.
Its logic is minimalist by relying only on RDD.zipWithIndex.

Its major down sides are:

It is impossible to use it directly from non-JVM APIs (pySpark, SparkR).
It has to be under the package org.apache.spark.sql;.

imports:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.functions.lit

/**
  * Optimized Spark SQL equivalent of RDD.zipWithIndex.
  *
  * @param df
  * @param indexColName
  * @return `df` with a column named `indexColName` of consecutive unique ids.
  */
def zipWithIndex(df: DataFrame, indexColName: String = "index"): DataFrame = {
  import df.sparkSession.implicits._

  val dfWithIndexCol: DataFrame = df
    .drop(indexColName)
    .select(lit(0L).as(indexColName), $"*")

  val internalRows: RDD[InternalRow] = dfWithIndexCol
    .queryExecution
    .toRdd
    .zipWithIndex()
    .map {
      case (internalRow: InternalRow, index: Long) =>
        internalRow.setLong(0, index)
        internalRow
    }

  Dataset.ofRows(
    df.sparkSession,
    LogicalRDD(dfWithIndexCol.schema.toAttributes, internalRows)(df.sparkSession)
  )

^[1]: (from/to InternalRow's underlying bytes array <--> GenericRow's underlying JVM objects collection Array[Any]).

score 1 · Answer 9 · answered Oct 21 '22 at 14:42

I've ported @canberker suggestion to Python 3 (pyspark).

Also, instead of using UDF with hash map, I used broadcast join which slightly improved performance during my testings.

NOTE: This solution still suffers from gaps due to empty partitions.

def zip_with_index(df, offset: int = 1, index_name: str = "id"):
    df_with_partition_id = (
        df
        .withColumn("partition_id", F.spark_partition_id())
        .withColumn("inc_id", F.monotonically_increasing_id())
    )
    partition_offsets_df = (
        df_with_partition_id
        .groupBy("partition_id")
        .agg(F.count(F.lit(1)).alias("cnt"), F.first("inc_id").alias("inc_id"))
        .orderBy("partition_id")
        .select(
            F.col("partition_id"),
            (
                F.sum("cnt").over(Window.orderBy("partition_id"))
                - F.col("cnt") - F.col("inc_id") + F.lit(offset)
            ).alias("partition_offset")
        )
    )

    res = (
        df_with_partition_id
        .join(partition_offsets_df.hint("broadcast"), on="partition_id")
        .withColumn(index_name, F.col("partition_offset") + F.col("inc_id"))
        .drop("partition_id", "partition_offset", "inc_id")
    )
    return res

DataFrame-ified zipWithIndex

9 Answers9

Linked

Related