I'm investigating an interesting case that involves wide transformations (e.g. repartition & join) on a slow RDD or dataset, e.g. the dataset defined by the following code:
val ds = sqlContext.createDataset(1 to 100)
  .repartition(1)
  .mapPartitions { itr =>
    itr.map { ii =>
      Thread.sleep(100)
      println(f"skewed - ${ii}")
      ii
    }
  }
The slow dataset is relevant because it resembles a view of a remote data source whose partition iterator is backed by a single-threaded network protocol (HTTP, JDBC, etc.). In this case the download is faster than single-threaded processing, but much slower than distributed processing.
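For concreteness, here is a rough sketch of what such a source might look like; openConnection and fetchNext are hypothetical placeholders for the real HTTP/JDBC client, not actual APIs:

// hypothetical single-connection source: one partition whose iterator lazily
// pulls records over the wire, one blocking call at a time
val remote = sc.parallelize(Seq(0), numSlices = 1).mapPartitions { _ =>
  val conn = openConnection() // placeholder: the single-threaded http/jdbc client
  Iterator
    .continually(conn.fetchNext()) // placeholder: blocking network call returning Option[Record]
    .takeWhile(_.isDefined)
    .map(_.get)
}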
Unfortunately the conventional Spark computation model won't be efficient on a slow dataset because we are confined to one of the following options:
Use only narrow transformations (flatMap-like) to pipe the stream through the data processing end-to-end in a single thread (a minimal sketch follows this list); the data processing obviously becomes a bottleneck and resource utilisation stays low.
Use a wide operation (repartitioning included) to rebalance the RDD/dataset. While this is essential for parallel processing efficiency, Spark's coarse-grained scheduler demands that the download be fully completed before the downstream stage can start, so the download becomes another bottleneck.
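For reference, option 1 would look roughly like the following; process is just a stand-in for the real per-record work, not an actual function:

// option 1: narrow-only pipeline, download and processing share a single thread per partition
val narrowOnly = ds.map { ii =>
  process(ii) // placeholder for the actual (expensive) per-record processing
}
narrowOnly.foreach { _ => }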
Experiment
The following program represents a simple simulation of such case:
val mapped = ds
val mapped2 = mapped
  .repartition(10)
  .map { ii =>
    println(f"repartitioned - ${ii}")
    ii
  }
mapped2.foreach { _ => }
When executing the above program, it can be observed that no println(f"repartitioned - ${ii}") line appears before every println(f"skewed - ${ii}") line has been printed: the repartition introduces a shuffle, so the upstream stage running the slow iterator must finish completely before the downstream stage in the RDD dependency is scheduled.
I'd like to instruct the Spark scheduler to start distributing/shipping data entries generated by the partition iterator before its task completes (through a mechanism like micro-batching or streaming). Is there a simple way of doing this? E.g. converting the slow dataset into a structured stream would be nice, but alternatives that are better integrated would also work.
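To illustrate the kind of behaviour I'm after, here is a rough sketch of micro-batching driven from the driver side; this is not a real Spark feature, the chunk size of 10 is arbitrary, and it also parallelises the downloads themselves, which a strictly single-connection source may not allow:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// split the input into small chunks and submit one job per chunk, so that
// downstream processing of earlier chunks can overlap with the download of later chunks
val jobs = (1 to 100).grouped(10).toSeq.map { chunk =>
  Future {
    sc.parallelize(chunk, numSlices = 1)
      .map { ii => Thread.sleep(100); ii } // simulated slow download
      .repartition(10)
      .map { ii => println(f"repartitioned - $ii"); ii }
      .foreach { _ => }
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)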
Thanks a lot for your opinion
UPDATE: to make your experimentation easier, I have appended my Scala tests, which can be run out of the box:
package com.tribbloids.spookystuff.spike

import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.scalatest.{FunSpec, Ignore}

@Ignore
class SlowRDDSpike extends FunSpec {

  lazy val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
  lazy val sc: SparkContext = spark.sparkContext
  lazy val sqlContext: SQLContext = spark.sqlContext

  import sqlContext.implicits._

  describe("is repartitioning non-blocking?") {

    it("dataset") {
      val ds = sqlContext
        .createDataset(1 to 100)
        .repartition(1)
        .mapPartitions { itr =>
          itr.map { ii =>
            Thread.sleep(100)
            println(f"skewed - $ii")
            ii
          }
        }

      val mapped = ds
      val mapped2 = mapped
        .repartition(10)
        .map { ii =>
          Thread.sleep(400)
          println(f"repartitioned - $ii")
          ii
        }

      mapped2.foreach { _ => }
    }

    it("RDD") {
      val ds = sc
        .parallelize(1 to 100)
        .repartition(1)
        .mapPartitions { itr =>
          itr.map { ii =>
            Thread.sleep(100)
            println(f"skewed - $ii")
            ii
          }
        }

      val mapped = ds
      val mapped2 = mapped
        .repartition(10)
        .map { ii =>
          Thread.sleep(400)
          println(f"repartitioned - $ii")
          ii
        }

      mapped2.foreach { _ => }
    }
  }
}
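To actually run the suite, remove the @Ignore annotation and invoke it directly (e.g. sbt "testOnly *SlowRDDSpike"); this assumes a standard sbt project with spark-sql and scalatest on the test classpath.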