
There is a use case where I have to read a huge Parquet file and convert it into RocksDB binaries, so I decided to use Spark (because everybody on my team is familiar with it).

On the RocksDB side, I know it is not distributed and cannot be parallelized.

So what I have done is create multiple RocksDB instances in parallel with Spark, one per task (see the sketch below).

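Roughly, each task does something like this (a minimal sketch in Scala; the `key`/`value` columns and all paths are placeholders, not my exact code):

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.{Row, SparkSession}
import org.rocksdb.{Options, RocksDB}

val spark = SparkSession.builder().appName("parquet-to-rocksdb").getOrCreate()

// One standalone RocksDB instance per task, on the executor's local disk.
val writePartition = (rows: Iterator[Row]) => {
  RocksDB.loadLibrary()
  val opts = new Options().setCreateIfMissing(true)
  val db = RocksDB.open(opts, s"/tmp/rocksdb-part-${TaskContext.getPartitionId()}")
  try {
    rows.foreach(r => db.put(r.getAs[Array[Byte]]("key"), r.getAs[Array[Byte]]("value")))
  } finally {
    db.close()
    opts.close()
  }
}

spark.read.parquet("/data/huge.parquet")
  .select("key", "value")
  .foreachPartition(writePartition)
```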

Now I want to combine them. So my question is: is it possible to merge multiple RocksDB instances into one big RocksDB instance with some post-processing?

Kaushal
  • How big are the rocksdb instances? You said below _"the data is very huge, we can not collect all the data to the driver side"_ so didn't you answer your question _"is it possible to combine multiple instances of Rocksdb together to create a big Rocksdb instance"_ already? – Jacek Laskowski Aug 09 '19 at 08:33
  • Yes @JacekLaskowski, that is correct, but it can reside on a single machine with around a 1 TB SSD. That is not the problem; I am just figuring out a way to parallelize my processing. – Kaushal Aug 09 '19 at 08:57
  • Estimates have been asked for (now multiple times). Can we actually get a number? Also the system configuration, such as CPU, memory, and disk, both for Spark and for the node where RocksDB will actually be used. – D3V Aug 09 '19 at 18:19

1 Answer


Why don't you do a collectPartitions() or toLocalIterator() at the driver and process each partition? Yes, it won't be parallel execution, but you will get one consolidated DB. A sketch of the toLocalIterator() variant follows.
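For example, a rough sketch assuming a binary key/value schema (column names and paths are placeholders); batching with WriteBatch keeps the single-machine pass reasonably fast:

```scala
import org.apache.spark.sql.SparkSession
import org.rocksdb.{Options, RocksDB, WriteBatch, WriteOptions}

val spark = SparkSession.builder().appName("consolidate-rocksdb").getOrCreate()

RocksDB.loadLibrary()
val opts = new Options().setCreateIfMissing(true)
val db = RocksDB.open(opts, "/data/consolidated-rocksdb") // placeholder path
val writeOpts = new WriteOptions()

// toLocalIterator() streams one partition at a time to the driver,
// so the driver only ever holds a single partition in memory.
val it = spark.read.parquet("/data/huge.parquet").select("key", "value").toLocalIterator()

var batch = new WriteBatch()
var count = 0
while (it.hasNext) {
  val row = it.next()
  batch.put(row.getAs[Array[Byte]]("key"), row.getAs[Array[Byte]]("value"))
  count += 1
  if (count % 10000 == 0) { // flush in batches rather than per row
    db.write(writeOpts, batch)
    batch.close()
    batch = new WriteBatch()
  }
}
db.write(writeOpts, batch)  // flush the final partial batch
batch.close(); writeOpts.close(); db.close(); opts.close()
```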

Also, an update: you can use RocksDB's SstFileWriter (in the spirit of a Hadoop OutputFormat) on every executor, and RocksDB supports ingesting the resulting SST files. Here is the blog post on it:

https://rocksdb.org/blog/2017/02/17/bulkoad-ingest-sst-file.html
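A rough sketch of that approach, assuming the same binary key/value schema as above (all names and paths are placeholders). Note that SstFileWriter requires keys to be added in ascending comparator order, so sort within partitions first (e.g. repartitionByRange(...) followed by sortWithinPartitions(...)):

```scala
import org.rocksdb.{EnvOptions, IngestExternalFileOptions, Options, RocksDB, SstFileWriter}

// Executor side: each task writes one SST file.
// Keys MUST arrive in sorted order or SstFileWriter will fail.
def writeSst(rows: Iterator[(Array[Byte], Array[Byte])], sstPath: String): Unit = {
  RocksDB.loadLibrary()
  val opts = new Options()
  val envOpts = new EnvOptions()
  val writer = new SstFileWriter(envOpts, opts)
  writer.open(sstPath)
  rows.foreach { case (k, v) => writer.put(k, v) }
  writer.finish()
  writer.close(); envOpts.close(); opts.close()
}

// Post-processing on the single machine: ingest every SST file into one DB.
def ingestAll(dbPath: String, sstPaths: java.util.List[String]): Unit = {
  RocksDB.loadLibrary()
  val opts = new Options().setCreateIfMissing(true)
  val db = RocksDB.open(opts, dbPath)
  val ingestOpts = new IngestExternalFileOptions()
  db.ingestExternalFile(sstPaths, ingestOpts)
  ingestOpts.close(); db.close(); opts.close()
}

// e.g. ingestAll("/data/big-rocksdb",
//                java.util.Arrays.asList("/sst/part-0.sst", "/sst/part-1.sst"))
```

Since ingestExternalFile() links or moves the files directly into the LSM tree, the final merge on the single machine is mostly metadata work rather than a rewrite of all the data.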

  • Yes, it can be done, but the problem is that the data is very large; we cannot collect all of it on the driver side, and it will take a lot of time to process because only one machine (the driver) is responsible for the whole job. – Kaushal Aug 03 '19 at 08:30
  • How big is the data we are talking about here? Try to even out the partitions by doing a repartition, and then make sure you collect one partition at a time and execute batch updates when inserting into the db. – Pranav Sawant Aug 06 '19 at 15:39
  • The other way you can do it is by hosting the DB on a remote server and doing HTTP remote inserts. This needs extra infrastructure overhead but is doable. – Pranav Sawant Aug 06 '19 at 15:42