I want to persist a very wide Spark DataFrame (>100'000 columns) that is sparsely populated (>99% of values are null) while keeping only the non-null values (to avoid storage cost):

  • What is the best format for such a use case (HBase, Avro, Parquet, ...)?
  • What should be specified on the Spark side to ignore nulls when writing?

Note that I've already tried Parquet and Avro with a simple df.write statement: for a DataFrame of size ca. 100 x 130k, Parquet performs worst (ca. 55 MB) vs. Avro (ca. 15 MB). To me this suggests that ALL null values are stored.
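For reference, a minimal sketch of the kind of write that was tried (the tiny DataFrame and paths are illustrative only; the Avro writer assumes the external spark-avro package is on the classpath):

%python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the wide, sparsely populated DataFrame
df = spark.createDataFrame(
    [(1, None, 2.0, 0.5),
     (2, 3.0, None, None)],
    ["id", "c1", "c2", "c3"])

# The "simple df.write" variants compared above
df.write.mode("overwrite").parquet("/tmp/sparse.parquet")
df.write.mode("overwrite").format("avro").save("/tmp/sparse.avro")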

Thanks !


1 Answer

Spark to JSON / SparseVector (from thebluephantom)

In PySpark, using the ml package's SparseVector; convert to Scala otherwise.

%python
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.linalg import SparseVector, VectorUDT

# Build rows of (label, SparseVector); a SparseVector holds only
# the non-zero entries as (size, indices, values).
temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

schema = StructType([
    StructField("label", DoubleType(), False),
    StructField("features", VectorUDT(), False)
])

df = temp_rdd.toDF(schema)
df.printSchema()

# Write as JSON; the VectorUDT is serialized in its sparse form.
df.write.json("/FileStore/V.json")

# Read back with the same schema to recover the vectors.
df2 = spark.read.schema(schema).json("/FileStore/V.json")
df2.show()

returns upon read:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(4,[0,2],[-1.0,0.5])|
|  0.0| (4,[1,3],[1.0,5.5])|
+-----+--------------------+
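The (size, [indices], [values]) rendering above is SparseVector's string form: only the non-zero entries are materialized, which is what keeps the serialized footprint small.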

Spark to Avro / Avro2TF (from py-r)

The Avro2TF library (https://github.com/linkedin/Avro2TF) seems to be an interesting alternative that directly leverages Avro. As a result, a sparse vector would be encoded as follows:

+---------------------+--------------------+
|genreFeatures_indices|genreFeatures_values|
+---------------------+--------------------+
|     [2, 4, 1, 8, 11]|[1.0, 1.0, 1.0, 1...|
|          [11, 10, 3]|     [1.0, 1.0, 1.0]|
|            [2, 4, 8]|     [1.0, 1.0, 1.0]|
|             [11, 10]|          [1.0, 1.0]|
|               [4, 8]|          [1.0, 1.0]|
|         [2, 4, 7, 3]|[1.0, 1.0, 1.0, 1.0]|
+---------------------+--------------------+
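A hedged sketch of producing a similar indices/values layout directly in PySpark, without Avro2TF itself (the UDF helpers below are illustrative assumptions, not Avro2TF's API; df is the label/features DataFrame from the JSON example above):

%python
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, DoubleType

# Illustrative helpers: split a SparseVector into its
# indices and values arrays, mirroring the table above.
to_indices = udf(lambda v: v.indices.tolist(), ArrayType(IntegerType()))
to_values = udf(lambda v: v.values.tolist(), ArrayType(DoubleType()))

sparse_df = (df
    .withColumn("features_indices", to_indices("features"))
    .withColumn("features_values", to_values("features"))
    .drop("features"))
sparse_df.show()

# Row-oriented Avro then stores only the non-null entries;
# requires the spark-avro package on the classpath.
sparse_df.write.format("avro").save("/FileStore/V.avro")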
  • Thanks - that is a real sparse format ! Do you know if it is possible to get some similar structure in Avro/Parquet, incl. metadata ? What about HBase ? Indeed I would like to be effective at query time, too ;) Thanks ! – py-r Jan 09 '21 at 20:13
  • avro and parquet are diametrically opposed. one is row, one is columnar. the whole idea of the sparse format is to save space. – thebluephantom Jan 09 '21 at 20:28
  • Thanks, but are you saying that *by definition* Avro or Parquet do not support any sparse structure, i.e. must store all nulls ? What about HBase ? – py-r Jan 09 '21 at 20:44
  • I see no issue with JSON if you have a key, unless you want direct access. Dense vector to array works for Avro, not sparse though. – thebluephantom Jan 09 '21 at 20:45
  • sparse is more for in-memory machine learning. – thebluephantom Jan 09 '21 at 20:48
  • What about the Avro2TF format https://github.com/linkedin/Avro2TF ? – py-r Jan 09 '21 at 20:49
  • What about HBase ? – py-r Jan 09 '21 at 20:49
  • hard yakking bro – thebluephantom Jan 09 '21 at 20:49
  • Yep ! Giving you the point for json, thanks. I've started another question for HBase as I'd like to get this clarified: https://stackoverflow.com/questions/65647574/spark-hbase-sparse-content-persistence – py-r Jan 09 '21 at 21:04