
I want to persist to BigTable a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null) while keeping only non-null values (to avoid storage cost).

Is there a way to specify in Spark to ignore nulls when writing?

Thanks!

  • Does this answer your question? [how to filter out a null value from spark dataframe](https://stackoverflow.com/questions/39727742/how-to-filter-out-a-null-value-from-spark-dataframe) – Igor Dvorzhak Jan 31 '21 at 17:06
  • @Igor Dvorzhak: Thanks. I want to avoid persisting null values (within a row or a column), not excluding an entire row or column, which is what the link suggests and would mean data loss. – py-r Jan 31 '21 at 20:28
  • Thank you for the clarification, updated my answer. – Igor Dvorzhak Feb 01 '21 at 06:01
  • @IgorDvorzhak: Thanks. So you're suggesting to write Spark data row-by-row into BigTable while applying column pruning every time? No batch way with some value pruning? The hint for Parquet is welcome, but out of scope here as we're discussing BigTable ;) – py-r Feb 01 '21 at 15:09
  • I think that you can transform the dataset into a new one with pruned columns first and after that write it in batches to HBase/BigTable – Igor Dvorzhak Feb 01 '21 at 17:55
  • Then how do you handle a sparse column - i.e. with mostly nulls except a few values - in the dataset? – py-r Feb 01 '21 at 18:16
  • Why would the suggestion in https://stackoverflow.com/questions/59638950/how-to-let-null-values-are-not-stored-in-hbase-in-pandas-python/59641595#59641595 not work to handle this? – Igor Dvorzhak Feb 01 '21 at 20:35
  • Yes, exactly something like that (sorry if you already pointed to this link). I was just hoping such a function would be built into the Spark (or Flink) connector, as it seems like an obvious case, no? – py-r Feb 01 '21 at 20:51
  • Yeah, it seems so, but it looks like HBase support is quite basic in Spark and Flink TBH, so I would not be surprised if there is no such option. – Igor Dvorzhak Feb 01 '21 at 21:01
  • How to close such a question then? You helped with a possible suggestion on how to proceed, though you're not sure of Spark support on this one. Maybe you can make this doubt clear in the answer? Btw, where to place a feature request on the connector? I'm still a bit confused about who is ultimately supporting it. – py-r Feb 01 '21 at 21:09
  • Good point, prefaced my answer with "Probably" to convey this. HBase has its own connectors repo now (it includes the Spark HBase connector), so probably that will be the best place to file a FR: https://github.com/apache/hbase-connectors, or the corresponding component in https://issues.apache.org/jira – Igor Dvorzhak Feb 01 '21 at 23:08

1 Answer


Probably (didn't test it), before writing a Spark DataFrame to HBase/BigTable you can transform it by filtering out the null-valued columns in each row using a custom function, as suggested for pandas here: https://stackoverflow.com/a/59641595/3227693. However, there is no built-in connector supporting this feature to the best of my knowledge.
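A minimal PySpark sketch of that idea (untested against a real BigTable writer; the column names and toy data are placeholders, and `map_filter` with a Python lambda needs Spark 3.1+): each row is collapsed into a map holding only its non-null cells, which a row-by-row HBase/BigTable writer could then consume.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("prune-nulls").getOrCreate()

# Toy stand-in for the wide, sparse DataFrame.
df = spark.createDataFrame(
    [("row1", 1, None, 3), ("row2", None, 2, None)],
    ["key", "c1", "c2", "c3"],
)

value_cols = [c for c in df.columns if c != "key"]

# Build a {column -> value} map per row, then drop the null entries so
# only populated cells remain: row1 -> {c1 -> 1, c3 -> 3}, row2 -> {c2 -> 2}.
pruned = df.select(
    "key",
    F.map_filter(
        F.create_map(*[x for c in value_cols for x in (F.lit(c), F.col(c))]),
        lambda _, v: v.isNotNull(),
    ).alias("cells"),
)

pruned.show(truncate=False)
```

Note that a map column holds one value type, so with heterogeneous column types you would first cast the values to a common type (e.g. string) before building the map.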

Alternatively, you can try storing the data in a columnar file format like Parquet instead, because such formats handle persistence of sparse columnar data efficiently (at least in terms of output size in bytes). But to avoid writing many small files (due to the sparse nature of the data), which can decrease write throughput, you will probably need to decrease the number of output partitions before performing the write (i.e. write more rows per Parquet file: Spark parquet partitioning : Large number of files).
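For the Parquet route, a sketch of the partition-reduction step; the target partition count and output path are placeholder assumptions to tune for your data volume:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparse-parquet").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "key")

# coalesce() merges existing partitions without a full shuffle, so the
# write produces fewer, larger Parquet files instead of many tiny ones.
df.coalesce(16).write.mode("overwrite").parquet("/tmp/sparse_parquet")
```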
