I have been looking for a way to secure Parquet files, column-wise, for Spark access. Ideally, that would work the same way Apache Ranger works for Hive, i.e., a Sysadmin defines the access policies for different groups and columns.

I have been trying Ranger through Hortonworks HDP; however, it seems that plug-ins for Spark and Parquet are not available yet.

I have also been able to devise a solution using Apache Drill and views. However, it is not acceptable right now, mainly because community support for Drill is still scarce.

Has anyone faced the same requirement, or does anyone have directions for a solution?

Dennis Jaheruddin
Felipe Martins Melo

1 Answer

After a great deal of research, I've come to the conclusion that this is not possible.

The way Ranger works with other tools (HDFS, Hive, HBase, etc.) is by using plug-ins that implement hooks provided by those tools. For instance, to create a custom plug-in to secure Hive, one needs to create a HiveAuthorizer through the HiveAuthorizerFactory. But there is no such hook for Parquet, as it is nothing more than a file format.
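To make the hook idea concrete, here is a minimal, language-neutral sketch in Python. It is not Hive's or Ranger's actual API: `Authorizer`, `DenyColumnsAuthorizer`, and `Engine` are hypothetical names used only to show an engine calling a plug-in before each read. A bare file format has no such engine in the read path, so there is nowhere to attach the check.

```python
from abc import ABC, abstractmethod


class Authorizer(ABC):
    """Hypothetical hook exposed by a query engine (not a real Hive/Ranger class)."""

    @abstractmethod
    def check(self, user, table, columns):
        """Return True if `user` may read `columns` from `table`."""


class DenyColumnsAuthorizer(Authorizer):
    """Hypothetical plug-in: rejects any request that touches a restricted column."""

    def __init__(self, restricted):
        # Mapping of table name -> set of restricted column names.
        self.restricted = restricted

    def check(self, user, table, columns):
        return not (set(columns) & self.restricted.get(table, set()))


class Engine:
    """Toy engine that consults the hook before serving a read."""

    def __init__(self, authorizer):
        self.authorizer = authorizer

    def select(self, user, table, columns):
        if not self.authorizer.check(user, table, columns):
            raise PermissionError(f"{user} may not read {columns} from {table}")
        return f"SELECT {', '.join(columns)} FROM {table}"
```

With `Engine(DenyColumnsAuthorizer({"customers": {"ssn"}}))`, a `select` on `name` goes through, while a `select` that includes `ssn` raises `PermissionError`.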

A possible way to secure Parquet files at the column level from Ranger is to create an extension for Ranger's HDFS plug-in. This extension would implement the access rules for Parquet files defined through Ranger. That way, we could seamlessly secure Parquet files the same way we do for Hive or HBase, as long as the files are stored in HDFS.
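As a rough illustration of the rule evaluation such an extension would need, here is a minimal Python sketch. Everything in it is an assumption: the policy shape (path pattern, group, allowed columns), the example paths and column names, and the function names are hypothetical and do not come from Ranger. How the extension would learn which columns a reader requests is left open.

```python
from fnmatch import fnmatch

# Hypothetical policy entries: (path pattern, group, columns that group may read).
POLICIES = [
    ("/data/sales/*.parquet", "analysts", {"region", "amount"}),
    ("/data/sales/*.parquet", "finance", {"region", "amount", "customer_id"}),
]


def allowed_columns(path, groups):
    """Union of the columns granted to any of the user's groups for this path."""
    allowed = set()
    for pattern, group, columns in POLICIES:
        if group in groups and fnmatch(path, pattern):
            allowed |= columns
    return allowed


def denied_columns(path, groups, requested):
    """Return the requested columns the user may NOT read (empty list = allow)."""
    return sorted(set(requested) - allowed_columns(path, groups))
```

A read would be allowed only when `denied_columns(...)` returns an empty list; paths with no matching policy deny every column by default.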

Felipe Martins Melo