I have a parquet file containing multiple columns. I would like to select only a few of these columns, based on a condition on the column names: if a column name contains a substring that appears in my predefined list of strings, that column should be selected. Please note that this is not the same as Efficient way to read specific columns from parquet file in spark: there the exact column names are known in advance, whereas I need to select columns that even partially match any of the strings.
For example, if my predefined list of strings is
string_list = ['area', 'rank']
and my dataframe is:
+----------+----------+------------+--------+----------+------+
| areaname | pagerank | customerid | amount | areacode | page |
+----------+----------+------------+--------+----------+------+
| london | 345 | 22wer | 455 | 100 | 45 |
| germany | 47128 | f234 | 600 | 200 | 33 |
| lebanon | 344 | 45rtf | 700 | 230 | 43 |
+----------+----------+------------+--------+----------+------+
My expected output is:
+----------+----------+----------+
| areaname | pagerank | areacode |
+----------+----------+----------+
| london | 345 | 100 |
| germany | 47128 | 200 |
| lebanon | 344 | 230 |
+----------+----------+----------+
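To make the matching rule concrete, here is a minimal pure-Python sketch of the check I want to apply, using just the column names from the example above:

columns = ['areaname', 'pagerank', 'customerid', 'amount', 'areacode', 'page']
string_list = ['area', 'rank']

# keep a column name if it contains any of the predefined substrings
selected = [c for c in columns if any(s in c for s in string_list)]
print(selected)  # ['areaname', 'pagerank', 'areacode']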
I have already tried reading the file first and selecting the columns afterwards, as shown below. However, I do not want to read the entire file into memory at once; I want to perform the select operation while reading itself.
parquet_file = spark.read.format("parquet").load(parquet_path)
# keep a column if its name contains any of the substrings in string_list;
# the backticks guard against special characters in column names
required_columns = ['`' + column + '`' for column in parquet_file.columns
                    if any(item in column for item in string_list)]
parquet_file = parquet_file.select(required_columns)
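Something along these lines is what I mean by "selecting while reading" (this is only a sketch of the idea; it assumes pyarrow is available and that parquet_path points to a single parquet file rather than a directory of part files, which may not hold in my case):

import pyarrow.parquet as pq

# read only the parquet footer/schema, not the data
schema_names = pq.ParquetFile(parquet_path).schema_arrow.names

# filter the column names by partial match against string_list
required_columns = [name for name in schema_names
                    if any(item in name for item in string_list)]

# then load only the matching columns
parquet_file = spark.read.parquet(parquet_path).select(required_columns)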
I would like to know the most memory-efficient way to do this. Can I use parquet partitioning somehow?