I have a parquet file containing multiple columns. I would like to select only a few of these columns, based on a condition on the column names: if a column name contains a substring that appears in my predefined list of strings, that column should be selected. Please note that this is not the same as Efficient way to read specific columns from parquet file in spark: there the exact column names are known in advance, whereas I need to select columns that even partially match any of the strings.
For example, if my predefined list of strings is
string_list = ['area', 'rank']
and my dataframe is:
+----------+----------+------------+--------+----------+------+
| areaname | pagerank | customerid | amount | areacode | page |
+----------+----------+------------+--------+----------+------+
| london | 345 | 22wer | 455 | 100 | 45 |
| germany | 47128 | f234 | 600 | 200 | 33 |
| lebanon | 344 | 45rtf | 700 | 230 | 43 |
+----------+----------+------------+--------+----------+------+
My expected output is:
+----------+----------+----------+
| areaname | pagerank | areacode |
+----------+----------+----------+
| london | 345 | 100 |
| germany | 47128 | 200 |
| lebanon | 344 | 230 |
+----------+----------+----------+
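To make the matching rule concrete, here is a minimal pure-Python sketch of the check I want to apply, using just the column names from the example above:

columns = ['areaname', 'pagerank', 'customerid', 'amount', 'areacode', 'page']
string_list = ['area', 'rank']

# keep a column name if it contains any of the predefined substrings
selected = [c for c in columns if any(s in c for s in string_list)]
print(selected)  # ['areaname', 'pagerank', 'areacode']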
I have already tried reading the file first and selecting the columns afterwards, as shown below. However, I do not want to read the entire file into memory at once; I want to perform the select operation while reading itself.
parquet_file = spark.read.format("parquet").load(parquet_path)
# keep a column if its name contains any of the substrings in string_list;
# the backticks guard against special characters in column names
required_columns = ['`' + column + '`' for column in parquet_file.columns
                    if any(item in column for item in string_list)]
parquet_file = parquet_file.select(required_columns)
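Something along these lines is what I mean by "selecting while reading" (this is only a sketch of the idea; it assumes pyarrow is available and that parquet_path points to a single parquet file rather than a directory of part files, which may not hold in my case):

import pyarrow.parquet as pq

# read only the parquet footer/schema, not the data
schema_names = pq.ParquetFile(parquet_path).schema_arrow.names

# filter the column names by partial match against string_list
required_columns = [name for name in schema_names
                    if any(item in name for item in string_list)]

# then load only the matching columns
parquet_file = spark.read.parquet(parquet_path).select(required_columns)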
I would like to know the most memory-efficient way to do this. Can I use parquet partitioning somehow?