Can i filter the files(filenames) from which i wanted to create a hive table in databricks?

Question

I have serverlogs enabled on s3 bucket. The log files have names as:

2023-02-16-00-16-16-A4210A3BBB675006. The first part of the filename is the date. And i extract various fields from the contents of the file using regex serde and create a hive table. Now I wanted to create the table for the files from any particular day say 2023-02-16. I have tried many ways but was not able to do so.

I specified this where condition : WHERE input_file_name() LIKE 's3://<path>/s3_logs/2023-02-16%'. But was thrown error. And i also tried this: LOCATION 's3:///s3_logs/2023-02-16%' , which also throwed an error.

I read somewhere that it is not possible to do so. Even if we specify the input_file_name() , the databricks parse every file and just filter the files which match our pattern, which doesn't solve my problem. I wanted to do this to reduce the computation complexity by restricting the data sources to a specific pattern. Can anyone help me with this? Thank you!!

score 0 · Answer 1 · answered Mar 21 '23 at 06:11

I got the answer. In SQL, while creating external tables, we use LOCATION. That keyword doesn't accept regular expressions. It accepts something called GLOBBS which is pattern matching mechanism in Shell or Bash script. So, s3://<path>/s3_logs/2023-02-16* does my work. Reference: https://linuxhint.com/bash_globbing_tutorial/#:~:text=The%20Bash%20shell%20feature%20that,characters%20to%20create%20the%20pattern.

Can i filter the files(filenames) from which i wanted to create a hive table in databricks?

1 Answers1