0

I have CSV files organized by date and time as follows

logs/YYYY/MM/DD/CSV files...

I have setup Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files; the organization of the files can be utilized to optimize the performance. For example,

SELECT * from data where trans>='20170101' AND trans<'20170102';

In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill do optimization based on this directory structure? Is it possible to do this in Hive, Impala or any other tool?

Please note:

  • SQL queries will almost always contain the time frame.
  • Number of CSV files in a given directory is not huge. Combined all years worth of data, it will be huge
  • There is a field called 'trans' in every CSV file, which contains the date and time.
  • The CSV file is put under appropriate directory based on the value of 'trans' field.
  • CSV files do not follow any schema. Columns may or may not be different.
TriCore
  • 1,844
  • 1
  • 16
  • 17

1 Answers1

2

Querying using column inside the data file would not help in partition pruning.

You can use dir* variables in Drill to refer to partitions in table.

create view trans_logs_view as 
select
 `dir0` as `tran_year`,
 `dir1` as `trans_month`,
 `dir2` as `tran_date`, * from dfs.`/data/logs`;

You can query using tran_year,tran_month and tran_date columns for partition pruning.

Also see if below query helps for pruning.

select count(1)  from dfs.`/data/logs` 
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';

If so , you can define view by aliasing concat(dir0,dir1,dir2) to trans column name and query.

See below for more details.

https://drill.apache.org/docs/how-to-partition-data/

InfamousCoconut
  • 784
  • 8
  • 19
  • not sure what there is not to like about this answer, it really helped me thanks! – sg1234 Jun 17 '21 at 03:21
  • I used your advice and created a view, except I have to cast the `dir0` to an int in order to be able to use that field in joins with other tables. Do you by any chance know if that maintains partition pruning or does it break it? In case you'd like to chime in, here is my question: https://stackoverflow.com/questions/67929849/looking-for-a-non-cloud-rdbms-to-import-partitioned-tables-in-csv-format-with – sg1234 Jun 17 '21 at 04:48