I have CSV files organized by date and time as follows:
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, the directory organization could be used to optimize performance. For example:
SELECT * from data where trans>='20170101' AND trans<'20170102';
For this SQL, only the directory logs/2017/01/01
should need to be scanned. Is there a way to let Apache Drill optimize queries based on this directory structure? Is it possible to do this in Hive, Impala, or any other tool?
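I came across Drill's implicit directory columns (dir0, dir1, dir2), which map to the YYYY/MM/DD levels when querying the top-level directory. I imagine the query would look roughly like this (the dfs.`/logs` path is just a guess at how my setup would reference the logs directory):

-- Filter on Drill's implicit directory columns; dir0/dir1/dir2
-- correspond to the YYYY/MM/DD subdirectories under logs/.
SELECT *
FROM dfs.`/logs`
WHERE dir0 = '2017' AND dir1 = '01' AND dir2 = '01';

But I am not sure whether Drill actually prunes the scan down to logs/2017/01/01 here, or whether it still lists every file and filters the rows afterwards.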
Please note:
- SQL queries will almost always contain a time frame.
- The number of CSV files in a given directory is not huge, but combined across all years the data volume is large.
- There is a field called 'trans' in every CSV file, which contains the date and time.
- Each CSV file is placed under the appropriate directory based on the value of its 'trans' field.
- The CSV files do not follow a common schema; columns may differ from file to file.
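Coming back to the Hive/Impala part of the question: my understanding is that I would have to declare an external table partitioned on the date parts and register each directory as a partition manually, since my layout is logs/2017/01/01 rather than Hive's year=2017/month=01/day=01 convention. A rough sketch of what I have in mind is below (table name, column list, and paths are made up, and it glosses over the fact that my CSV files don't share one schema):

-- Hypothetical external table; Hive needs a fixed column list,
-- which my files don't really have.
CREATE EXTERNAL TABLE logs_csv (
  trans STRING
  -- ... other columns
)
PARTITIONED BY (yr STRING, mon STRING, dy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Each day's directory would have to be registered as a partition,
-- because the directories are not named in key=value form.
ALTER TABLE logs_csv ADD PARTITION (yr = '2017', mon = '01', dy = '01')
LOCATION '/logs/2017/01/01';

-- Filtering on the partition columns should then prune to one directory.
SELECT * FROM logs_csv WHERE yr = '2017' AND mon = '01' AND dy = '01';

Is this the right approach, or is there a tool that handles this kind of date-partitioned CSV layout more directly?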