Context: I have been working on Cloudera/Impala in order to take a big database and create more manageable "aggregate" tables which contain substantially less information. These aggregate tables are on the order of tens to hundreds of gigabytes each, and there are about two dozen of them; in total, I am looking at about 500 gigabytes of data, which will fit on a computer in my lab.
Question: I wish to use a non-cloud RDBMS in order to keep working on these tables locally in my lab. The original Impala tables, most of them partitioned by date, have been exported to CSV: each "table" folder contains one subfolder per date, and each subfolder contains a single CSV file (from which the partitioning "date" column is absent, since it is encoded in the subfolder name). Which RDBMS would be adequate, and how would I import these tables?
What I've found so far: there seem to be several GUIs or commands that simplify importing CSVs into MySQL or other RDBMSs, e.g.:
- How do I import CSV file into a MySQL table?
- Export Impala Table from HDFS to MySQL
- How to load Excel or CSV file into Firebird?
However, these do not address my specific situation, since 1. I only have access to Impala on the cluster, i.e. I cannot add any tools there, so the heavy lifting must be done on the lab computer, and 2. they say nothing about importing an already partitioned table while keeping the existing directory/partition structure.
Constraints:
- Lab computer is on Ubuntu 20.04
- Ideally, I would like to avoid having to load each CSV/partition manually, as I have tens of thousands of dates. I am hoping for an RDBMS that already recognizes the partitioned directory structure...
- The RDBMS itself should have a fairly recent set of functions available, including lead/lag/first/last window functions (see the query sketch just after this list). Aside from that, it needn't be too fancy.
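For illustration, the kind of query I need the engine to support looks roughly like this (standard SQL; table and column names are made up):

```sql
-- Made-up example of the window functions I need (lead/lag/first/last):
SELECT
  id,
  thedate,
  value,
  LAG(value)         OVER (PARTITION BY id ORDER BY thedate) AS prev_value,
  LEAD(value)        OVER (PARTITION BY id ORDER BY thedate) AS next_value,
  FIRST_VALUE(value) OVER (PARTITION BY id ORDER BY thedate) AS first_val,
  LAST_VALUE(value)  OVER (PARTITION BY id ORDER BY thedate
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND UNBOUNDED FOLLOWING) AS last_val
FROM table1;
```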
I'm open to using Spark as an "overkill SQL engine" if that's the best way; I'm just not sure it is the best approach for a single computer (not a cluster). Also, if need be (though I would ideally like to avoid this), I can export my Impala tables in another format to ease the import phase, e.g. a different text-based format, Parquet, etc.
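For what it's worth, if Spark ends up being the answer, my understanding is that its CSV reader can discover the `thedate=...` directory layout on its own via partition discovery. A rough sketch of what I have in mind, run in a local-mode `spark-sql` shell (option names from memory, so this may need adjusting):

```sql
-- Sketch only: Spark's partition discovery should turn the thedate=... subdirectories
-- into a typed `thedate` column; the headerless CSV columns come back as _c0, _c1, ...
-- unless a schema is supplied explicitly.
CREATE TEMPORARY VIEW table1
USING csv
OPTIONS (path '/data/table1', header 'false', inferSchema 'true');

SELECT thedate, _c0, _c1 FROM table1 LIMIT 10;
```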
Edit 1: As suggested in the comments, I am currently looking at Apache Drill. It is correctly installed, and I have successfully run the basic queries from the documentation/tutorials. However, I am now stuck on how to actually "import" my tables (actually, I only need to "use" them, since Drill seems able to run queries directly on the filesystem). To clarify:
- I currently have two "tables" in the directories `/data/table1` and `/data/table2`.
- Those directories contain subdirectories corresponding to the different partitions, e.g. `/data/table1/thedate=1995`, `/data/table1/thedate=1996`, etc., and the same goes for table2.
- Within each subdirectory, I have a single file (without an extension) that contains the CSV data, without headers.
My understanding (I'm still new to Apache Drill) is that I need to create a File System Storage Plugin so that Drill knows where to look and what it is looking at. So I created a pretty basic plugin (a quasi copy/paste from this one) using the web interface on the Plugin Management page. The net result is that I can now type `use data;` and Drill understands it. I can then say `show files in data` and it correctly lists `table1` and `table2` as my two directories. Unfortunately, I am still missing two key things to be able to query these tables successfully:
- Running `select * from data.table1` fails with an error. I've also tried `table1` and `dfs.data.table1`, and I get a different error for each (object 'data' not found, object 'table1' not found, and schema [[dfs, data]] is not valid with respect to either root schema or current default schema, respectively). I suspect this is because there are sub-directories within `table1`? (See the query sketch just after this list.)
- I still have not said anything about the structure of the CSV files, and that structure would need to incorporate the fact that the "thedate" field and its value are in the sub-directory name...
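For context, what I was hoping for is something along these lines (untested sketch). Since my files have no extension and no header, I believe the workspace would need `"defaultInputFormat": "csv"` so Drill knows how to parse them; the columns would then come back through Drill's `columns` array, and `dir0` would hold the subdirectory name:

```sql
-- Untested sketch: assumes a dfs workspace named `data` pointing at /data, with
-- "defaultInputFormat": "csv" so the extension-less files are parsed as text.
-- Headerless text files are exposed through the `columns` array, and dir0
-- contains the first-level subdirectory name (e.g. 'thedate=1995').
SELECT dir0, columns[0] AS col_a, columns[1] AS col_b
FROM dfs.data.`table1`
LIMIT 10;
```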
Edit 2: After trying a bunch of things, I still had no luck with text-based files; however, using Parquet files worked:
- I can query a Parquet file.
- I can query a directory containing a partitioned table, each subdirectory being named in the format `thedate=1995`, `thedate=1996`, etc., as stated earlier (see the example just below).
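Concretely, querying the partitioned directory directly works, with `dir0` exposing the subdirectory names (workspace and path names are from my setup; adjust as needed):

```sql
-- dir0 holds the partition subdirectory name, e.g. 'thedate=1995'
SELECT dir0, COUNT(*) AS n
FROM dfs.data.`table1_parquet_partitioned`
GROUP BY dir0;
```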
I used the advice here in order to be able to query a table the usual way, i.e. using `thedate` rather than `dir0`. Essentially, I created a view: `create view drill.test as select dir0 as thedate, * from dfs.data/table1_parquet_partitioned`. Unfortunately, `thedate` is then text that reads `thedate=1994` rather than just `1994` (int). So I renamed the directories to contain only the date; however, this alone was not a good solution, as the type of `thedate` was still not an int and therefore I could not use dates to join with table2 (which has `thedate` as a regular column). So finally, what I did was cast `thedate` to an int in the view.
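In other words, the view I ended up with looks roughly like this (sketch; the view name and the writable `dfs.tmp` workspace are just examples):

```sql
-- Final approach: after renaming the directories, dir0 holds just the date,
-- so it can be cast to an INT and joined against table2.thedate.
CREATE OR REPLACE VIEW dfs.tmp.table1_v AS
SELECT CAST(dir0 AS INT) AS thedate, *
FROM dfs.data.`table1_parquet_partitioned`;
-- If the directories had kept the 'thedate=' prefix, something like
-- CAST(REGEXP_REPLACE(dir0, '^thedate=', '') AS INT) should work instead.
```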
=> This is all fine: although these are not CSV files, this alternative is workable for me. However, I am wondering whether, by using such a view with a cast inside, I will still benefit from partition pruning. The answer in the referenced Stack Overflow link suggests that partition pruning is preserved through the view, but I am unsure whether that still holds when the column is used in a formula... Finally, given that the only way I can make this work is via Parquet, it begs the question: is Drill the best solution for this in terms of performance? So far, I like it, but migrating the database will be time-consuming, and I would like to choose the best destination without too much trial and error...
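For reference, I suppose I can at least inspect the plan to see whether pruning survives the view, e.g.:

```sql
-- If pruning works through the view, the physical plan should show a scan
-- limited to the matching thedate directory rather than the whole table.
EXPLAIN PLAN FOR
SELECT * FROM dfs.tmp.table1_v WHERE thedate = 1995;
```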