I have approximately 600 million rows of data in 157 CSV files. The data is in the following format:
A: 8-digit int
B: 64-bit unsigned int
C: 140-character string
D: int
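To make the format concrete, here is a minimal Python sketch of how one row could be parsed and checked before loading. The column order (A, B, C, D), the comma delimiter, the reading of C as "at most 140 characters", and the sample values are all assumptions for illustration; they are not taken from the real files.

```python
import csv
import io
from typing import NamedTuple

class Row(NamedTuple):
    a: int  # A: 8-digit int
    b: int  # B: 64-bit unsigned int
    c: str  # C: string, assumed to be at most 140 characters
    d: int  # D: int

def parse_row(fields):
    """Convert one CSV record (a list of 4 strings) into a typed Row."""
    a, b, c, d = fields
    row = Row(int(a), int(b), c, int(d))
    if not 0 <= row.b < 2**64:
        raise ValueError("B does not fit in an unsigned 64-bit integer")
    if len(row.c) > 140:
        raise ValueError("C is longer than 140 characters")
    return row

# Hypothetical sample line; csv.reader handles the quoted comma inside C.
sample = io.StringIO('12345678,18446744073709551615,"hello, world",42\n')
rows = [parse_row(fields) for fields in csv.reader(sample)]
```

Note that B needs an unsigned 64-bit column type on the database side, since its largest value does not fit in a signed 64-bit integer.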
I will use the CSV files to load the data into a MySQL database and an HBase database, and I am trying to decide how to optimize the loading process. I need help with the following questions:
- Should I use a single table to store all the data, or shard it into multiple tables?
- What optimizations can I make to reduce the load time?
- How can I improve the overall performance of the database? Should I normalize the table that stores this information?
I will be using one M1.Large EC2 instance each for MySQL and HBase to load the CSV files.
============UPDATE============
I used a C3.8XLarge instance, and it took 2 hours to load 20 of the 157 CSV files, at 250 MB each. Eventually I had to stop the load because it was taking too long. CPU utilization was only 2% the entire time. If anyone can help, please do!
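To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes that the file size means 250 megabytes, that all 157 files are about the same size, and that the load rate would have stayed constant; none of these were measured.

```python
# Figures from the update above.
files_loaded = 20
file_size_mb = 250      # assumed to mean megabytes per file
elapsed_hours = 2
total_files = 157

# Data loaded so far and the effective load throughput.
loaded_mb = files_loaded * file_size_mb                    # 5000 MB
throughput_mb_s = loaded_mb / (elapsed_hours * 3600)       # about 0.69 MB/s

# Projected time for all files at the same rate.
est_total_hours = total_files / files_loaded * elapsed_hours  # about 15.7 hours

print(f"{throughput_mb_s:.2f} MB/s, about {est_total_hours:.1f} hours in total")
```

A rate below 1 MB/s combined with 2% CPU utilization suggests the load was probably not CPU-bound, though I have not confirmed where the time actually went.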