
I have approximately 600 million rows of data in 157 CSV files. The data is in the following format:

A: 8-digit int
B: 64-bit unsigned int
C: 140-character string
D: int
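
For reference, in MySQL terms the layout would be something like this (the table and column names are just placeholders):

    CREATE TABLE records (
        a INT UNSIGNED NOT NULL,     -- 8-digit int fits in INT UNSIGNED
        b BIGINT UNSIGNED NOT NULL,  -- 64-bit unsigned value
        c CHAR(140) NOT NULL,        -- fixed-length 140-character string
        d INT NOT NULL
    ) ENGINE=InnoDB;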

I will use the CSV files to load data into MySQL and HBase databases, and I am deciding how to optimize the loading process. I need help with the following questions:

  1. Should I use a single table to store all the data, or shard it into multiple tables?
  2. What optimizations can I apply to reduce load time?
  3. How can I improve the overall performance of the database? Should I normalize the table?

I will be using one m1.large EC2 instance each to load the CSV files into the MySQL and HBase databases.

============UPDATE============
I used a c3.8xlarge instance, and it took 2 hours to load 20 of the 157 CSV files (about 250 MB each). Eventually I had to stop it, as it was taking too long. CPU utilization was only 2% throughout the entire period. If anyone can help, please do!


1 Answer


For HBase, you can use the standard CSV bulk load. For MySQL, you will have to use the regular CSV load (LOAD DATA INFILE).
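
As a rough sketch of the HBase path: ImportTsv with a comma separator to write HFiles, followed by completebulkload. The table name, column family, and HDFS paths below are placeholders:

    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.separator=, \
      -Dimporttsv.columns=HBASE_ROW_KEY,cf:b,cf:c,cf:d \
      -Dimporttsv.bulk.output=hdfs:///tmp/hfiles \
      mytable hdfs:///data/csv

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
      hdfs:///tmp/hfiles mytable

For MySQL, a sketch of LOAD DATA INFILE with consistency checks relaxed for the duration of the load (the file path and table name are placeholders, and this assumes a dedicated load window with no other writers):

    SET foreign_key_checks = 0;
    SET unique_checks = 0;
    SET sql_log_bin = 0;  -- skip binary logging during the load (requires SUPER)

    LOAD DATA INFILE '/data/part-001.csv'
    INTO TABLE records
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

    SET unique_checks = 1;
    SET foreign_key_checks = 1;

Note that a single LOAD DATA INFILE statement runs on one thread, which is consistent with the 2% CPU you saw; running several such statements in parallel, one per CSV file, is a common way to use more of the machine.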

Normalizing the data is up to you. Looking at your data structure, I think you probably don't need it.
