There is a free public data set available that has over 300 fields. I would like to create an ETL process to update and store the data locally in MySQL or SQL Server. Because the rows are too wide to fit into a single database table (both engines impose column-count and row-size limits), I will probably have to de-normalize, or perhaps normalize a different way. The following is a representation of my data set:
+------------+---------+---------+-----+-----------+---------+
| Column1 | Column2 | Column3 | …. | Column319 | ColumnN |
+------------+---------+---------+-----+-----------+---------+
| HUEBERT | ALLISON | L | DR. | | M.D. |
| YOUNGMAYKA | CYNTHIA | | | | P.A. |
| DIFILIPPO | EMIL | A | DR. | | M.D. |
| THACKER | RICHARD | RANDALL | DR. | | D.O. |
| | | | | | |
| ROTHSTEIN | MARK | TERRY | DR. | | M.D. |
| GIBBS | ELMER | RICKEY | DR. | | M.D. |
| MARKMAN | ALAN | WILLIAM | DR. | | M.D. |
| TROTCHIE | DEBBIE | C | | | APN |
| DYSART | STANLEY | H | | | M.D. |
| | | | | | |
| GRUNERT | GEORGE | M | DR. | | M.D. |
| GOLDBERG | STEVEN | M | DR. | | M.D. |
| DUBOSE | JON | | DR. | | MD |
+------------+---------+---------+-----+-----------+---------+
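For the load itself, a common pattern is to bulk-load the raw file into a staging table first and transform from there. Below is a minimal MySQL sketch assuming the feed arrives as a headered CSV; the table, file, and column names are hypothetical, and the real file has 300+ columns:

-- Hypothetical staging table; the real one would declare all 300+ columns.
CREATE TABLE staging_providers (
    last_name    VARCHAR(100),
    first_name   VARCHAR(100),
    middle_name  VARCHAR(100),
    -- ... remaining columns ...
    credential   VARCHAR(20)
);

-- Bulk-load the raw extract, skipping the header row.
LOAD DATA LOCAL INFILE '/data/providers.csv'
INTO TABLE staging_providers
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;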
I would like to de-normalize by vertically partitioning the row across two tables (the shared join key is sketched after the tables):
TableLeft:
+------------+---------+---------+
| Column1 | Column2 | Column3 |
+------------+---------+---------+
| HUEBERT | ALLISON | L |
| YOUNGMAYKA | CYNTHIA | |
| DIFILIPPO | EMIL | A |
| THACKER | RICHARD | RANDALL |
| | | |
| ROTHSTEIN | MARK | TERRY |
| GIBBS | ELMER | RICKEY |
| MARKMAN | ALAN | WILLIAM |
| TROTCHIE | DEBBIE | C |
| DYSART | STANLEY | H |
| | | |
| GRUNERT | GEORGE | M |
| GOLDBERG | STEVEN | M |
| DUBOSE | JON | |
+------------+---------+---------+
TableRight:
+-----+-----------+---------+
| …. | Column319 | ColumnN |
+-----+-----------+---------+
| DR. | | M.D. |
| | | P.A. |
| DR. | | M.D. |
| DR. | | D.O. |
| | | |
| DR. | | M.D. |
| DR. | | M.D. |
| DR. | | M.D. |
| | | APN |
| | | M.D. |
| | | |
| DR. | | M.D. |
| DR. | | M.D. |
| DR. | | MD |
+-----+-----------+---------+
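As drawn, the two halves have no shared key, so a surrogate key would be needed to join them back together. A minimal DDL sketch, assuming MySQL/InnoDB; record_id and all other names here are hypothetical:

CREATE TABLE table_left (
    record_id  INT UNSIGNED NOT NULL PRIMARY KEY,
    column1    VARCHAR(100),
    column2    VARCHAR(100),
    column3    VARCHAR(100)
) ENGINE=InnoDB;

CREATE TABLE table_right (
    record_id  INT UNSIGNED NOT NULL PRIMARY KEY,
    -- ... columns 4 through 319 ...
    column_n   VARCHAR(20),
    CONSTRAINT fk_right_left
        FOREIGN KEY (record_id) REFERENCES table_left (record_id)
) ENGINE=InnoDB;

-- Giving both tables the same PRIMARY KEY keeps the split strictly 1:1:
-- every row in table_right extends exactly one row in table_left.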
The entire data set will probably be about 10 GB, or approximately 5 million rows, and in practice it may take 4-6 joins to reassemble an entire row of data.
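Reassembling a complete row then becomes a multi-way join, which could be hidden behind a view so queries don't have to repeat it. A sketch over the same hypothetical two-table schema:

-- Hypothetical view that presents the split tables as one wide row.
CREATE VIEW provider_full AS
SELECT l.record_id,
       l.column1, l.column2, l.column3,
       r.column_n
FROM table_left  AS l
JOIN table_right AS r ON r.record_id = l.record_id;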
What are the standard practices for working with data that has to be partitioned this way?
You can view the Excel file with the first 1,000 records here.