
There is a free public data set available that has over 300 fields. I would like to create an ETL process to update and store the data locally in MySQL or SQL Server. Because the records are too big to fit into a single database table, I will probably have to de-normalize, or perhaps normalize a different way. The following is a representation of my data set:

+------------+---------+---------+-----+-----------+---------+
|  Column1   | Column2 | Column3 | ….  | Column319 | ColumnN |
+------------+---------+---------+-----+-----------+---------+
| HUEBERT    | ALLISON | L       | DR. |           | M.D.    |
| YOUNGMAYKA | CYNTHIA |         |     |           | P.A.    |
| DIFILIPPO  | EMIL    | A       | DR. |           | M.D.    |
| THACKER    | RICHARD | RANDALL | DR. |           | D.O.    |
|            |         |         |     |           |         |
| ROTHSTEIN  | MARK    | TERRY   | DR. |           | M.D.    |
| GIBBS      | ELMER   | RICKEY  | DR. |           | M.D.    |
| MARKMAN    | ALAN    | WILLIAM | DR. |           | M.D.    |
| TROTCHIE   | DEBBIE  | C       |     |           | APN     |
| DYSART     | STANLEY | H       |     |           | M.D.    |
|            |         |         |     |           |         |
| GRUNERT    | GEORGE  | M       | DR. |           | M.D.    |
| GOLDBERG   | STEVEN  | M       | DR. |           | M.D.    |
| DUBOSE     | JON     |         | DR. |           | MD      |
+------------+---------+---------+-----+-----------+---------+

I would like to de-normalize in this fashion, splitting the data into two tables.

TableLeft:

+------------+---------+---------+
|  Column1   | Column2 | Column3 |
+------------+---------+---------+
| HUEBERT    | ALLISON | L       |
| YOUNGMAYKA | CYNTHIA |         |
| DIFILIPPO  | EMIL    | A       |
| THACKER    | RICHARD | RANDALL |
|            |         |         |
| ROTHSTEIN  | MARK    | TERRY   |
| GIBBS      | ELMER   | RICKEY  |
| MARKMAN    | ALAN    | WILLIAM |
| TROTCHIE   | DEBBIE  | C       |
| DYSART     | STANLEY | H       |
|            |         |         |
| GRUNERT    | GEORGE  | M       |
| GOLDBERG   | STEVEN  | M       |
| DUBOSE     | JON     |         |
+------------+---------+---------+

TableRight:

+-----+-----------+---------+
| ….  | Column319 | ColumnN |
+-----+-----------+---------+
| DR. |           | M.D.    |
|     |           | P.A.    |
| DR. |           | M.D.    |
| DR. |           | D.O.    |
|     |           |         |
| DR. |           | M.D.    |
| DR. |           | M.D.    |
| DR. |           | M.D.    |
|     |           | APN     |
|     |           | M.D.    |
|     |           |         |
| DR. |           | M.D.    |
| DR. |           | M.D.    |
| DR. |           | MD      |
+-----+-----------+---------+
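If the row is split vertically like this, the two halves need to share a key so the full record can be reassembled with a single join. A minimal sketch of that idea (using SQLite purely for illustration; the table and column names are my own placeholders, not from the data set):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Both halves share the same primary key, so a 1:1 join reassembles the row.
cur.execute("""CREATE TABLE table_left (
    record_id   INTEGER PRIMARY KEY,
    last_name   TEXT, first_name TEXT, middle_name TEXT)""")
cur.execute("""CREATE TABLE table_right (
    record_id   INTEGER PRIMARY KEY REFERENCES table_left(record_id),
    name_prefix TEXT, credential TEXT)""")

cur.execute("INSERT INTO table_left VALUES (1, 'HUEBERT', 'ALLISON', 'L')")
cur.execute("INSERT INTO table_right VALUES (1, 'DR.', 'M.D.')")

# Reconstruct the original wide row with one join on the shared key.
row = cur.execute("""SELECT l.last_name, l.first_name, l.middle_name,
                            r.name_prefix, r.credential
                     FROM table_left l
                     JOIN table_right r USING (record_id)""").fetchone()
print(row)  # ('HUEBERT', 'ALLISON', 'L', 'DR.', 'M.D.')
```

With more than two partitions, each extra table adds one more 1:1 join on the same key, which is where the 4-6 join estimate below comes from.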

The entire data set will probably be around 10 GB, or approximately 5 million rows, and it might take 4-6 joins to reassemble an entire row of data.

What are the standards regarding working with data that needs to be partitioned in such a way?

You can view the Excel file with the first 1,000 records here.

Alex Gordon
  • I'm not suggesting you shouldn't normalize, but why do you say the records are too big to fit into a table? – Rick S Feb 17 '15 at 20:11
  • Let's be clear about this. The data is denormalized and you want to normalize it. That means breaking those columns apart into separate tables, which makes a lot of sense. The problem with providing an answer is that there are 302 columns which need to be divided among tables. This data should be structured just like any other collection of data points. When you start seeing column names like "Healthcare Provider Taxonomy Code_1", that is a good indication it could be a second table. – Sean Lange Feb 17 '15 at 20:13
  • @RickS each RDBMS has a limit to how big a record can be – Alex Gordon Feb 17 '15 at 20:29
  • InnoDB has 1000 column limit and 8000 byte record limit (not counting blobs). You've given us no information about your data to make the same assumption. – Marcus Adams Feb 17 '15 at 20:32
  • @MarcusAdams indeed the rows are greater than 8kb – Alex Gordon Feb 17 '15 at 20:58
  • SQL Server supports 1024 columns in a table, but in your case the practical limit would be around ~335, so technically you may put all this data in one table. Here are more details: http://stackoverflow.com/questions/27791586/does-the-number-of-fields-in-a-table-affect-performance-even-if-not-referenced/27793320#27793320 Whether it is a good idea or not depends on the data structure. Normalization doesn't mean mechanically splitting one wide table into two narrow tables. – Vladimir Baranov Feb 17 '15 at 22:49
  • You want to perform something that is usually called vertical partitioning. Whether it is really needed in your case is not clear. This answer http://stackoverflow.com/questions/20388923/database-partitioning-horizontal-vs-vertical-difference-between-normalizatio provides more details and some links. – Vladimir Baranov Feb 17 '15 at 22:57
  • MySQL is not SQL-Server. A table with 300+ columns is rather wide, but no problem per se. As each record contains one person or organization with their contact data, this is no case of lacking normalization. You *could* however have one table for addresses and another for names well knowing that there will always be one or two entries per record in the main table. Thus you'd have the name structure and the address structure defined just once, which looks nicer but is not necessary. Column1, Column2, etc. are no good column names of course. – Thorsten Kettner Feb 17 '15 at 23:19

1 Answer


Please read about normal forms. There is a plethora of material on the internet about this. For example, this poster seems to be a good overview, or these videos: Normalisation Demonstration, Database Normalization - Explained with Examples.

Generally speaking, one table should be 'about' one entity, such as an address, a person, a business unit, etc.

For example, in this dataset the columns Provider Organization Name (Legal Business Name), Provider Last Name (Legal Name), Provider First Name, Provider Middle Name, Provider Name Prefix Text, Provider Name Suffix Text, and Provider Credential Text seem to be about one entity (the provider?).

The columns Provider Other Organization Name, Provider Other Organization Name Type Code, Provider Other Last Name, Provider Other First Name, Provider Other Middle Name, Provider Other Name Prefix Text, Provider Other Name Suffix Text, Provider Other Credential Text, and Provider Other Last Name Type Code seem to be about a second provider entity.

The columns Provider First Line Business Mailing Address, Provider Second Line Business Mailing Address, Provider Business Mailing Address City Name, Provider Business Mailing Address State Name, Provider Business Mailing Address Postal Code, Provider Business Mailing Address Country Code (If outside U.S.), Provider Business Mailing Address Telephone Number, and Provider Business Mailing Address Fax Number seem to be about an address entity.

The column set Provider Business Practice Location... seems to be another address.

Authorized Official Last Name, Authorized Official First Name, Authorized Official Middle Name, Authorized Official Title or Position, and Authorized Official Telephone Number could be another entity.
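The entity groupings above could be expressed as separate tables linked by foreign keys. A sketch only (SQLite syntax for illustration; the table names, column names, and key choices are my assumptions, not part of the data set):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical normalized layout: one row per entity, linked by foreign keys.
# Both address roles (mailing and practice location) reuse one address table.
cur.executescript("""
CREATE TABLE address (
    address_id   INTEGER PRIMARY KEY,
    line1 TEXT, line2 TEXT, city TEXT, state TEXT,
    postal_code TEXT, country_code TEXT, phone TEXT, fax TEXT);

CREATE TABLE provider (
    provider_id  INTEGER PRIMARY KEY,
    last_name TEXT, first_name TEXT, middle_name TEXT,
    name_prefix TEXT, name_suffix TEXT, credential TEXT,
    mailing_address_id  INTEGER REFERENCES address(address_id),
    practice_address_id INTEGER REFERENCES address(address_id));
""")

cur.execute("INSERT INTO address VALUES (1, '1 MAIN ST', NULL, 'SPRINGFIELD', "
            "'IL', '62701', NULL, NULL, NULL)")
cur.execute("INSERT INTO provider VALUES (1, 'HUEBERT', 'ALLISON', 'L', "
            "'DR.', NULL, 'M.D.', 1, 1)")

# The mailing address resolves through the shared address table.
city = cur.execute("""SELECT a.city
                      FROM provider p
                      JOIN address a ON a.address_id = p.mailing_address_id""").fetchone()[0]
print(city)  # SPRINGFIELD
```

The same pattern would cover the authorized-official columns: another narrow table with its own key, referenced from the provider row.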

Healthcare Provider Taxonomy Code_n, Provider License Number_n, Provider License Number State Code_n, and Healthcare Provider Primary Taxonomy Switch_n, where n is 1 to 15, seem to be a list of instances of another entity.

The same goes for Other Provider Identifier_n, Other Provider Identifier Type Code_n, Other Provider Identifier State_n, and Other Provider Identifier Issuer_n,

and for Healthcare Provider Taxonomy Group_n, which seems to be a one-field entity.
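Those numbered repeating groups are the classic case for a child table: instead of 15 sets of columns, store one row per non-empty group, keyed by the provider and the sequence number. A sketch of that unpivoting step during the ETL load (SQLite for illustration; table and column names are my own placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per (provider, n) instead of 15 column groups per provider row.
cur.execute("""CREATE TABLE provider_taxonomy (
    provider_id    INTEGER,
    seq            INTEGER,   -- the _n suffix from the source column names
    taxonomy_code  TEXT,
    license_number TEXT,
    license_state  TEXT,
    primary_switch TEXT,
    PRIMARY KEY (provider_id, seq))""")

# A source record arrives as flat columns; groups 2..15 are empty here.
source_row = {
    "Healthcare Provider Taxonomy Code_1": "207Q00000X",
    "Provider License Number_1": "12345",
    "Provider License Number State Code_1": "TX",
    "Healthcare Provider Primary Taxonomy Switch_1": "Y",
}

for n in range(1, 16):
    code = source_row.get(f"Healthcare Provider Taxonomy Code_{n}", "")
    if not code:
        continue  # skip empty repeating groups entirely
    cur.execute("INSERT INTO provider_taxonomy VALUES (?, ?, ?, ?, ?, ?)", (
        1, n, code,
        source_row.get(f"Provider License Number_{n}", ""),
        source_row.get(f"Provider License Number State Code_{n}", ""),
        source_row.get(f"Healthcare Provider Primary Taxonomy Switch_{n}", "")))

count = cur.execute("SELECT COUNT(*) FROM provider_taxonomy").fetchone()[0]
print(count)  # 1
```

Since most providers use only a few of the 15 slots, this also shrinks the stored data considerably compared with keeping 60 mostly-empty columns per row.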

Piotr Sobiegraj