Import data from csv into database when not all columns are guaranteed

Question

I am trying to build an automatic feature for a database that takes NOAA weather data and imports it into our own database tables.

Currently we have 3 steps:

1. Import the data literally into its own table to preserve the original data
2. Copy it's data into a table that better represents our own data in structure
3. Then convert that table into our own data

The problem I am having stems from the data that NOAA gives us. It comes in the following format:

Station    Station_Name    Elevation    Latitude    Longitude    Date   MXPN    Measurement_Flag    Quality_Flag    Source_Flag    Time_Of_Observation   ...

Starting with MXPN (Maximum temperature for water in a pan) which for example is comprised of it's column and the 4 other columns after it, it repeats that same 5 columns for each form of weather observation. The problem though is that if a particular type of weather was not observed in any of the stations reported, that set of 5 columns will be completely omitted.

For example if you look at Central Florida stations, you will find no SNOW (Snowfall measured in mm). However, if you look at stations in New Jersey, you will find this column as they report snowfall. This means a 1:1 mapping of columns is not possible between different reports, and the order of columns may not be guaranteed.

Even worse, some of the weather types include wild cards in their definition, e.g. SN*# where * is a number from 0-8 representing the type of ground, and # is a number 1-7 representing the depth at which soil temperature was taken for the minimum soil temperature, and we'd like to collect these together.

All of these are column headers, and my instinct is to build a small Java program to map these properly to our data set as we'd like it. However, my superior believes it may be possible to have the database do this on a mass import, but he does not know how to do it.

Is there a way to do this as a mass import, or is it best for me to just write the Java program to convert the data to our format?

Systems in use: MariaDB for the database. Centos7 for the operating system (if it really becomes an issue) Java is being done with JPA and Spring Boot, with hibernate where necessary.

Do the reports also vary due to the time of year - does the NJ report skip the snow columns in August? — HLGEM, Jul 02 '18 at 18:35
Yes, in short, if nothing is reported in one of the sections during a particular window (set of stations and date range) it will not have a column present. — Chthonic One, Jul 02 '18 at 18:41
Are you looking for a way to import any weather data file into a single staging table with columns that are a superset of all data files? — rd_nielsen, Jul 02 '18 at 18:58
Yes to all those fancy words I was trying to think of when writing the question. There is one exception, though it does not have to be one, specifically the fields with format of SN*#, SX*#, WT##, and WV## which might appear multiple times, and we were hoping to compress them, but this does not need to occur in the staging table. — Chthonic One, Jul 02 '18 at 19:08
Just another thought I should clarify, there are a lot of duplicate columns. The Measurement_Flag, Quality_Flag, etc. columns are all duplicated for each of the weather types. So any solution needs to account for this in mapping them so it maps them immediately after the weather type in the form. — Chthonic One, Jul 02 '18 at 19:12
A `sed` or `awk` script could be used to eliminate the duplicate column headers by adding suffixes or prefixes as appropriate. The IMPORT metacommand of execsql.py (http://execsql.readthedocs.io/en/latest/metacommands.html#import) will import delimited text files with fewer columns than the target (i.e., staging) table. You'll need Python 2.7 to run it; use `pip install execsql` to install it. You can write the SQL to partition the data into other tables in the same script file as the IMPORT metacommand. — rd_nielsen, Jul 02 '18 at 19:26
Well, it sounds like that is similar to my idea of just having the Java program do it, which would be simpler in this context as Java is what we are using to implement the database interface right now. What my supervisor thinks is possible is to write a SQL command that tells the database itself to do the work. He doesn't know for sure if it is possible, but he believes it might be, so I am asking here. — Chthonic One, Jul 02 '18 at 19:52
If you can describe in more detail what "do the work" means, then you stand a better chance of getting a more specific answer. — rd_nielsen, Jul 02 '18 at 20:13
https://stackoverflow.com/questions/9608838/import-fixed-width-text-to-sql <-- This represents something similar to what we want. In fact we were going to use this with their Stations listing which is fixed width. The idea that he was suggesting as best as I understand it is a SQL command to take the column named X and the following 4 columns, and map it to a specific 5 columns in our master table for that field, and do that repeatedly, skipping sets that don't exist. — Chthonic One, Jul 02 '18 at 20:26

score 0 · Answer 1 · answered Jul 03 '18 at 03:43

You are creating a new table per each file.

I presume that the first 6 fields are always present, and that you have 0 or more occurrences of the next 5 fields. if you are using SQL Server i would approach it as follows,

Query the information_schema catalog to get a count of the fields in the table. If the count= 6 then no observations are present, if 11 columns ,then you have 1 observation, if 17 then you have 2 observations, etc.
Now that you know the number of observations you can write some SQL that will loop the over the observations and insert them into a child table with a link back to a parent table which has the 1st 6 fields.

apologies if my assumptions are way off.

-HTH

Thank you for you answer, while we aren't using MS SQL Server, some of your answer seemed to ring some bells in how he wanted to do this. We are currently using MariaDB which is a SQL type server, but according to him we will have to operate a little differently. — Chthonic One, Jul 05 '18 at 21:17

Import data from csv into database when not all columns are guaranteed

1 Answers1