I have a directory of over 10,000 large csv files (~200-600 MB per csv file) that were each compressed into an individual tar.gz file (~40-80 MB per tar.gz file). The csv files were created by a script that pulls the last 8 years of weather data from an API. Over those 8 years some new attributes (columns) were added, but the script should have added those columns and filled them with nulls.

Each csv file should have the same column headers, but they may be in different orders.

I need to review the data for consistency and ensure that all of the data was pulled into the csv correctly, and that the tar process did not corrupt any of the files. To do this, my idea was to create one table with all of the file names, the columns, and the counts of null values in each column.
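One piece of this — catching archives that the tar/gzip step corrupted — can be checked independently of the CSV contents: the gzip format stores a CRC32 of the uncompressed data, so simply decompressing each file to the end verifies it. A rough sketch, assuming the archives sit in a hypothetical `data/` directory:

```python
import gzip
from pathlib import Path

def is_intact(archive_path):
    """True if the .tar.gz decompresses cleanly (the gzip CRC check passes)."""
    try:
        with gzip.open(archive_path, "rb") as f:
            # Stream in 1 MB chunks so a 600 MB CSV never sits in memory at once.
            while f.read(1 << 20):
                pass
        return True
    except (OSError, EOFError):
        return False

if __name__ == "__main__":
    bad = [p for p in Path("data").glob("*.tar.gz") if not is_intact(p)]
    print("corrupted archives:", bad)
```

This only proves the compressed stream is intact; the null-count table described below is still needed to catch problems from the API pull itself.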

I need to do this using Python, but can utilize Anaconda, Jupyter Notebook, or any other IDE or terminal.

Here was my first thought on how to do this, but I bet there is a better/faster way to accomplish it.

- create a master_df with all of the column names plus a column for the file name
- create a loop
    - extract the csv from the tar
    - load the csv into a pandas df
    - add the filename as the first column
    - somehow count the number of null values in each column
    - save the null counts into the master_df
- compare the rows for any discrepancies
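A minimal sketch of that loop, assuming each tar.gz holds exactly one CSV and the archives live in a hypothetical `data/` directory. `tarfile` can read the member without extracting it to disk, and `isna().sum()` gives the per-column null counts in one call (`na_values=["--"]` is there because `--` appears as a null marker in the sample rows below):

```python
import tarfile
from pathlib import Path

import pandas as pd

def null_counts(archive_path):
    """Per-column null counts for the CSV inside one tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Assume each archive holds exactly one CSV member.
        member = tar.getmembers()[0]
        with tar.extractfile(member) as f:
            df = pd.read_csv(f, na_values=["--"])
    counts = df.isna().sum().to_dict()
    counts["filename"] = Path(archive_path).name
    return counts

if __name__ == "__main__":
    rows = [null_counts(p) for p in sorted(Path("data").glob("*.tar.gz"))]
    master_df = pd.DataFrame(rows).set_index("filename")
    master_df.to_csv("master_file.csv")
```

Because `DataFrame(rows)` aligns the dicts by key, differing column orders across the csv files are handled automatically.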

Thank you for your ideas!!!

master_file (one row for each csv file):

filename,U component of wind,latitude,longitude,V component of wind,Wind speed (gust),Temperature,Relative humidity,Ozone mixing ratio,Absolute vorticity,Cloud mixing ratio,Haines Index,Snow depth,2 metre temperature,2 metre dewpoint temperature,Apparent temperature,Percent frozen precipitation,Sunshine Duration,Convective available potential energy,Precipitable water,Cloud water,Total ozone,Storm relative helicity,U-component storm motion,V-component storm motion,Vertical speed shear,Pressure,Specific humidity,Pressure reduced to MSL,timestamp,Rain mixing ratio,Categorical snow,Categorical rain,Ice water mixing ratio,Total Cloud Cover,High cloud cover,Low cloud cover,Precipitation rate,Vegetation,Visibility
gfs.0p25.2016070300.f000.grib2.csv.tar.gz,0,0,0,0,0,0,0,0,0,0,1038242,1038242,0,0,0,0,1038242,0,0,0,0,1038242,1038242,1038242,1038242,1038242,1038242,1038242,0,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242
gfs.0p25.2017081118.f000.grib2.csv.tar.gz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,0,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242
gfs.0p25.2017091318.f000.grib2.csv.tar.gz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242,1038242
gfs.0p25.2021111706.f000.grib2.csv.tar.gz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1038242,1038242,0,1038242,0
gfs.0p25.2022070418.f000.grib2.csv.tar.gz,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
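Once a master table like the one above exists, the comparison step can be automated too. This sketch (function and variable names are illustrative) pulls out the columns whose null counts differ between files and flags files where a column is null for every row:

```python
import pandas as pd

def find_discrepancies(master_df, expected_rows):
    """Return (columns whose null counts vary across files,
    files containing at least one fully-null column)."""
    varying = master_df.columns[master_df.nunique() > 1].tolist()
    fully_null_files = master_df.index[master_df.eq(expected_rows).any(axis=1)].tolist()
    return varying, fully_null_files

if __name__ == "__main__":
    master_df = pd.read_csv("master_file.csv", index_col="filename")
    # 1038242 rows per file, judging by the counts in the sample above
    varying, suspects = find_discrepancies(master_df, expected_rows=1038242)
    print("columns with inconsistent null counts:", varying)
    print("files with an entirely-null column:", suspects)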
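Once a master table like the one above exists, the comparison step can be automated too. This sketch (function and variable names are illustrative) pulls out the columns whose null counts differ between files and flags files where a column is null for every row:

```python
import pandas as pd

def find_discrepancies(master_df, expected_rows):
    """Return (columns whose null counts vary across files,
    files containing at least one fully-null column)."""
    varying = master_df.columns[master_df.nunique() > 1].tolist()
    fully_null_files = master_df.index[master_df.eq(expected_rows).any(axis=1)].tolist()
    return varying, fully_null_files

if __name__ == "__main__":
    master_df = pd.read_csv("master_file.csv", index_col="filename")
    # 1038242 rows per file, judging by the counts in the sample above
    varying, suspects = find_discrepancies(master_df, expected_rows=1038242)
    print("columns with inconsistent null counts:", varying)
    print("files with an entirely-null column:", suspects)
```

A column that is fully null in some files and populated in others most likely marks an attribute added partway through the 8-year span.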

csv file example (rows greatly condensed):

U component of wind,latitude,longitude,V component of wind,Wind speed (gust),Temperature,Relative humidity,Ozone mixing ratio,Absolute vorticity,Cloud mixing ratio,Haines Index,Snow depth,2 metre temperature,2 metre dewpoint temperature,Apparent temperature,Percent frozen precipitation,Sunshine Duration,Convective available potential energy,Precipitable water,Cloud water,Total ozone,Storm relative helicity,U-component storm motion,V-component storm motion,Vertical speed shear,Pressure,Specific humidity,Pressure reduced to MSL,timestamp,Rain mixing ratio,Categorical snow,Categorical rain,Ice water mixing ratio,Total Cloud Cover,High cloud cover,Low cloud cover,Precipitation rate,Vegetation,Visibility
-2.068722534,3.25,109.75,-0.593682861,2.112927246,186.1549609,0,2.94E-07,-1.70E-05,0,--,--,302.2038818,296.9999878,306.7051636,-50.0000061,6609,1479,51.48387609,0,263.386084,3.693237305,-2.701022949,4.627927246,0.014727142,20646.275,2.94E-06,100593,7/3/2022 12:00,0,0,0,0,0,100,0,0,--,24135.00724
-2.068722534,3.25,110,-1.093682861,2.212927246,186.2349609,0,2.94E-07,-3.00E-05,0,--,--,302.2038818,297.0999878,306.7051636,-50.0000061,6600,1512,51.09347609,0,263.586084,4.693237305,-2.521022949,4.537927246,0.013727142,20175.875,2.94E-06,100590.2,7/3/2022 12:00,0,0,0,0,0,57.4,0,0,--,24135.00724
-1.868722534,3.25,110.25,-1.393682861,2.212927246,186.3249609,0,2.94E-07,-3.60E-05,0,--,--,302.2038818,296.9999878,306.6051636,-50.0000061,6518,1470,50.17507609,0,263.786084,0.693237305,-1.721022949,4.717927246,0.005727142,19412.675,2.94E-06,100583,7/3/2022 12:00,0,0,0,0,0,5.6,0,0,--,24135.00724
- It's not clear that pandas is really helping you here, but the fundamental loop is fine. – Tim Roberts Dec 15 '22 at 21:51
- If you are using Python, have a look at the [tarfile](https://docs.python.org/3/library/tarfile.html) module that lets you read files directly from it without extracting them first. There seems to be a [csv](https://docs.python.org/3/library/csv.html) module as well that you can feed the extracted data into. – nafmo Dec 15 '22 at 21:56
- You can also target specific files in the tarfile if there happen to be more than needed: https://stackoverflow.com/a/2018576/868044 – Dan Dec 15 '22 at 23:11
- Please [edit] the question to show different examples of the headers and what your expected output format should be – Martin Evans Dec 16 '22 at 09:07
- @MartinEvans I edited the question for much greater detail, modified the pseudo code, and added an example csv file and an example output table. – Jason Rogers Dec 16 '22 at 15:59
