I am very new to Python and (as you will see) have lots of learning to do! Currently, I’m loading a space-separated file (not tab-separated), with varying numbers of spaces between the header entries (and multiple header lines), into a data array. My goal is to create a 2D matrix with columns corresponding to the header labels and rows corresponding to the time entries. The 4th line of the input text file has the list of variable names, and lines 6 and onward hold the data.

After a lot of searching, trial and error, I’ve come up with this solution:

    data = []
    with open('the_text_file.txt', 'r') as the_file:
        all_data = [line.strip() for line in the_file.readlines()]
        header = all_data[4]
        data = all_data[6:]

data then becomes a 'list' object as follows:

 ['0.00 2017  11  21   0  30   0.0000                         
    0.880175032068E+006                         0.000000000000E+000                         0.000000000000E+000                         0.100000000000E+004                         0.100000000000E+004                         0.891160617746E+005                         0.891160617746E+005                         0.112213217246E+002                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.880175032068E+003                         0.234412258842E+002                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                        -0.990000000000E+016                        -0.990000000000E+016                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000                         0.000000000000E+000',  '3600.00 2017  11  21   1  30   
0.0000    ...]

header is a single space-separated string:

    'Seconds   YY  MM  DD  hh  mm       ss                                       MassOil                               VolOilBeached                               VolumeBeached                                   VolumeOil                                      Volume                                        Area                                TeoricalArea                                   Thickness                                 MEvaporated                                 VEvaporated                                FMEvaporated                                  MDispersed                                  VDispersed                                 FMDispersed                                 MSedimented                                 VSedimented                                FMSedimented                                  MDissolved                                  VDissolved                                 FMDissolved                                   MChemDisp                                   VChemDisp                                  FMChemDisp                               MOilRecovered                               VOilRecovered                              FMOilRecovered                               MWaterContent                               VWaterContent                                     Density                                   Viscosity                                        MBio                                        VBio                                       FMBio                      CharacteristicDiameter                                      P_Star                                AnalyteMass1                                AnalyteMass2                                AnalyteMass3                                AnalyteMass4                                AnalyteMass5                                 AnalyteBio1                                 AnalyteBio2                                 AnalyteBio3                                 AnalyteBio4                    
             AnalyteBio5' 

What I want to end up with is a string array of names corresponding to the columns of the data. If I try

    header_arr = []
    header_arr = header.split(' ')
    header_arr 

Then I get

    ['Seconds',
     '',
     '',
     'YY',
     '',
     'MM',
     '',
     'DD',
     '',
     'hh',
     '',
     'mm',
     '',
     '',
     '',
     '',
     '',
     '', ...]

Splitting on '\t' doesn’t split anything because the file isn’t tab-delimited.

I’ve referred to this post for removing the '' entries from the list and have found this code to work in creating a string array:

    import numpy as np

    # Order header into list array by splitting up string
    header_arr = []
    header_arr = header.split(' ')
    # Remove empty entries from list
    header_arr = np.asarray([x for x in header_arr if x != ''])
    header_arr

with the result looking like:

    array(['Seconds', 'YY', 'MM', 'DD', 'hh', 'mm', 'ss', 'MassOil',
           'VolOilBeached', 'VolumeBeached', 'VolumeOil', 'Volume', 'Area',
           'TeoricalArea', 'Thickness', 'MEvaporated', 'VEvaporated',
           'FMEvaporated', 'MDispersed', 'VDispersed', 'FMDispersed',
           'MSedimented', 'VSedimented', 'FMSedimented', 'MDissolved',
           'VDissolved', 'FMDissolved', 'MChemDisp', 'VChemDisp', 'FMChemDisp',
           'MOilRecovered', 'VOilRecovered', 'FMOilRecovered', 'MWaterContent',
           'VWaterContent', 'Density', 'Viscosity', 'MBio', 'VBio', 'FMBio',
           'CharacteristicDiameter', 'P_Star', 'AnalyteMass1', 'AnalyteMass2',
           'AnalyteMass3', 'AnalyteMass4', 'AnalyteMass5', 'AnalyteBio1',
           'AnalyteBio2', 'AnalyteBio3', 'AnalyteBio4', 'AnalyteBio5'],
          dtype='<U22')
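
As a side note, calling `split()` with no argument splits on any run of whitespace and never yields empty strings, so the filtering step may not be needed. A minimal sketch, using a shortened, made-up header string rather than the real one:

```python
# Shortened, made-up header line with uneven spacing (not the real file).
header = "Seconds   YY  MM  DD  hh  mm       ss         MassOil"

# split(' ') breaks on every single space and leaves '' entries behind.
with_empties = header.split(' ')

# split() with no argument treats any run of whitespace as one separator.
header_names = header.split()

# Filtering the '' entries out of the first result gives the same list.
filtered = [x for x in with_empties if x != '']
print(header_names)
```

The same call should work on each data line, since those are also separated by runs of spaces.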

data is a list of strings rather than just a string, so I added a loop:

    # convert data to numpy array
    data_tmp_2d = []
    data_array = np.asarray(data)
    ntimes = data_array.size
    # skip the last line, which marks the end of the file
    for timestep in data_array[0:ntimes-1]:
        data_entry = timestep.split(' ')
        # np.append flattens its inputs, so this builds one long 1D vector
        data_tmp_2d = np.append(data_tmp_2d, np.asarray([x for x in data_entry if x != '']))

    data_tmp_2d.shape

Note: I used “ntimes-1” because the last line in the data is a string that indicates end of file, so I want to include all lines in the file but the very last.

The result is a vector of length 9986 with ntimes = 197, which means the indexing is off, as 9986/197 = 50.69. I haven’t been able to figure out why the length of the resulting vector isn’t evenly divisible by the number of time entries, and I’m guessing that I’m making this way too complicated. I’ve used up the time I can spend on this for now and am hoping someone can give me a tip for building the 2D matrix I want in a better way that actually works! E.g., is there a way to do this more easily with xarray?
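
One way to check where the mismatch comes from is to count how many whitespace-separated values each line has before stacking anything. The sketch below assumes every good data row has the same number of values as there are header names; `rows_to_matrix` and the sample lines are made up for illustration and are not taken from the real file.

```python
from collections import Counter

import numpy as np


def rows_to_matrix(lines, n_columns):
    """Split lines on whitespace; stack only rows that have n_columns values."""
    token_rows = [line.split() for line in lines]
    # How many lines have 0, 1, 2, ... values: anything other than
    # n_columns points at a line that would throw the indexing off.
    counts = Counter(len(tokens) for tokens in token_rows)
    good_rows = [[float(t) for t in tokens]
                 for tokens in token_rows if len(tokens) == n_columns]
    return np.array(good_rows), counts


# Made-up lines: two good rows, one blank line, one end-of-file marker.
sample_lines = [
    "   0.00 2017  11   0.880175032068E+006",
    "3600.00 2017  11   0.870000000000E+006",
    "",
    "END",
]
matrix, counts = rows_to_matrix(sample_lines, n_columns=4)
print(matrix.shape)
print(counts)
```

If `counts` shows more than one row length, those odd lines are a likely reason the flattened vector does not divide evenly.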

  • Given that I’m showing names of variables that include “oil,” and that this word is non-neutral, I will simply add that this is for a not-for-profit research project to model the impacts of oil spills on an oceanic environment. I typically work with MATLAB and am learning a new tool (Python) for reviewing results. Thanks for your help! – ocean pillar Jan 03 '20 at 23:28
  • Short of outright cursing, I don't think it really matters what your column names are. At the same time, you should probably always attempt to post fake data that illustrates your issue, with generic column names, and randomly generated data. – Mad Physicist Jan 03 '20 at 23:35
  • Also, nice question. It was remarkably hard (took > 2min) to find a good duplicate. – Mad Physicist Jan 03 '20 at 23:36
  • For a start I'd try `np.genfromtxt` with a big enough `skip_header` value to skip the header line(s). If the file is clean enough the result should be a 2d array of floats. The variable width delimiters shouldn't be a problem, since it's the default 'white-space'. – hpaulj Jan 04 '20 at 00:42
  • An alternative is a 1d structured array, with field names derived from the header. I think you can use `skip_header=3`, `names=True` and `dtype=None`, to get the names from the 4th line. I'm making some guesses regarding confusing parts of your description. – hpaulj Jan 04 '20 at 00:45
  • Thank you @hpaulj! Your solution was right on. That’s just the kind of tip that I needed! Very much appreciated. – ocean pillar Jan 07 '20 at 18:31
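
The `np.genfromtxt` suggestions in the comments could look roughly like the sketch below. The miniature file is made up, and its layout (four preamble lines, the names on the fifth line, a blank line, the data, then one closing line) is an assumption based on the indices used in the question, so `skip_header` and `skip_footer` will likely need adjusting for the real file.

```python
import io

import numpy as np

# Made-up miniature of the file; the real layout may differ.
sample = """preamble line 1
preamble line 2
preamble line 3
preamble line 4
Seconds   YY  MM              MassOil

   0.00 2017  11  0.880175032068E+006
3600.00 2017  11  0.870000000000E+006
7200.00 2017  11  0.860000000000E+006
END OF FILE
"""

# Option 1: plain 2D float array. Skip everything before the data and the
# closing line; runs of whitespace are the default delimiter.
values = np.genfromtxt(io.StringIO(sample), skip_header=6, skip_footer=1)

# Option 2: 1D structured array whose field names come from the names line.
table = np.genfromtxt(io.StringIO(sample), skip_header=4, names=True,
                      dtype=None, skip_footer=1)

print(values.shape)         # (rows, columns)
print(table.dtype.names)    # field names read from the file
print(table["MassOil"])     # one column, selected by name
```

With the structured array a column is selected by name, as in `table["MassOil"]`; with the plain 2D array the names would have to be kept in a separate list.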
