I am very new to Python and (as you will see) have lots of learning to do! Currently, I'm loading a space-separated (not tab-separated) text file, with runs of spaces of varying length in the header and multiple header lines, into a data array. My goal is to create a 2D matrix with columns corresponding to the header labels and rows corresponding to the time entries. The 4th line of the input text file has the list of variable names, and lines 6 onward hold the data.
After a lot of searching, trial and error, I’ve come up with this solution:
data = []
with open('the_text_file.txt', 'r') as the_file:
    all_data = [line.strip() for line in the_file.readlines()]
    header = all_data[4]
    data = all_data[6:]
data
then becomes a list object as follows:
['0.00 2017 11 21 0 30 0.0000
0.880175032068E+006 0.000000000000E+000 0.000000000000E+000 0.100000000000E+004 0.100000000000E+004 0.891160617746E+005 0.891160617746E+005 0.112213217246E+002 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.880175032068E+003 0.234412258842E+002 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 -0.990000000000E+016 -0.990000000000E+016 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000 0.000000000000E+000', '3600.00 2017 11 21 1 30 0.0000 ...]
header
is a single space-separated string:
'Seconds YY MM DD hh mm ss MassOil VolOilBeached VolumeBeached VolumeOil Volume Area TeoricalArea Thickness MEvaporated VEvaporated FMEvaporated MDispersed VDispersed FMDispersed MSedimented VSedimented FMSedimented MDissolved VDissolved FMDissolved MChemDisp VChemDisp FMChemDisp MOilRecovered VOilRecovered FMOilRecovered MWaterContent VWaterContent Density Viscosity MBio VBio FMBio CharacteristicDiameter P_Star AnalyteMass1 AnalyteMass2 AnalyteMass3 AnalyteMass4 AnalyteMass5 AnalyteBio1 AnalyteBio2 AnalyteBio3 AnalyteBio4 AnalyteBio5'
What I want to end up with is a string array of names corresponding to the columns of the data. If I try
header_arr = []
header_arr = header.split(' ')
header_arr
Then I get
['Seconds',
'',
'',
'YY',
'',
'MM',
'',
'DD',
'',
'hh',
'',
'mm',
'',
'',
'',
'',
'',
'', ...]
Splitting on '\t' doesn't split anything because the file isn't tab-delimited.
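For reference, calling split() with no argument splits on any run of whitespace and drops the empty strings, so no separate filtering step is needed. A minimal sketch, assuming a shortened, made-up header string rather than the real 52-name line:

```python
# Shortened, illustrative header; the spacing between names varies on purpose.
header = "Seconds   YY  MM  DD  hh  mm      ss"

# With no separator argument, split() treats any run of whitespace as one delimiter.
header_arr = header.split()
print(header_arr)  # ['Seconds', 'YY', 'MM', 'DD', 'hh', 'mm', 'ss']
```

The same call also handles tabs, so it would work whether or not the file turned out to be tab-delimited.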
I've referred to this post for removing the '' entries from the list, and have found that this code works for creating a string array:
import numpy as np

# Split the header string into a list of names
header_arr = header.split(' ')
# Remove empty entries from the list
header_arr = np.asarray([x for x in header_arr if x != ''])
header_arr
with the result looking like:
array(['Seconds', 'YY', 'MM', 'DD', 'hh', 'mm', 'ss', 'MassOil',
'VolOilBeached', 'VolumeBeached', 'VolumeOil', 'Volume', 'Area',
'TeoricalArea', 'Thickness', 'MEvaporated', 'VEvaporated',
'FMEvaporated', 'MDispersed', 'VDispersed', 'FMDispersed',
'MSedimented', 'VSedimented', 'FMSedimented', 'MDissolved',
'VDissolved', 'FMDissolved', 'MChemDisp', 'VChemDisp', 'FMChemDisp',
'MOilRecovered', 'VOilRecovered', 'FMOilRecovered', 'MWaterContent',
'VWaterContent', 'Density', 'Viscosity', 'MBio', 'VBio', 'FMBio',
'CharacteristicDiameter', 'P_Star', 'AnalyteMass1', 'AnalyteMass2',
'AnalyteMass3', 'AnalyteMass4', 'AnalyteMass5', 'AnalyteBio1',
'AnalyteBio2', 'AnalyteBio3', 'AnalyteBio4', 'AnalyteBio5'],
dtype='<U22')
data
is a list of strings rather than a single string, so I added a loop:
# Convert data to a NumPy array
data_tmp_2d = []
data_array = np.asarray(data)
ntimes = data_array.size
for timestep in data_array[0:ntimes-1]:
    data_entry = timestep.split(' ')
    data_tmp_2d = np.append(data_tmp_2d, np.asarray([x for x in data_entry if x != '']))
data_tmp_2d.shape
Note: I used "ntimes-1" because the last line in the data is a string that indicates the end of the file, so I want to include every line except the very last.
The result is a vector of length 9986 with ntimes = 197, which means the indexing is off, since 9986/197 = 50.69. I haven't been able to figure out why the length of the resulting vector isn't evenly divisible by the number of entries, and I'm guessing that I'm making this far too complicated. I've also used up all the time I can spend on this for now, and am hoping someone can give me a tip for building the 2D matrix in a better way that actually works. For example, is there an easier way to do this with xarray?
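One way to narrow this down might be to check how many fields each line actually contains. np.append without an axis argument flattens everything into a single 1D vector, so any line with a different number of fields (a blank line, a trailer line) silently shifts the total. The sketch below is only an illustration: split_rows is a hypothetical helper, and the sample lines are made up rather than taken from the real file.

```python
import numpy as np

def split_rows(lines, n_columns):
    """Split each line on whitespace and separate the rows that have
    n_columns fields from those that do not (hypothetical helper)."""
    good, bad = [], []
    for line_no, line in enumerate(lines):
        fields = line.split()
        if len(fields) == n_columns:
            good.append(fields)
        else:
            # Record where the mismatch is and how many fields were found.
            bad.append((line_no, len(fields)))
    # Numeric strings such as '0.5E+003' are converted to floats here.
    return np.array(good, dtype=float), bad

# Made-up sample standing in for the data lines of the real file.
sample = [
    "0.00 2017 11   0.5E+003",
    "3600.00 2017 11 0.6E+003",
    "",
    "END_OF_FILE",
]
matrix, mismatched = split_rows(sample, n_columns=4)
print(matrix.shape)  # (2, 4)
print(mismatched)    # [(2, 0), (3, 1)]
```

With the real file, n_columns would presumably be the length of header_arr, and the mismatched list would show which lines keep the total from dividing evenly.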