I have a highly unstructured file of text data with records that usually span multiple input lines.
- Fields within a record are separated by spaces, as in ordinary text, so each field has to be recognized from context rather than by a CSV-style field separator.
- Many records start with the same two fields:
  - the day of the month (1 to 31);
  - the first three letters of the month name.
- A record that starts with these day and month fields may be followed by records that belong to the same "timestamp" (day/month) but do not repeat it.
- The third field is always free text of many words, such as "operation performed with this tool on that place for this reason".
- Every record ends with one or two numeric fields.
- Every new record starts on a new line (both the first record of a day/month and the records that follow it for the same day/month).
So, to summarize, every record should be transformed into a CSV record similar to this structure: DD,MM,Unstructured text bla bla bla,number1,number2
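A minimal sketch of the field layout summarized above, assuming the text of one record has already been joined into a single string and that the numeric fields are unsigned integers or decimals. The names `split_record` and `to_csv_line` are hypothetical helpers introduced only for illustration; the `csv` module is used because the free text can itself contain commas.

```python
import csv
import io
import re

# Assumption: numeric fields look like "60" or "406.25" (no sign, no exponent).
NUMBER_RE = re.compile(r'^[0-9]+(?:\.[0-9]+)?$')


def split_record(text, max_numbers=2):
    """Split a joined record into (free_text, [trailing numeric tokens])."""
    tokens = text.split()
    numbers = []
    # Peel numeric tokens off the end, at most max_numbers of them.
    while tokens and len(numbers) < max_numbers and NUMBER_RE.match(tokens[-1]):
        numbers.insert(0, tokens.pop())
    return ' '.join(tokens), numbers


def to_csv_line(day, month, text):
    """Build one CSV line: day, month, free text, number1, number2."""
    free_text, numbers = split_record(text)
    # Pad with empty fields so every row has exactly two numeric columns.
    numbers = numbers + [''] * (2 - len(numbers))
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerow([day, month, free_text] + numbers)
    return buf.getvalue()
```

A record with a single number gets an empty last column, and free text that contains a comma is quoted by `csv.writer` rather than split into extra fields.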
An example of the data is the following:
> 20 Sep This is the first record, bla bla bla 10.45
> Text unstructured
> of the second record bla bla
> 406.25 10001
> 6 Oct Text of the third record that spans on many
> lines bla bla bla 60
> 28 Nov Fourth
> record
> 27.43
> Second record of the
> day/month BUT the fifth record of the file 500 90.25
I wrote the following parser in Python, but I cannot figure out how to read several lines of the input file and treat them as a single logical record. I think I need two nested loops, but I cannot work out how to manage the loop indexes.
Thanks a lot for the help!
# I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...
import sys
days_in_month = range(1, 32)  # range() excludes its upper bound, so 32 is needed to include day 31
months_in_year = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
csv_separator = '|'

def is_month(s):
    # True if s is a three-letter month abbreviation such as 'Sep'
    return s in months_in_year


def is_day_in_month(n_int):
    # True if n_int parses as an integer that is a valid day of the month
    try:
        return int(n_int) in days_in_month
    except ValueError:
        return False

# For quick tests, replace sys.argv[1] / sys.argv[2] with 'test1.txt' / 'out_test1.txt'.
# Use "a" instead of "w" to append to the output file.
with open(sys.argv[1], 'r') as file_in, open(sys.argv[2], 'w') as file_out:
    for counter, line in enumerate(file_in, 1):
        line_arr = line.split()
        if not line_arr:
            # Blank line: nothing to parse (line_arr[0] would raise IndexError)
            continue
        if is_day_in_month(line_arr[0]):
            if len(line_arr) > 1 and is_month(line_arr[1]):
                # Date!
                num_month = months_in_year.index(line_arr[1]) + 1
                date_str = '%02d/%02d/2011' % (int(line_arr[0]), num_month) + csv_separator
            elif len(line_arr) > 1:
                # No date, but the first token is a number from 1 to 31
                date_str = ' '.join(line_arr) + csv_separator
            else:
                # No date, and the line is only a number from 1 to 31
                date_str = line_arr[0] + csv_separator
        else:
            # No date (a generic string, or a number higher than 31)
            date_str = ' '.join(line_arr) + csv_separator
        # Works on both Python 2 and 3, unlike "print >> file_out"
        file_out.write(date_str + csv_separator + 'line_number_' + str(counter) + '\n')
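
One possible way to treat several input lines as a single record, without nested loops or explicit indexes, is a single loop that accumulates tokens in a buffer and closes the record when a line ends with a numeric token. This is only a sketch under two assumptions that the description does not guarantee: the trailing numeric fields of a record sit on the same line, and the free text never ends a line with a number in the middle of a record. The names `group_records` and `is_date_prefix` are hypothetical.

```python
import re

# Assumption: numeric fields are unsigned integers or decimals.
NUMBER_RE = re.compile(r'^[0-9]+(?:\.[0-9]+)?$')
DAY_RE = re.compile(r'^[0-9]{1,2}$')
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def is_date_prefix(tokens):
    """True if the tokens start with a day (1 to 31) followed by a month prefix."""
    return (len(tokens) >= 2
            and DAY_RE.match(tokens[0]) is not None
            and 1 <= int(tokens[0]) <= 31
            and tokens[1] in MONTHS)


def group_records(lines):
    """Yield (day, month, joined_text) for each logical record."""
    day = month = None   # last date seen, inherited by records without one
    buffer = []          # tokens of the record currently being assembled
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # A date is only looked for at the start of a new record.
        if not buffer and is_date_prefix(tokens):
            day, month = tokens[0], tokens[1]
            tokens = tokens[2:]
        buffer.extend(tokens)
        # A line that ends with a number is taken as the end of the record.
        if buffer and NUMBER_RE.match(buffer[-1]):
            yield day, month, ' '.join(buffer)
            buffer = []
    if buffer:
        # Trailing lines without a closing number: emit what was collected.
        yield day, month, ' '.join(buffer)
```

Calling `group_records(open(sys.argv[1]))` would then produce one tuple per record, with the records that lack a date inheriting the most recent day and month; the trailing numbers can be separated from the text in a later step.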