Reading records spread across multiple input lines in Python

Question

I have a highly unstructured file of text data with records that usually span multiple input lines.

Fields within a record are separated by spaces, as in normal text, so each field has to be recognized from its content rather than from a CSV-style field separator.
Many records also begin with the same two fields:
- the number of the month day (1 to 31);
- the first three letters of the Month.
A record that carries the day-of-month and month-prefix fields may be followed by other records that belong to the same "timestamp" (day/month) but do not repeat that information.
The third field is always free text of many words, such as "operation performed with this tool on that place for this reason".
Every record ends with one or two numeric fields.
Every new record starts on a new line (both the first record of a day/month and the following records of the same day/month).

So, to summarize, every record should be transformed into a CSV record similar to this structure: DD,MM,Unstructured text bla bla bla,number1,number2

An example of the data is the following:

> 20 Sep This is the first record, bla bla bla 10.45 
> Text unstructured
> of the second record bla bla
> 406.25 10001 
> 6 Oct Text of the third record thatspans on many 
> lines bla bla bla 60 
> 28 Nov Fourth 
> record 
> 27.43 
> Second record of the
> day/month BUT the fifth record of the file 500 90.25

I developed the following parser in Python, but I cannot figure out how to read several input lines and treat them as a single logical record. I think I need two nested loops, but I cannot work out how to handle the loop indexes.

Thanks a lot for the help!

# I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...

import sys

days_in_month = range(1, 32)  # 1..31 inclusive; range(1, 31) would miss the 31st
months_in_year = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

csv_separator = '|'

def is_month(s):
    return s in months_in_year


def is_day_in_month(n_int):
    try:
        return int(n_int) in days_in_month
    except ValueError:
        return False

#file_in = open('test1.txt','r')
file_in = open(sys.argv[1],'r')
#file_out = open("out_test1.txt", "w") # Use "a" instead of "w" to append to file
file_out = open(sys.argv[2], "w") # Use "a" instead of "w" to append to file

counter = 0
for line in file_in:
    counter = counter + 1
    line_arr = line.split()
    if not line_arr:
        continue  # skip blank lines, where line_arr[0] would raise IndexError
    date_str = ''
    if is_day_in_month(line_arr[0]):
        if len(line_arr) > 1 and is_month(line_arr[1]):
            # Date!
            num_month = months_in_year.index(line_arr[1]) + 1
            date_str = '%02d' % int(line_arr[0]) + '/' + '%02d' % num_month + '/' + '2011' + csv_separator
        elif len(line_arr) > 1:
            # No date, but first number less than 31 (number of days in a month)
            date_str = ' '.join(line_arr) + csv_separator
        else:
            # No date, and there is only a number less than 31 (number of days in a month)
            date_str = line_arr[0] + csv_separator
    else:
        # there is not a date (a generic string, or a number higher than 31)
        date_str = ' '.join(line_arr) + csv_separator
    # file_out.write works in both Python 2 and Python 3 (print >> is Python 2 only)
    file_out.write(date_str + csv_separator + 'line_number_' + str(counter) + '\n')

file_in.close()
file_out.close()

You might also find this helpful: http://stackoverflow.com/questions/42950/get-last-day-of-the-month-in-python — Silas Ray, Mar 12 '12 at 19:53
So each line is a record, correct? Can you rely on there not being numeric characters in the text block of the log? — Silas Ray, Mar 12 '12 at 19:58
You should look into the pyparsing (http://pyparsing.wikispaces.com/) module. — Hooked, Mar 12 '12 at 20:46
@sr2222 a line should be an output record for sure only if it starts with a "day month". I recognize a new output record when in the last part of the previous line in input I have a float number (at least one, at most two float numbers). Thanks for the link! — TPPZ, Mar 13 '12 at 20:55

score 2 · Accepted Answer · edited May 23 '17 at 11:48

You could use something like this to reformat the input text. The code most likely could use some clean up based on what is allowable in your input.

list = file_in.readlines()
list2 = []     
string =""
i = 0

while i < len(list):
   ## remove any leading or trailing white space then split on ' '
   line_arr = list[i].strip().split(' ')

You might need to change this part, because it assumes that every record ends in at least one number. Some people also frown upon try/except being used like this. (This check is taken from the question "How do I check if a string is a number (float) in Python?")

   ##check for float at end of line
   try:
      float(line_arr[-1])
   except ValueError:
      ##not a float 
      ##remove new line and add to previous line
      string = string.replace('\n',' ') +  list[i]
   else:
      ##there is a float at the end of current line
      ##add to previous then add record to list2
      string = string.replace('\n',' ') +  list[i]
      list2.append(string)
      string = ""
   i+=1
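
The same merging idea can be sketched for Python 3 as a generator. This is only an illustration: the names `ends_with_number` and `merge_records` are made up here, and the sketch keeps the assumption that a record ends on a line whose last token parses as a float. Unlike the loop above, it also emits any leftover text at the end of the input instead of dropping it.

```python
def ends_with_number(line):
    """Return True if the last whitespace-separated token parses as a float."""
    tokens = line.split()
    if not tokens:
        return False
    try:
        float(tokens[-1])
    except ValueError:
        return False
    return True


def merge_records(lines):
    """Yield one string per logical record, joining continuation lines."""
    buffer = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        buffer.append(line)
        if ends_with_number(line):
            # a trailing number closes the current record
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # leftover text with no closing number: emit it rather than lose it
        yield " ".join(buffer)


sample = """\
20 Sep This is the first record, bla bla bla 10.45
Text unstructured
of the second record bla bla
406.25 10001
6 Oct Text of the third record thatspans on many
lines bla bla bla 60
28 Nov Fourth
record
27.43
Second record of the
day/month BUT the fifth record of the file 500 90.25"""

records = list(merge_records(sample.splitlines()))
for record in records:
    print(record)
```

One caveat: `float()` also accepts words such as "nan" and "inf", so a text line that happens to end with one of those would be treated as the end of a record.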

Adding this to your code produces the following output:

20/09/2011||line_number_1
Text unstructured of the second record bla bla 406.25 10001||line_number_2
06/10/2011||line_number_3
28/11/2011||line_number_4
Second record of the day/month BUT the fifth record of the file 500 90.25||line_number_5

I think this is close to what you are looking for.
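
To get from merged records like these to the DD,MM,text,number1,number2 layout in the question, each record still has to be split into fields. A possible sketch for Python 3 follows; `MONTHS`, `to_number` and `parse_record` are hypothetical names, and it assumes that at most the last two tokens of a record are numbers and that a record without its own date inherits the date of the previous one.

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def to_number(token):
    """Return an int or float for a numeric token, or None otherwise."""
    try:
        value = float(token)
    except ValueError:
        return None
    return int(value) if value.is_integer() else value


def parse_record(record, last_date=(None, None)):
    """Split one merged record into (day, month, text, number1, number2)."""
    tokens = record.split()
    day, month = last_date
    # A leading "DD Mon" pair starts a new date; otherwise reuse the last one.
    if (len(tokens) >= 2 and tokens[0].isdecimal()
            and 1 <= int(tokens[0]) <= 31 and tokens[1] in MONTHS):
        day, month = int(tokens[0]), MONTHS.index(tokens[1]) + 1
        tokens = tokens[2:]
    # Peel up to two numeric tokens off the right-hand end.
    numbers = []
    while tokens and len(numbers) < 2:
        value = to_number(tokens[-1])
        if value is None:
            break
        numbers.insert(0, value)
        tokens.pop()
    numbers += [None] * (2 - len(numbers))
    return (day, month, " ".join(tokens), numbers[0], numbers[1])


merged = [
    "20 Sep This is the first record, bla bla bla 10.45",
    "Text unstructured of the second record bla bla 406.25 10001",
    "28 Nov Fourth record 27.43",
]

rows = []
last_date = (None, None)
for record in merged:
    row = parse_record(record, last_date)
    last_date = row[:2]  # carry the date forward to dateless records
    rows.append(row)
    print(row)
```

If the free text itself can end in a number, this will misread that number as a field, so the rule would need tightening for such input.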

malbani · answered Mar 12 '12 at 21:35 · edited May 23 '17 at 11:48 (Community)

Thanks for your reply; looking at the end of a "possible output record" is leading me to the correct answer. I cannot figure out how to combine `list= file_in.readline()` with `for line in file_in: counter = counter + 1 line_arr = line.split()` using loop indexes like `line_arr[i]`. This is why I think I should use an inner loop (with an index like i++) inside an outer loop. Can you give me more details? – TPPZ Mar 13 '12 at 21:00
Sorry, I am not completely sure what you are asking. If you want to use the code I provided, put it before counter = 0 in your code and then change your for loop to for line in list2: which should give you the output above. – malbani Mar 13 '12 at 21:08
Sorry, I was thinking about inner loops. I then reorganized the code as you suggested, putting your while loop before my counter = 0. Now everything is working. Thanks again! – TPPZ Mar 14 '12 at 19:14
@TPPZ No problem. You could do it with inner loops, but I think it would get ugly. – malbani Mar 14 '12 at 20:23

Bill Bell · Answer 2 · 2012-03-15T23:52:12.887

I believe this is a solution that uses some of the essentials of your approach. When it recognises a date it lops it off the beginning of the line and saves it for subsequent use. Similarly it lops numeric items from the right ends of lines when they are present leaving the unstructured text.

lines = '''\
20 Sep This is the first record, bla bla bla 10.45 
Text unstructured
of the second record bla bla
406.25 10001 
6 Oct Text of the third record thatspans on many 
lines bla bla bla 60 
28 Nov Fourth 
record 
27.43 
Second record of the
day/month BUT the fifth record of the file 500 90.25'''

# Python 3 version: str methods replace the old string.split / string.join helpers
days_in_month = [str(item) for item in range(1, 32)]  # '1' .. '31'
months_in_year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

lines = [line.strip() for line in lines.split('\n') if line]

previous_date = None
previous_month = None
for line in lines:
    item = line.split()
    if len(item) >= 2 and item[0] in days_in_month and item[1] in months_in_year:
        # Date found: remember it and lop it off the front of the line
        previous_date = item.pop(0)
        previous_month = item.pop(0)
    # Lop a number off the right end, if there is one
    try:
        number_2 = float(item[-1])
        item.pop(-1)
    except (IndexError, ValueError):
        number_2 = None
    number_1 = None
    if number_2 is not None:
        # A second number may precede the first one found
        try:
            number_1 = float(item[-1])
            item.pop(-1)
        except (IndexError, ValueError):
            number_1 = None
    if number_1 is None and number_2 is not None:
        # Only one number on the line: report it as number_1
        number_1, number_2 = number_2, None
    # Show whole numbers as ints
    if number_1 is not None and number_1.is_integer():
        number_1 = int(number_1)
    if number_2 is not None and number_2.is_integer():
        number_2 = int(number_2)
    print(previous_date, previous_month, ' '.join(item), number_1, number_2)
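
Since the goal in the question is a CSV line, the final print of a loop like the one above could be swapped for the standard `csv` module, which quotes text fields that contain the delimiter (the first sample record has a comma in it). A small sketch, where `rows_to_csv` and `two_digits` are hypothetical helpers and the rows are assumed to be already split into (day, month, text, number1, number2) tuples:

```python
import csv
import io


def two_digits(value):
    """Format a day or month as two digits, or an empty string if missing."""
    return "" if value is None else "%02d" % value


def rows_to_csv(rows):
    """Return the rows as CSV text; None fields become empty columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for day, month, text, number_1, number_2 in rows:
        writer.writerow([two_digits(day), two_digits(month),
                         text, number_1, number_2])
    return out.getvalue()


rows = [
    (20, 9, "This is the first record, bla bla bla", 10.45, None),
    (20, 9, "Text unstructured of the second record bla bla", 406.25, 10001),
]

print(rows_to_csv(rows), end="")
# 20,09,"This is the first record, bla bla bla",10.45,
# 20,09,Text unstructured of the second record bla bla,406.25,10001
```

To write to a real file instead of a string, the same `csv.writer` can wrap a file opened with `newline=""`.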

thanks for the code, but I am a bit confused. I unsuccessfully tried to reformat it properly under the Python rules! — TPPZ, Mar 13 '12 at 23:07
I think I have just, finally learned how to put formatted code into an answer. Sorry it was so confusing. — Bill Bell, Mar 15 '12 at 23:53
