1

I have a text file in the following format.

0TYPE  DURATION    ( 1)      ( 2)      ( 3)      ( 4)      ( 5)      ( 6)      ( 7)      ( 8)      ( 9)      (10)
 ----  --------
 A     001-010     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00   
       011-020     0.00      0.76      0.78      0.80      0.82      0.84      0.87      0.89      0.92      0.94   
       021-030     0.97      0.99      1.02     
0TYPE  DURATION    ( 1)      ( 2)      ( 3)      ( 4)      ( 5)      ( 6)      ( 7)      ( 8)      ( 9)      (10)
 ----  --------
 B     001-010     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
       011-020     0.00      3.65      3.66  

I want to unpivot it into the following format

TYPE DURATION   RATE
A       1       0.00
A       2       0.00
A       3       0.00
...
A       23      1.02
B       1       0.00
B       2       0.00
B       3       0.00
...
B       13      3.66

For each type, unpivot the duration and rate. I tried to read and split the rate into a list, and print it with a loop. But the index will reset with each line.

How do I retain the index without knowing the number of rates for each type? Any help is appreciated!

#look for pattern 001-010
pattern = r'\d{3}-\d{3}'
if re.findall(pattern, line):
    print_list = filter(None, re.split(r'\s*', line.replace("\x00","")))  #delimiters - unicode null, space and new line
    print_str = record_type + '|' + plan_code+ '|' + issue_age+ '|' + type


    for index, value in enumerate(print_list[2:]):
        print print_str + '|' + str(index+1) + '|' + value + '\n'
qshng
  • 887
  • 1
  • 13
  • 32

1 Answers1

1

re doesn't strike me as the best solution for this type of problem. I'd just write a quick logical parser...

with open (outfile, 'w') as f_out:
    f_out.write("TYPE\tDURATION\tRATE\n") #excel likes tab delimination best, everything else tends towards comma separation.
    with open(infile) as f: 
        f_iter = iter(f) #make the iterator ahead of time so we can advance manually
        for line in f_iter: #line by line
            list_items = line.strip().split() #clean up string and split into list on whitespace
            if list_items[0] == "0TYPE": #start a new 0TYPE category
                try: #I always keep calls to next() inside a try block to catch EOF this might happen if you had an empty section
                    next(f_iter) #skip "---- ------"
                    list_items = next(f_iter).strip().split() #get next line
                except StopIteration:
                    break
                type_str = list_items.pop(0) #inefficient but simple and understandable
                duration = 1 #reset duration
            #from here list_items should always be like "xxx-xxx  x.xx   x.xx  x.xx  x.xx ..."
            list_items.pop(0) #drop 001-010, 011-020, etc.
            for item in list_items:
                f_out.write("{}\t{}\t{}\n".format(type_str, duration, item)) #tab delimination can be changed to comma or spaces
                duration += 1
Aaron
  • 10,133
  • 1
  • 24
  • 40
  • Side reading on people debating the use cases of regex: https://stackoverflow.com/questions/764247/why-are-regular-expressions-so-controversial – Aaron Dec 12 '17 at 15:32
  • Thank you Aaron. I'm not familiar with the use of iterator. This is very helpful! – qshng Dec 12 '17 at 20:43