
I have a file like this, containing sentences delimited by BOS (beginning of sentence) and EOS (end of sentence) markers:

BOS 1
1 word \t\t word \t word \t\t word \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t word \t 567
EOS 1

BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t word \t 789
EOS 2

And a second file, where the first number on each line is the sentence number:

1, 123, 567
2, 789

I want to read both files and check whether the number at the end of each line of the first file occurs in the second file under the same sentence number. If it does, I want to change only the fourth word of that line in the first file. The expected output is:

BOS 1
1 word \t\t word \t word \t\t NEW_WORD \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t NEW_WORD \t 567
EOS 1

BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t NEW_WORD \t 789
EOS 2

First, I'm not sure how to read the two files together, because they have different numbers of lines. Second, I don't know how to iterate over the lines of, say, the first sentence in the first file while also iterating over the values in the first line of the second file to compare them. This is what I have so far:

def readText(filename1, filename2):
  data1 = open(filename1).readlines()   # the first file

  data2 = open(filename2).readlines() # the second one

  list2 = [] # a list to store the values of the second file

  for line1, line2 in itertools.izip(data1, data2):
    l1 = line1.split()

    l2 = line2.split(', ')

    find = re.findall(r'.*word\t\d\d\d', line1) # find the fourth word in a line, followed by a number

    for l in l2:
      list2.append(l)

    for match in find:
      m = match.split() # split the lines of the first file

      if (m[0] == list2[0]): # for the same sentence number in the two files 
        result = re.sub(r'(.*)word\t%s' %m[5], r'\1NEW_WORD\t%s' %m[5],line1) 

if len(sys.argv)==3: 
  lines = readText(sys.argv[1], sys.argv[2])
else:
  print("file.py inputfile1 inputfile2")

Thanks in advance for any help!

Peter Wood
isa

1 Answer


For reference, I call the first file source.txt, the second file control.txt, and the output result.txt.
Here's the skeleton of the program:

[modify_line(line) if line[0].isdigit() else line for line in source]

This expression passes each line through either intact or modified. A line that starts with a digit is handed to modify_line, which returns either a modified copy or the original line, depending on the line itself and on the data it reads from control.txt.
modify_line needs data from control.txt to check and alter each line it receives: a sentence number and its ending numbers, say, [1, (123, 567)]. If the sentence number matches and one of the ending numbers matches, the line is altered. If the sentence number doesn't match, the next entry is read from the control file; this works because modify_line only ever receives lines that begin with a number.
To keep that state between calls, I used a closure.
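As a minimal sketch of the closure idea in isolation (the names make_cursor and advance are hypothetical and do not appear in the program): the inner function keeps its state in a one-element list in the enclosing scope, so it can update that state without `nonlocal`, which Python 2 lacks.

```python
def make_cursor(items):
    """Return a function that remembers its position in items between calls."""
    # A one-element list, so the inner function can mutate it in place
    # instead of rebinding a name in the enclosing scope.
    state = [0]

    def advance():
        value = items[state[0]]
        state[0] += 1
        return value

    return advance


next_item = make_cursor(['1', '2'])
first = next_item()
second = next_item()
```

The full program follows the same pattern, with the current control-file entry held in a list.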

import re

def create_line_modification_function(fp, replacement_word):

    def get_line_number_and_end_numbers():
        for line in fp:
            if line.strip():
                line_number, rest = line.split(',', 1)
                line_number = line_number.strip()
                ends = [end.strip() for end in rest.split(',')]
                yield line_number, ends

    generate_line_numbers_and_ends = get_line_number_and_end_numbers()
    # modify_line needs to change this. So this is in a list
    line_number_and_ends = list(next(generate_line_numbers_and_ends, (None, None)))
    # for safety check if we run out of line numbers in the control file
    if line_number_and_ends[0] is None:
        raise ValueError('{} reached EOF'.format(fp.name))
    # for optimization compile once here
    pattern = re.compile(r'(.*)word(.*\d{3}$)')


    def modify_line(line):
        while True:
            # for convenience unpack the list 
            line_number, ends = line_number_and_ends
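            # compare whole fields, so that sentence 1 does not also match
            # lines of sentence 10, nor ending 123 a line ending in 1123
            fields = line.split()
            if fields[0] == line_number:
                if fields[-1] in ends:
                    # \g<1> stays unambiguous even if the replacement starts with a digit
                    return pattern.sub(r'\g<1>{}\g<2>'.format(replacement_word), line)
                return line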
            # If we are here the line numbers from control.txt and source.txt don't match.
            # So we have to read next line from control file
            line_number_and_ends[0], line_number_and_ends[1]  = next(generate_line_numbers_and_ends, (None, None))
            if line_number_and_ends[0] is None:
                raise ValueError('{} reached EOF'.format(fp.name))

    return modify_line

if __name__ == '__main__':

    with open('source.txt') as source, open('control.txt') as ctl, open('result.txt', 'w') as target:
        modify_line = create_line_modification_function(ctl, 'NEW_WORD')
        target.writelines(modify_line(line) if line[0].isdigit() else line for line in source)
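If control.txt is small enough to hold in memory, a simpler variant is possible. The following is only a sketch under assumptions, not the approach above: it loads the control file into a dict of sets, and it assumes the word to replace is always the fourth whitespace-separated word after the sentence number. The names load_control, rewrite and FOURTH_WORD are hypothetical.

```python
import re


def load_control(lines):
    """Map each sentence number to the set of ending numbers to act on."""
    control = {}
    for line in lines:
        fields = [field.strip() for field in line.split(',')]
        if fields[0]:
            control.setdefault(fields[0], set()).update(f for f in fields[1:] if f)
    return control


# Assumption: sentence number, three words, then the word to replace.
# Group 1 captures everything before the fourth word, whitespace included.
FOURTH_WORD = re.compile(r'^(\S+(?:\s+\S+){3}\s+)\S+')


def rewrite(lines, control, replacement='NEW_WORD'):
    """Yield each line, replacing the fourth word where control says so."""
    for line in lines:
        tokens = line.split()
        # Only lines that start with a sentence number are candidates;
        # BOS/EOS markers and blank lines pass through unchanged.
        if (len(tokens) > 1 and tokens[0].isdigit()
                and tokens[-1] in control.get(tokens[0], ())):
            line = FOURTH_WORD.sub(lambda m: m.group(1) + replacement, line)
        yield line
```

With the three files opened as in the program above, the call would be target.writelines(rewrite(source, load_control(ctl))). Unlike the closure version, this does not depend on the sentences appearing in the same order in both files, and a sentence with no entry in control.txt is simply left unchanged.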
Nizam Mohamed