I have a file like this, containing sentences delimited by BOS (Begin Of Sentence) and EOS (End Of Sentence) markers:
BOS 1
1 word \t\t word \t word \t\t word \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t word \t 567
EOS 1
BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t word \t 789
EOS 2
And a second file, where the first number shows the sentence number:
1, 123, 567
2, 789
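A file in this shape maps naturally onto a dictionary from sentence number to the set of values listed for it. The following is only a minimal sketch, assuming the file is comma-separated exactly as shown above; the function name `load_targets` is hypothetical, and all values are kept as strings so they can be compared directly with text read from the first file.

```python
def load_targets(lines):
    """Map each sentence number to the set of values listed for it."""
    targets = {}
    for line in lines:
        # "1, 123, 567\n" -> ["1", "123", "567"]
        fields = [f.strip() for f in line.split(",")]
        if not fields[0]:
            continue  # skip blank lines
        sentence, values = fields[0], fields[1:]
        # setdefault also copes with a sentence number that appears twice
        targets.setdefault(sentence, set()).update(v for v in values if v)
    return targets


# The two sample lines from above; with a real file one could pass the
# open file object instead of a list.
sample = ["1, 123, 567\n", "2, 789\n"]
targets = load_targets(sample)
print(targets)
```

Because the whole second file fits into this dictionary, it no longer matters that the two files have a different number of lines.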
What I want is to read both files and check whether the number at the end of each line in the first file occurs in the second file under the same sentence number. If so, I want to change only the fourth word in that line of the first file. So, the expected output is:
BOS 1
1 word \t\t word \t word \t\t NEW_WORD \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t NEW_WORD \t 567
EOS 1
BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t NEW_WORD \t 789
EOS 2
First of all, I'm not sure how to read the two files, because they have a different number of lines. I also don't know how to iterate over the lines of, e.g., the first sentence in the first file while at the same time iterating over the values in the first line of the second file to compare them. This is what I have so far:
import itertools
import re
import sys

def readText(filename1, filename2):
    data1 = open(filename1).readlines() # the first file
    data2 = open(filename2).readlines() # the second one
    list2 = [] # a list to store the values of the second file
    # note: itertools.izip exists in Python 2 only
    for line1, line2 in itertools.izip(data1, data2):
        l1 = line1.split()
        l2 = line2.split(', ')
        find = re.findall(r'.*word\t\d\d\d', line1) # find the fourth word in a line, followed by a number
        for l in l2:
            list2.append(l)
        for match in find:
            m = match.split() # split the lines of the first file
            if (m[0] == list2[0]): # for the same sentence number in the two files
                result = re.sub(r'(.*)word\t%s' % m[5], r'\1NEW_WORD\t%s' % m[5], line1)

if len(sys.argv) == 3:
    lines = readText(sys.argv[1], sys.argv[2])
else:
    print("file.py inputfile1 inputfile2")
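Pairing the two files with `itertools.izip` stops at the end of the shorter file and matches lines that have nothing to do with each other. One possible alternative, sketched here under assumptions, is to loop over the first file only and look each line up in a table built from the second file. The sketch assumes that table is already available as a dictionary mapping sentence number to a set of values (called `targets` here), that data lines do not start with whitespace, and that every data line has six whitespace-separated fields. The names `mark_lines` and `new_word` are hypothetical.

```python
import re


def mark_lines(lines, targets, new_word="NEW_WORD"):
    """Return copies of `lines` with the fourth word replaced wherever the
    trailing number is listed for that sentence in `targets`."""
    result = []
    for line in lines:
        # Splitting with a capturing group keeps the whitespace runs, so the
        # original tabs and spaces survive when the parts are joined again.
        parts = re.split(r"(\s+)", line)
        fields = parts[0::2]  # the non-whitespace tokens
        # Data lines: sentence number, four words, number. BOS/EOS lines
        # have fewer fields and pass through unchanged.
        if len(fields) >= 6 and fields[5] in targets.get(fields[0], ()):
            parts[8] = new_word  # fields[4], the fourth word, is parts[8]
        result.append("".join(parts))
    return result


# Assumed lookup table for the sample second file shown earlier.
targets = {"1": {"123", "567"}, "2": {"789"}}
first_file = [
    "BOS 1\n",
    "1 word \t\t word \t word \t\t word \t 123\n",
    "1 word \t\t word \t word \t\t word \t 234\n",
    "EOS 1\n",
    "BOS 2\n",
    "2 word \t\t word \t word \t\t word \t 789\n",
    "EOS 2\n",
]
output = mark_lines(first_file, targets)
print("".join(output))
```

Replacing the field by position rather than with `re.sub` avoids accidentally touching an earlier word in the line that happens to have the same text.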
Thanks in advance for any help!