I have a text like this:

1
<src> he is a [man]</src>
<tgt>lui è un [uomo]</tgt>
2
<src> she is a [woman]</src>
<tgt>lei è una donna</tgt>
3
<src> he works well</src>
<tgt> lui lavora [bene]</tgt>

I want to detect the strings between the brackets only when brackets are present in both the src line and the tgt line of the same entry. In the text above, I want to detect only [man] [uomo], because the src line contains [man] and the tgt line contains [uomo]. Can someone help me?

I tried this code:

line = str()
num = str()
line1 = str()
num1 = str()

for i, line in enumerate(file):
    lines = iter(filer1)
    if line.startswith("<src>"):
        line += '%s\n' % line.strip()
        num += '%s\n' % filer1[i-1]
    if line.startswith("<tgt>"):
        line1 += '%s\n' % line.strip()
        num1 += '%s\n' % filer1[i-2]
for l in line.splitlines():
      for ll in line1.splitlines():
          for n in num.splitlines():
              for nn in num1.splitlines():
                   if n ==nn:
                      m = re.findall(r"\[(.*?)\]",l)
                      mm = re.findall(r"\[(.*?)\]",ll)
                      if m and mm:
                            print '[{}]'.format(m[0]), '[{}]'.format(mm[0])
Adam Smith
sss

2 Answers

1

Basically, what you should do is: first, clean up your text input so that you have a list of lists, where each sublist contains a src line and a tgt line. Then, loop over the pairs of lines, and use re to test for the presence of text within square brackets in both src and tgt. If both src and tgt have bracketed text, display them; otherwise, don't.

This should be pretty straightforward, and should look something like the below:

import re

# see <http://stackoverflow.com/a/312464/1535629>
def chunks(l, n):
    # range works on both Python 2 and 3; xrange exists only on Python 2
    for i in range(0, len(l), n):
        yield l[i:i+n]

text = '''1
<src> he is a [man]</src>
<tgt>lui è un [uomo]</tgt>
2
<src> she is a [woman]</src>
<tgt>lei è una donna</tgt>
3
<src> he works well</src>
<tgt> lui lavora [bene]</tgt>'''
lines = text.split('\n')
linepairs = [chunk[1:] for chunk in chunks(lines, 3)]

regex = re.compile(r'\[\w*\]')
for src, tgt in linepairs:
    # call search on the compiled pattern directly
    src_match = regex.search(src)
    tgt_match = regex.search(tgt)
    if src_match and tgt_match:
        print(src_match.group(), tgt_match.group())

Result:

[man] [uomo]
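
If the file does not always follow the strict three-line layout (for example, if it contains blank lines), a variant that pairs each src line with the next tgt line may be more forgiving. This is only a sketch in Python 3: the name `bracket_pairs` is hypothetical, and it assumes every line of interest starts with `<src>` or `<tgt>` once leading whitespace is stripped. It uses `findall`, so it returns every bracketed string on a line rather than just the first.

```python
import re

# Non-greedy match of whatever sits between one pair of square brackets.
BRACKETED = re.compile(r"\[(.*?)\]")

def bracket_pairs(lines):
    """Yield (src_words, tgt_words) when both lines of a pair have brackets."""
    src = None
    for line in lines:
        line = line.strip()
        if line.startswith("<src>"):
            src = line  # remember the src line until its tgt line shows up
        elif line.startswith("<tgt>") and src is not None:
            src_words = BRACKETED.findall(src)
            tgt_words = BRACKETED.findall(line)
            if src_words and tgt_words:
                yield src_words, tgt_words
            src = None  # this pair is finished

text = """1
<src> he is a [man]</src>
<tgt>lui è un [uomo]</tgt>

2
<src> she is a [woman]</src>
<tgt>lei è una donna</tgt>
3
<src> he works well</src>
<tgt> lui lavora [bene]</tgt>"""

for src_words, tgt_words in bracket_pairs(text.splitlines()):
    print(src_words, tgt_words)
```

Numbering lines and blank lines are simply skipped, so the grouping no longer depends on counting lines.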
senshin
0

Assuming that your file strictly follows the three-line pattern (number, src line, tgt line), you could do:

# assumes Python 2.7
from itertools import izip_longest
import re

INPUT = "translations.txt"

# from itertools documentation
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# raw string so the backslashes reach the regex engine unchanged
in_brackets = re.compile(r"\[(.*?)\]").search

def main():
    with open(INPUT) as inf:
        for num,en,it in grouper(inf, 3, ""):
            en = in_brackets(en)
            it = in_brackets(it)
            if en and it:
                print("[{}] -> [{}]".format(en.group(1), it.group(1)))

main()
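
On Python 3, `izip_longest` was renamed to `zip_longest`. A rough equivalent of the code above might look like the following. To keep it self-contained it reads from a list of lines instead of opening `translations.txt`, and the helper name `matching_pairs` is an assumption made for illustration, not part of the original answer.

```python
import re
from itertools import zip_longest

# grouper recipe from the itertools documentation
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

in_brackets = re.compile(r"\[(.*?)\]").search

def matching_pairs(lines):
    """Return (src_word, tgt_word) for each 3-line group with brackets in both."""
    pairs = []
    for num, en, it in grouper(lines, 3, ""):
        en_match = in_brackets(en)
        it_match = in_brackets(it)
        if en_match and it_match:
            pairs.append((en_match.group(1), it_match.group(1)))
    return pairs

sample = [
    "1",
    "<src> he is a [man]</src>",
    "<tgt>lui è un [uomo]</tgt>",
    "2",
    "<src> she is a [woman]</src>",
    "<tgt>lei è una donna</tgt>",
    "3",
    "<src> he works well</src>",
    "<tgt> lui lavora [bene]</tgt>",
]

for en, it in matching_pairs(sample):
    print("[{}] -> [{}]".format(en, it))
```

An open file object can be passed in place of `sample`, since `grouper` accepts any iterable of lines.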
Hugh Bothwell