python - getting some special characters as a list from a complex text file

Question

I have such a string:

    <?xml version="1.0" encoding="UTF-8" ?>
    <tmx version="1.4">
    <header creationdate="Mon Jan  4 11:56:26 2016"
              srclang="en"
              adminlang="en"
              o-tmf="unknown"
              segtype="sentence"
              creationtool="Uplug"
              creationtoolversion="unknown"
              datatype="PlainText" />
      <body>
        <tu>
          <tuv xml:lang="en"><seg>Ah, this is greasy.</seg></tuv>
          <tuv xml:lang="tr"><seg>Yemek çok yağlıymış.</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>I want to eat kimchee.</seg></tuv>
          <tuv xml:lang="tr"><seg>Şimdi biraz kimchi yiyebilirim.</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>Is Chae Yoon's coordinator in here?</seg></tuv>
          <tuv xml:lang="tr"><seg>Yune'nin stilisti, içeride misin?</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>Excuse me, aren't you Chae Yoon's coordinator? Yes. Me?</seg></tuv>
          <tuv xml:lang="tr"><seg>Sen Yune'nin stilisti değil misin?</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>-Chae Yoon is done singing.</seg></tuv>
          <tuv xml:lang="tr"><seg>- Ben mi? - Yune şarkısını bitirdi.</seg></tuv>
        </tu>
..............................................................................

I want to collect the sentences between <seg>...</seg> into a nested list, one inner list per <tu> block, like this:

[['sentence1', 'sentence2'], ['sentence3', 'sentence4']]

How can I manage that?

Does the string have newlines as well, or did you just format the post above for readability? — Saif Asif, Aug 23 '16 at 11:31
This is actually a .tmx file. If it is possible to parse it as XML, could you give me a hint? Thanks :) — yusuf, Aug 23 '16 at 11:40

score 1 · Accepted Answer · answered Aug 23 '16 at 11:41

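If you want to go with a pure regex approach, you can use re.findall to get all matches.

It is not a perfect approach, because it relies on each pair of <tuv> elements sitting on two consecutive lines, but something like this works for the data shown: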

import re
regex = r'<tuv.*<seg>(.*)</seg>.*\n.*<seg>(.*)</seg></tuv>'

input_string = """
<?xml version="1.0" encoding="UTF-8" ?>
    <tmx version="1.4">
    <header creationdate="Mon Jan  4 11:56:26 2016"
              srclang="en"
              adminlang="en"
              o-tmf="unknown"
              segtype="sentence"
              creationtool="Uplug"
              creationtoolversion="unknown"
              datatype="PlainText" />
      <body>
        <tu>
          <tuv xml:lang="en"><seg>Ah, this is greasy.</seg></tuv>
          <tuv xml:lang="tr"><seg>Yemek çok yağlıymış.</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>I want to eat kimchee.</seg></tuv>
          <tuv xml:lang="tr"><seg>Şimdi biraz kimchi yiyebilirim.</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>Is Chae Yoon's coordinator in here?</seg></tuv>
          <tuv xml:lang="tr"><seg>Yune'nin stilisti, içeride misin?</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>Excuse me, aren't you Chae Yoon's coordinator? Yes. Me?</seg></tuv>
          <tuv xml:lang="tr"><seg>Sen Yune'nin stilisti değil misin?</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>-Chae Yoon is done singing.</seg></tuv>
          <tuv xml:lang="tr"><seg>- Ben mi? - Yune şarkısını bitirdi.</seg></tuv>
        </tu>
"""

def main():
    y = []
    for i_tuple in re.findall(regex, input_string):
        # just for the sake that you need a list, otherwise re.findall
        # already returns a list of tuples
        y.append(list(i_tuple))
    print(y)

if __name__ == '__main__':
    main()

Prints out the following on my end

[['Ah, this is greasy.', 'Yemek çok yağlıymış.'], ['I want to eat kimchee.', 'Şimdi biraz kimchi yiyebilirim.'], ["Is Chae Yoon's coordinator in here?", "Yune'nin stilisti, içeride misin?"], ["Excuse me, aren't you Chae Yoon's coordinator? Yes. Me?", "Sen Yune'nin stilisti değil misin?"], ['-Chae Yoon is done singing.', '- Ben mi? - Yune şarkısını bitirdi.']]
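
A variation on the same idea, offered only as a sketch: match each <tu>...</tu> block first, then collect the <seg> texts inside it. This drops the dependence on line breaks and on there being exactly two <tuv> entries per unit. The names TU_RE, SEG_RE, extract_pairs and the shortened sample are made up for illustration, and the pattern still assumes plain <tu> tags without attributes and no nested markup inside <seg>.

```python
import re

# Hypothetical, shortened sample in the same shape as the question's data.
sample = """
<tu>
  <tuv xml:lang="en"><seg>Ah, this is greasy.</seg></tuv>
  <tuv xml:lang="tr"><seg>Yemek çok yağlıymış.</seg></tuv>
</tu>
<tu>
  <tuv xml:lang="en"><seg>I want to eat kimchee.</seg></tuv>
  <tuv xml:lang="tr"><seg>Şimdi biraz kimchi yiyebilirim.</seg></tuv>
</tu>
"""

# Non-greedy matches; DOTALL lets "." span the newlines inside a block.
# "<tu>" cannot match "<tuv ...>" because the ">" must follow "tu" directly.
TU_RE = re.compile(r"<tu>(.*?)</tu>", re.DOTALL)
SEG_RE = re.compile(r"<seg>(.*?)</seg>", re.DOTALL)


def extract_pairs(text):
    """Return one list of <seg> texts per <tu> block."""
    return [SEG_RE.findall(tu_body) for tu_body in TU_RE.findall(text)]


pairs = extract_pairs(sample)
print(pairs)
```

Because re.findall already returns lists here, no tuple-to-list conversion is needed.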

score 1 · Answer 2 · edited May 23 '17 at 12:22

I've quite enjoyed using Beautiful Soup for tasks like this in the past, although I've only worked with HTML. Apparently it also handles XML quite well.

Specifically, you probably want to look at things like .find_all. The most important thing to realise if you want to hit the ground running (other than how nice the documentation is) is that find_all returns a list of tag objects, and you can call find_all again on each of those tags. So you can do something like:

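from bs4 import BeautifulSoup

# `text` holds the file contents. "html.parser" ships with Python;
# with lxml installed, "xml" is the stricter choice.
soup = BeautifulSoup(text, "html.parser")
retval = []
for tu in soup.find_all('tu'):
    # one inner list per translation unit
    inner = []
    for tuv in tu.find_all('tuv'):
        # each <tuv> wraps a single <seg>
        inner.append(tuv.seg.get_text())
    retval.append(inner)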

The docstrings in this module are also quite good so dir(object) and help(object), help(object.function) etc are, as always, your friends here.

I'll admit that I've tried to parse HTML with regex in the past (distant, but not distant enough that I don't still get bad dreams sometimes). As mentioned in the first answer here, it is a really bad idea. I don't know whether using regex on XML is less likely to "extinguish the voices of mortal man from the sphere" or not, but do you really want to take that risk?
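
If installing a third-party package is not an option, the standard library's xml.etree.ElementTree can do the same job. The following is a minimal sketch, not a tested solution for the real file: the sample is a made-up, complete miniature of the question's data (the excerpt in the question is truncated and would not parse as-is), and segments_by_unit is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed miniature of the question's TMX data.
sample = """<?xml version="1.0" encoding="UTF-8" ?>
<tmx version="1.4">
  <header srclang="en" datatype="PlainText" />
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Ah, this is greasy.</seg></tuv>
      <tuv xml:lang="tr"><seg>Yemek çok yağlıymış.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>I want to eat kimchee.</seg></tuv>
      <tuv xml:lang="tr"><seg>Şimdi biraz kimchi yiyebilirim.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def segments_by_unit(xml_bytes):
    """Parse the whole document and return one list of <seg> texts per <tu>."""
    root = ET.fromstring(xml_bytes)
    return [
        [seg.text or "" for seg in tu.iter("seg")]  # empty <seg/> becomes ""
        for tu in root.iter("tu")
    ]


# Passing bytes lets the parser honour the encoding named in the declaration.
result = segments_by_unit(sample.encode("utf-8"))
print(result)
```

Note that the XML declaration must be the very first thing in the input, so a string that starts with a blank line (as in the triple-quoted examples in the other answers) would need to be stripped first.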

score 1 · Answer 3 · answered Aug 23 '16 at 12:24

Another possible approach for finding the sentences could be

s = """
<?xml version="1.0" encoding="UTF-8" ?>
    <tmx version="1.4">
    <header creationdate="Mon Jan  4 11:56:26 2016"
              srclang="en"
              adminlang="en"
              o-tmf="unknown"
              segtype="sentence"
              creationtool="Uplug"
              creationtoolversion="unknown"
              datatype="PlainText" />
      <body>
        <tu>
          <tuv xml:lang="en"><seg>Ah, this is greasy.</seg></tuv>
          <tuv xml:lang="tr"><seg>Yemek çok yağlıymış.</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>I want to eat kimchee.</seg></tuv>
          <tuv xml:lang="tr"><seg>Şimdi biraz kimchi yiyebilirim.</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>Is Chae Yoon's coordinator in here?</seg></tuv>
          <tuv xml:lang="tr"><seg>Yune'nin stilisti, içeride misin?</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>Excuse me, aren't you Chae Yoon's coordinator? Yes. Me?</seg></tuv>
          <tuv xml:lang="tr"><seg>Sen Yune'nin stilisti değil misin?</seg></tuv>
        </tu>
        <tu>
          <tuv xml:lang="en"><seg>-Chae Yoon is done singing.</seg></tuv>
          <tuv xml:lang="tr"><seg>- Ben mi? - Yune şarkısını bitirdi.</seg></tuv>
        </tu>
"""

first = "<seg>"
last = "</seg>"
while first in s:
  start = s.index( first ) + len( first )
  end = s.index( last, start )
  print(s[start:end])
  s = s[end:]

Prints:

Ah, this is greasy.
Yemek çok yağlıymış.
I want to eat kimchee.
Şimdi biraz kimchi yiyebilirim.
Is Chae Yoon's coordinator in here?
Yune'nin stilisti, içeride misin?
Excuse me, aren't you Chae Yoon's coordinator? Yes. Me?
Sen Yune'nin stilisti değil misin?
-Chae Yoon is done singing.
- Ben mi? - Yune şarkısını bitirdi.

LukeBowl, thank you for your answer. And it works! But there is another problem now. The file is around 5 GB. Is there any fast way to do the same? Thanks, — yusuf, Aug 23 '16 at 12:32
@yusuf The efficiency of Python string searching with splitting is discussed at http://stackoverflow.com/questions/6963236/python-string-search-efficiency (link). 5 GB is a relatively large file, and the best approach depends on where the bottlenecks are. — LukeBowl, Aug 23 '16 at 12:37
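
For a file of that size, one option worth trying is to stream it rather than load it into a single string. The sketch below uses xml.etree.ElementTree.iterparse from the standard library and yields one list of sentences per <tu> as the parser reaches each closing tag. It has not been run against a 5 GB file; iter_translation_units and the in-memory sample are made up for illustration, and with a real file you would pass the path (or an open binary file) instead of the BytesIO object.

```python
import io
import xml.etree.ElementTree as ET


def iter_translation_units(source):
    """Yield one list of <seg> texts per <tu>, reading `source` incrementally.

    `source` may be a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            yield [seg.text or "" for seg in elem.iter("seg")]
            # Release the children of this <tu> now that they are processed.
            elem.clear()


# Hypothetical miniature document standing in for the large .tmx file.
sample = """<?xml version="1.0" encoding="UTF-8" ?>
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Ah, this is greasy.</seg></tuv>
      <tuv xml:lang="tr"><seg>Yemek çok yağlıymış.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>-Chae Yoon is done singing.</seg></tuv>
      <tuv xml:lang="tr"><seg>- Ben mi? - Yune şarkısını bitirdi.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

units = list(iter_translation_units(io.BytesIO(sample.encode("utf-8"))))
print(units)
```

In real use you would probably consume the generator in a loop and write each pair out as you go, since collecting 5 GB worth of sentences into one list brings the memory problem back. Also note that elem.clear() empties each <tu> but leaves the emptied element attached to its parent, so memory use still grows slowly with the number of units.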
