TMX (Translation Memory eXchange) files in Python

Question

Is there a module for handling TMX (Translation Memory eXchange) files in Python? If not, what would be another way to do it?

As it stands, I have a giant 2 GB file with French-English subtitles. Is it even possible to handle such a file, or would I have to break it down?

Of course, TMX is simply XML. 2 GB would give roughly 1 GB of data, so instead of a French-English map, importing everything into a database would make sense. – Joop Eggen, Jul 09 '19 at 10:25
Here is the Complete Solution: https://softans.com/question/tmxtranslation-memory-exchange-files-in-python/#comment-514 — GHULAM NABI, May 19 '23 at 07:16

Anwarvic · Answer 1 · 2021-03-17T09:12:18.667

7

As @hurrial said, you can use translate-toolkit.

Install

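The toolkit can be installed with pip. Its TMX module imports `lxml`, so if the import later fails with `No module named 'lxml'`, run `pip install lxml` as well. To install the toolkit, run: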

pip install translate-toolkit

Usage

Assume that you have the following simple sample.tmx file:

<tmx version="1.4">
  <header
    creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en"
    o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>Hello world!</seg>
      </tuv>
      <tuv xml:lang="ar">
        <seg>اهلا بالعالم!</seg>
      </tuv>
    </tu>
  </body>
</tmx>

You can parse this simple file like so:

>>> from translate.storage.tmx import tmxfile
>>>
>>> with open("sample.tmx", 'rb') as fin:
...     tmx_file = tmxfile(fin, 'en', 'ar')
>>>
>>> for node in tmx_file.unit_iter():
...     print(node.source, node.target)
Hello world! اهلا بالعالم!

For more information, see the official Translate Toolkit documentation.
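For a file as large as the 2 GB one in the question, a streaming parser may be a safer choice than loading the whole document at once. The following is a minimal sketch using only the standard library's `xml.etree.ElementTree.iterparse`; it is not part of translate-toolkit. The helper name `iter_tmx_pairs` is made up for illustration, and the sketch assumes the TMX tags carry no XML namespace, as in the sample above.

```python
import xml.etree.ElementTree as ET

# Name under which ElementTree reports the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_pairs(path, src_lang, tgt_lang):
    """Yield (source, target) text pairs from a TMX file, one <tu> at a time.

    Hypothetical helper: assumes un-namespaced <body>, <tu>, <tuv>, <seg> tags.
    """
    body = None
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start":
            if elem.tag == "body":
                body = elem  # remember the parent of all <tu> elements
            continue
        if elem.tag != "tu":
            continue  # wait until a whole translation unit has been read
        segments = {}
        for tuv in elem.iter("tuv"):
            # Fall back to a plain lang attribute if xml:lang is absent.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext())
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        # Drop units that were already handled so they can be garbage-collected.
        if body is not None:
            body.clear()
        else:
            elem.clear()
```

With the sample file above, `for en, ar in iter_tmx_pairs("sample.tmx", "en", "ar"): print(en, ar)` should print the same pair as the translate-toolkit example. Units that lack either language are skipped.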

edited Mar 17 '21 at 09:12

answered Jul 09 '19 at 10:19

Anwarvic


Not working here with `python3`: `from translate.storage.tmx import tmxfile` fails at `from lxml import etree` (line 22 of `/home/souto/.local/share/virtualenvs/folder-WrAyGpIU/lib/python3.5/site-packages/translate/storage/tmx.py`) with `ImportError: No module named 'lxml'`. – msoutopico Aug 30 '19 at 16:13
Use `pip install lxml` to install the `lxml` library. – Anwarvic Aug 30 '19 at 16:17
Getting the error: 'tmxunit' object has no attribute 'getsource' – Aditya Landge Mar 17 '21 at 06:41
@AdityaLandge, apparently they changed the API. Anyway, I've updated my answer to use `node.source` instead of `node.getsource()` and `node.target` instead of `node.gettarget()`. – Anwarvic Mar 17 '21 at 09:13

score 2 · Answer 2 · edited Nov 11 '16 at 08:03

You may check the following links:

pretranslate: http://translate-toolkit.readthedocs.org/en/latest/commands/pretranslate.html
Translate toolkit: http://en.wikipedia.org/wiki/Translate_Toolkit
Translate toolkit package: https://pypi.python.org/pypi/translate-toolkit
Translate API: https://github.com/translate/translate

Cheers,

edited Nov 11 '16 at 08:03

Roman Imankulov


answered Sep 11 '14 at 09:05

hurrial


score 0 · Answer 3 · answered May 11 '23 at 00:25

Here's a script that converts a TMX file to a pandas DataFrame:

import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup

def tmx2df(tmxfile):
    # Pick your poison for parsing XML.
    # Read bytes so BeautifulSoup can detect the file's encoding itself.
    with open(tmxfile, 'rb') as fin:
        bsoup = BeautifulSoup(fin, 'lxml')

    # Actual TMX extraction.
    lol = []  # Keep a list of the rows to populate.
    for tu in tqdm(bsoup.find_all('tu')):
        # Parse metadata from tu.
        metadata = tu.attrs
        # Parse prop.
        properties = {prop.attrs['type']: prop.text for prop in tu.find_all('prop')}
        # Parse seg.
        segments = {}
        # The order of the languages might not be consistent,
        # so collect them in a dict keyed by language first.
        for tuv in tu.find_all('tuv'):
            segment = ' '.join(seg.text for seg in tuv.find_all('seg'))
            segments[tuv.attrs['xml:lang']] = segment
        lol.append({'metadata': metadata, 'properties': properties, 'segments': segments})

    # Put the list of rows into a dataframe.
    df = pd.DataFrame(lol)
    # Expand the segments dict into one column per language.
    # See https://stackoverflow.com/a/38231651
    return pd.concat([df.drop(['segments'], axis=1), df['segments'].apply(pd.Series)], axis=1)
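Note that the script above reads the whole file and builds every row in memory, which may not be practical for the 2 GB file from the question. Following the database suggestion in the comments under the question, here is a rough sketch that streams the TMX into SQLite with the standard library only. The function name `tmx_to_sqlite`, the `segments` table layout, and the `batch_size` default are assumptions made for illustration, and the tags are assumed to carry no XML namespace.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Name under which ElementTree reports the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_sqlite(tmx_path, db_path, batch_size=10_000):
    """Stream a TMX file into a segments(tu_id, lang, text) table.

    Hypothetical helper; returns the number of rows inserted.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS segments (tu_id INTEGER, lang TEXT, text TEXT)"
    )
    insert = "INSERT INTO segments VALUES (?, ?, ?)"
    rows, total, tu_id, body = [], 0, 0, None
    for event, elem in ET.iterparse(tmx_path, events=("start", "end")):
        if event == "start":
            if elem.tag == "body":
                body = elem  # parent of all <tu> elements
            continue
        if elem.tag != "tu":
            continue
        tu_id += 1
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            # Join multiple <seg> children with a space, as the script above does.
            text = " ".join("".join(seg.itertext()) for seg in tuv.iter("seg"))
            rows.append((tu_id, lang, text))
        # Release units that were already handled.
        (body if body is not None else elem).clear()
        if len(rows) >= batch_size:
            conn.executemany(insert, rows)
            conn.commit()
            total += len(rows)
            rows = []
    if rows:  # flush the final partial batch
        conn.executemany(insert, rows)
        total += len(rows)
    conn.commit()
    conn.close()
    return total
```

Once the data is in SQLite, slices of it can be pulled into pandas when needed, for example with `pd.read_sql_query("SELECT * FROM segments WHERE lang = 'fr'", conn, chunksize=100_000)`, rather than holding everything in one DataFrame.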
