How to retrieve sentences from HTML in order to translate them and insert them back into the HTML?

Question

In our company we have people from different countries who translate texts into their mother tongue. A few years ago, we developed a translation tool. With that tool, both translators and people who need translations can handle the translation process much better than by sending emails.

Now we want to improve the tool and automate translations using services like Google's or DeepL's, so our translators won't have to translate, just check. This will hopefully save them a lot of time. But we have some difficulties handling complex HTML content such as articles. I have tried DeepL, and it seems to return a more accurate and natural translation. But it translates content inside HTML tags. For example, href attributes get translated, so links stop working. Whether I use Google or DeepL, I would like to extract the sentences so I don't get charged for HTML characters.

I have read:

Temporary removal of HTML from string for Google Translate API to reduce cost

Exclude HTML tags when translating with Google Translate API https://stackoverflow.com/a/1732454/5126638

Extract sentences from HTML in PHP

We have PHP code that strips all HTML tags with strip_tags() and splits the resulting text into sentences. After that, each sentence is looked up in the DB. Sentences that are already translated are replaced (with str_replace()) inside the original HTML text. This way I get the HTML content translated into another language.
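The actual code is not shown here, so the following is only a minimal sketch of what the extraction step might look like. The function name extractSentences() and the splitting rule (a sentence ends at ".", "!" or "?" followed by whitespace) are assumptions for illustration; a rule this simple would also split after abbreviations such as "Temp.".

```php
<?php
// Hypothetical helper: strip the tags, then split the plain text into sentences.
// Assumed rule: a sentence ends at ".", "!" or "?" followed by whitespace.
function extractSentences(string $html): array
{
    // Remove all tags and collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

    // Split after the terminating punctuation, keeping it with its sentence.
    return preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$sentences = extractSentences('<p>We sell scanners. Ask for a quote! Delivery is fast.</p>');
print_r($sentences);
```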

I expected this to translate any HTML properly, but tags nested inside a sentence break the logic. The code works with things like:

<p><ul><li>Article about our web page</li></ul></p>

But it can't handle:

<p><ul><li>Article about our <strong>web page</strong></li></ul></p>

When the HTML tags are removed, the sentence is "Article about our web page". After translating it, the code tries to replace it in the original text and fails, because str_replace() can't find that sentence: there is a <strong> in the middle.
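The failure can be reproduced with a few lines. This is a sketch under assumptions, not the real tool: translateNaively() is a hypothetical name, and the $translations array stands in for the DB lookup.

```php
<?php
// Hypothetical stored translation (English -> Spanish), standing in for the DB.
$translations = ['Article about our web page' => 'Artículo sobre nuestra página web'];

// Look for known sentences in the tag-free text, then replace them in the HTML.
function translateNaively(string $html, array $translations): string
{
    $plain = strip_tags($html);
    foreach ($translations as $source => $target) {
        if (strpos($plain, $source) !== false) {
            // Only succeeds if the sentence appears contiguously in the markup.
            $html = str_replace($source, $target, $html);
        }
    }
    return $html;
}

$flat   = '<p><ul><li>Article about our web page</li></ul></p>';
$nested = '<p><ul><li>Article about our <strong>web page</strong></li></ul></p>';

echo translateNaively($flat, $translations), PHP_EOL;   // sentence is replaced
echo translateNaively($nested, $translations), PHP_EOL; // HTML comes back unchanged
```

In the second call the sentence is found in the stripped text, but str_replace() has nothing to match in the markup, so the original HTML is returned untouched.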

How could I improve my code to translate full HTML content?

I have checked, and Google Translate handles this properly. How do they make this work? Is there a library for this?

EDIT: Some examples:

<tr align="left" valign="middle">
<td height="22"><strong>Identification time</strong></td>
<td height="22">&lt; 0.5 Sec.</td>
</tr>

<tr align="left" valign="middle">
<td height="22"><strong>Power supply</strong></td>
<td>DC 5 V / 1.0 A (included)</td>
</tr>

<tr align="left" valign="middle">
<td height="22"><strong>Temp. operation</strong></td>
<td>-30º C ~ +60º C</td>
</tr>

Mozammil · Answer 1 · 2019-01-21T16:59:45.873

You could use preg_replace_callback() to identify and replace the words in your HTML string. Ideally the regex pattern should also exclude HTML tags and not treat, for example, <strong> as a word.

A very naive implementation could be something like this:

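$string = '<p><ul><li>Article about our <strong>web page</strong></li></ul></p>';

// Match whole words, skipping any word that sits inside a tag
// (the negative lookahead rejects a word if a ">" follows it before any "<").
echo preg_replace_callback(
    '/\b(\w+(?![^<>]*>))\b/',
    function ($matches) {
        // Placeholder transformation: uppercase the matched word.
        return strtoupper($matches[0]);
    },
    $string
);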

In my particular case, the output will be:

<p><ul><li>ARTICLE ABOUT OUR <strong>WEB PAGE</strong></li></ul></p>

I am just converting the words to uppercase.

You should replace that with your logic to get the translated word instead. In your case, like you said, translating whole sentences may not work or could prove to be very difficult.

However, if you switch your logic to translating words instead, maybe that could be easier to manipulate? Let me know your thoughts :)
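A rough sketch of that word-level idea, reusing the same pattern. The translateWords() function and the $dictionary lookup table are hypothetical and only illustrate where the translation logic would plug in; a real tool would query its own translation store.

```php
<?php
// Hypothetical word-level dictionary (English -> Spanish).
$dictionary = [
    'article' => 'artículo',
    'about'   => 'sobre',
    'our'     => 'nuestra',
    'web'     => 'web',
    'page'    => 'página',
];

// Replace each word outside of tags with its dictionary entry, if there is one.
function translateWords(string $html, array $dictionary): string
{
    return preg_replace_callback(
        '/\b(\w+(?![^<>]*>))\b/',
        function (array $matches) use ($dictionary): string {
            $key = strtolower($matches[0]);
            // Unknown words are left as they are.
            return $dictionary[$key] ?? $matches[0];
        },
        $html
    );
}

$html = '<p><ul><li>Article about our <strong>web page</strong></li></ul></p>';
echo translateWords($html, $dictionary);
```

With this sample dictionary each word is replaced independently, so the tags survive but the output keeps the English word order ("web página").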

Hi, thanks for your response. It seems to be an interesting solution, but I'm afraid it won't be enough in my case. I'm checking the regex and it does not match the < or > characters. We need to translate sentences like "Identification time < 0.5 Sec.". However, I'll think about preg_replace_callback(), thanks =). Switching to word-by-word translation is not feasible for me. A single word can have multiple translations, and the correct one is given by the context. Splitting at that level loses it. - Adrián Rodriguez, Jan 23 '19 at 08:11
