In our company, people from different countries translate texts into their mother tongue. A few years ago we developed a translation tool. With it, both translators and the people who need translations can handle the process far better than by sending emails.
Now we want to improve the tool and automate translations using services like Google's or DeepL's, so our translators won't have to translate, just review. This will hopefully save them a lot of time. But we have difficulties handling complex HTML content such as articles. I have tried DeepL and it seems to return more accurate and natural translations, but it also translates content inside HTML tags. For example, href attributes get translated, so links stop working. Whether I use Google or DeepL, I would like to extract the sentences so I don't get charged for the HTML characters.
I have read:
Temporary removal of HTML from string for Google Translate API to reduce cost
Exclude HTML tags when translating with Google Translate API https://stackoverflow.com/a/1732454/5126638
Extract sentences from HTML in PHP
We have PHP code that removes all HTML tags with strip_tags() and splits the resulting text into sentences. Each sentence is then looked up in the DB. Sentences that are already translated are replaced (with str_replace()) inside the original HTML text. This way I get the HTML content translated into another language.
I expected this to translate any HTML properly, but inline tags nested inside a sentence break the logic. The code works with markup like:
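To make the approach concrete, here is a simplified sketch of it, not our production code. The function name `translateHtml`, the `$translations` array standing in for the DB lookup, the naive sentence-splitting regex, and the Spanish translation are all illustrative assumptions.

```php
<?php
// Hypothetical stand-in for the DB: source sentence => stored translation.
$translations = [
    'Article about our web page' => 'Artículo sobre nuestra página web',
];

/**
 * Simplified sketch of the approach described above (name is illustrative).
 */
function translateHtml(string $html, array $translations): string
{
    // 1. Drop all tags to get plain text.
    $text = strip_tags($html);

    // 2. Naive sentence split: break after ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // 3. Put every already-translated sentence back into the original HTML.
    foreach ($sentences as $sentence) {
        if (isset($translations[$sentence])) {
            $html = str_replace($sentence, $translations[$sentence], $html);
        }
    }

    return $html;
}

echo translateHtml('<p><ul><li>Article about our web page</li></ul></p>', $translations);
// prints: <p><ul><li>Artículo sobre nuestra página web</li></ul></p>
```

The replacement only works when the tag-free sentence appears verbatim in the markup, which is the weak point of this design.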
<p><ul><li>Article about our web page</li></ul></p>
But it can't handle:
<p><ul><li>Article about our <strong>web page</strong></li></ul></p>
When the HTML tags are removed, the sentence is "Article about our web page". After translating it, the code tries to replace it in the original text and fails, because str_replace() can't find that sentence: there is a <strong> tag in the middle.
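A minimal reproduction of the failure, using only strip_tags() and str_replace(). The variable names and the Spanish translation are illustrative assumptions.

```php
<?php
$html = '<p><ul><li>Article about our <strong>web page</strong></li></ul></p>';

// The sentence we look up and translate is the tag-free text...
$sentence = strip_tags($html); // "Article about our web page"

// ...but that exact string never occurs in the original markup.
$translated = 'Artículo sobre nuestra página web'; // illustrative translation
$result = str_replace($sentence, $translated, $html, $count);

var_dump($count);            // int(0): nothing was replaced
var_dump($result === $html); // bool(true): the HTML comes back unchanged
```

So the translation is found, but there is no place in the original HTML where it can be put back.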
How could I improve my code to translate full HTML content?
I have checked, and Google Translate handles this properly. How do they make it work? Is there an existing library for this?
EDIT: Some examples:
<tr align="left" valign="middle">
<td height="22"><strong>Identification time</strong></td>
<td height="22">< 0.5 Sec.</td>
</tr>
<tr align="left" valign="middle">
<td height="22"><strong>Power supply</strong></td>
<td>DC 5 V / 1.0 A (included)</td>
</tr>
<tr align="left" valign="middle">
<td height="22"><strong>Temp. operation</strong></td>
<td>-30º C ~ +60º C</td>
</tr>