
I use a translator tool to translate English into Simplified Chinese.

There is an issue with the full stop.

English ends a sentence with a full stop ".". Simplified Chinese uses "。", which looks like a small circle.

The translation tool mistakenly adds this "small circle" full stop to every heading and subtitle.

Is there a way to use regex or another method to scan the translated content and remove any Chinese full stop when the line has 20 characters or fewer?

Some test data:

<h1>这是一个测试。<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。

测试。
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。

It should turn into:

<h1>这是一个测试<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。

测试
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。

Difference: line 1 has only 10 characters, so its Chinese full stop should be removed. Line 4 is a subheading with only 4 characters, so its full stop should be removed too.

By the way, I was told that one Chinese character counts as two English characters.

Is this possible?

Sweeper
stackmike
  • It seems like you are deciding whether to remove a fullstop based on whether that line is a "heading". How do you tell whether a line is a "heading"? – Sweeper Apr 09 '21 at 01:43
  • There are two ways. First: if a line has fewer than 20 characters, it should not have a Chinese full stop. This will catch most of these errors. Second, and maybe more accurate: if there is no comma in the line, it should not have a full stop. – stackmike Apr 09 '21 at 01:46
  • Which programming language/regex engine are you using? – Hao Wu Apr 09 '21 at 01:51
  • I am using an automation tool. It can run c# or javascript, and regex. I am not sure what engine version it is. – stackmike Apr 09 '21 at 01:53
  • If I use the Chinese full stop symbol in the built-in tester, it removes all full stops. I want a condition. In the image below, only the ones highlighted by the blue arrows should be removed. When the line has a comma, the full stop is OK. https://i.imgur.com/iX0KSlN.jpg – stackmike Apr 09 '21 at 01:59

1 Answer


I'm using approach 2

Second: maybe this one is more accurate: if there is no comma in this line, it should not have a full stop.

to determine whether a full stop should be removed.

Regex

/^(?=.*。)(?!.*,)([^。]*)。/mg
  • ^ start of a line
  • (?=.*。) match a line that contains a full stop 。
  • (?!.*,) match a line that doesn't contain a comma ,
  • ([^。]*)。 anything that is not a full stop, followed by a full stop; the part before the full stop goes into group 1

Substitution

$1

Check the test cases here

Note that this only removes the first full stop on each matching line.
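Since the tool in the question can run JavaScript, the regex and substitution above could be applied roughly like this. It is a minimal sketch: the function name stripHeadingStops is made up for illustration, and the character class [,,] is an assumption that widens the original comma check to accept both the ASCII comma and the full-width comma.

```javascript
// Hypothetical helper: removes the first Chinese full stop on every line
// that contains no comma. The m flag makes ^ match at each line start,
// and the g flag applies the replacement to every matching line.
function stripHeadingStops(text) {
  // [,,] is an assumption: it treats ASCII and full-width commas alike.
  const pattern = /^(?=.*。)(?!.*[,,])([^。]*)。/gm;
  return text.replace(pattern, "$1");
}

const sample = [
  "<h1>这是一个测试。<h1>",
  "这是一个测试,这是一个测试而已,希望去掉不需要的。",
  "",
  "测试。",
  "这是一个测试,这是一个测试而已,希望去掉不需要的第二行。",
].join("\n");

console.log(stripHeadingStops(sample));
```

On a comma-free line with two full stops, such as "标题。副标题。", only the first one is removed, which is the limitation mentioned above.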

If you want to remove all the full stops, you can try (?:\G|^)(?=.*。)(?!.*,)(.*?)。 but this only works in regex engines that support \G, such as PCRE.
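JavaScript's regex syntax has no \G anchor, so the pattern above will not work there. A different technique that gives the same result is to split the text into lines and strip every full stop from the lines that have no comma. This is a sketch, not the answer's method: the function name is hypothetical, [,,] again assumes both comma forms should count, and lines are assumed to be separated by "\n".

```javascript
// Hypothetical helper: on every comma-free line, remove ALL Chinese full
// stops. Lines that contain a comma are returned untouched.
function stripAllStopsOnCommaFreeLines(text) {
  return text
    .split("\n") // assumes "\n" line endings
    .map((line) => (/[,,]/.test(line) ? line : line.replace(/。/g, "")))
    .join("\n");
}

console.log(stripAllStopsOnCommaFreeLines("标题。副标题。\n这是一个测试,希望保留。"));
```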

Also, if you want to combine the two approaches (the line has no comma and is at most 20 characters long), you can try ^(?=.{1,20}$)(?=.*。)(?!.*,)([^。]*)。
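A sketch of the combined check in JavaScript, with the length limit as a parameter. The function name and the maxLen parameter are illustrative, and [,,] is an assumption as before. Note that in a JavaScript regex without the u flag, . matches one UTF-16 code unit, so a common Chinese character counts as one character here, not two.

```javascript
// Hypothetical helper: removes the first Chinese full stop on lines that
// are at most maxLen characters long and contain no comma.
function stripShortHeadingStops(text, maxLen = 20) {
  const pattern = new RegExp(
    `^(?=.{1,${maxLen}}$)(?=.*。)(?!.*[,,])([^。]*)。`,
    "gm"
  );
  return text.replace(pattern, "$1");
}

const lines = [
  "测试。", // short, no comma: full stop removed
  "这是一个没有逗号但是超过二十个字符的很长很长的句子。", // 26 characters: kept
  "短句,有逗号。", // has a comma: kept
].join("\n");

console.log(stripShortHeadingStops(lines));
```

Raising or lowering maxLen is the only change needed if 20 turns out to be too strict or too loose for real headings.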

Hao Wu