Regex matching strings with mixture of Japanese and English characters

Question

I have this PowerShell script, which I will eventually use to translate an XML file by replacing some Japanese words with English. For now, it is a simple regex matching example:

$pattern = "(?<=\>)[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+(?=\<\/)"
$text = 'tag3>日本語</tag>漢字</tag>.'

$matches = $text | Select-String -Pattern $pattern -AllMatches | ForEach-Object { $_.Matches.Value }

$matches

This works fine and will return the following:

日本語
漢字

However, I also want it to grab one or more English characters before or after the Japanese characters, with the whole thing wrapped between > and </.

For this string:

tag3>Some text before 日本語 and some text after</tag><Before text 漢字</tag>

It should grab these:

Some text before 日本語 and some text after
Before text 漢字

Extracting elements from an XML with regex makes no sense. There are parsers for that already i.e. `XmlDocument` class. — Santiago Squarzon, Jun 11 '23 at 21:16
As an aside: Using the [automatic `$Matches` variable](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_Automatic_Variables#matches) for custom purposes is best avoided. — mklement0, Jun 11 '23 at 21:18
See also: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — lit, Jun 12 '23 at 22:28

mklement0 · Accepted Answer · 2023-06-11T22:21:53.967

The obligatory general recommendation:

String parsing of XML text is best avoided, because it is inherently limited and brittle. It is always preferable to use a dedicated XML parser, such as .NET's System.Xml.XmlDocument class, which PowerShell provides easy access to via its [xml] type accelerator and the property-based adaptation of the XML DOM; see this answer for an example.
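
As a rough illustration of the parser-based approach, here is a minimal sketch. It assumes a well-formed document; the sample XML, its element names, and the variable names are made up for illustration and are not from the question. The [xml] cast parses the text, the XPath query //text() enumerates all text nodes, and the character classes from the question's regex filter them.

```powershell
# Hypothetical sample document; the element names are illustrative only.
$xmlText = '<root><tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag><tag>English only</tag></root>'

# Parse the text into a System.Xml.XmlDocument via the [xml] type accelerator.
$doc = [xml] $xmlText

# Character classes taken from the question's regex.
$japanese = '[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]'

# Select every text node, then keep those containing at least one Japanese character.
$found = $doc.SelectNodes('//text()') |
  Where-Object { $_.InnerText -match $japanese } |
  ForEach-Object { $_.InnerText }

$found
```

Because the parser hands back text nodes directly, no look-behind or look-ahead for the surrounding tags is needed, and the English-only element is skipped by the filter.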

You can refine your regex as follows:

$pattern = '(?<=[^/]>)[^>\P{IsBasicLatin}]*[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+[^>\P{IsBasicLatin}]*(?=</)'

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# Outputs directly to the console for diagnostic purposes.
$text |
  Select-String -Pattern $pattern -AllMatches |
  ForEach-Object { $_.Matches.Value }

Output:

Some text before 日本語 and some text after
Before text 漢字

For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
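
For readers without access to that page, the same pattern can be written with inline comments. This is only a sketch: it assumes .NET's (?x) inline option (ignore pattern whitespace, allow # comments), and the comments are one reading of the pattern rather than the answer author's own wording. It calls [regex]::Matches() directly instead of Select-String, and $found is a hypothetical variable name.

```powershell
# The pattern from above, split across lines via the (?x) inline option.
$pattern = @'
(?x)                     # ignore pattern whitespace, allow comments
(?<=[^/]>)               # look-behind: a closing angle bracket not preceded by a slash
[^>\P{IsBasicLatin}]*    # zero or more Basic Latin chars, excluding the closing angle bracket
[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+   # one or more Japanese chars
[^>\P{IsBasicLatin}]*    # zero or more Basic Latin chars, excluding the closing angle bracket
(?=</)                   # look-ahead: the start of a closing tag
'@

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# Collect the matched strings in a custom variable (not the automatic $Matches).
$found = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value }

$found
```

Note that the pattern allows only a single run of Japanese characters per match, with Basic Latin text on either side.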

Glad to hear it, @Aziz. PowerShell's support goes beyond just the simple type name, `[xml]`: it provides _property-based_ access to the XML DOM; see [this answer](https://stackoverflow.com/a/75857414/45375) for an example — mklement0, Jun 11 '23 at 22:23
