I have a PowerShell script that I eventually want to use to translate an XML file by replacing Japanese words with their English equivalents. For now, it is just a simple regex-matching example:

$pattern = "(?<=\>)[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+(?=\<\/)"
$text = 'tag3>日本語</tag>漢字</tag>.'

$matches = $text | Select-String -Pattern $pattern -AllMatches | ForEach-Object { $_.Matches.Value }

$matches

This works fine and returns the following:

日本語
漢字

However, I also want it to capture one or more English characters before or after the Japanese characters, with the whole match enclosed between > and </.

For this string:

tag3>Some text before 日本語 and some text after</tag><Before text 漢字</tag>

It should grab these:

Some text before 日本語 and some text after
Before text 漢字
Aziz
  • Extracting elements from an XML with regex makes no sense. There are parsers for that already i.e. `XmlDocument` class. – Santiago Squarzon Jun 11 '23 at 21:16
  • As an aside: Using the [automatic `$Matches` variable](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_Automatic_Variables#matches) for custom purposes is best avoided. – mklement0 Jun 11 '23 at 21:18
  • See also: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – lit Jun 12 '23 at 22:28

1 Answer

The obligatory general recommendation:

  • String parsing of XML text is best avoided, because it is inherently limited and brittle; it's always preferable to use a dedicated XML parser, such as .NET's System.Xml.XmlDocument class, which PowerShell provides easy access to via its [xml] type accelerator and the property-based adaptation of the XML DOM; see [this answer](https://stackoverflow.com/a/75857414/45375) for an example.
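As a minimal sketch of the parser-based approach, the following casts a string to [xml] and selects the text nodes that contain Japanese characters. The sample document, its element names, and the variable names ($doc, $japanese, $japaneseText) are assumptions made up for illustration; a real file would typically be loaded with Get-Content -Raw first.

```powershell
# Hypothetical sample document; the element names are made up for illustration.
[xml] $doc = @'
<root>
  <tag3>Some text before 日本語 and some text after</tag3>
  <tag>Before text 漢字</tag>
  <tag>English only</tag>
</root>
'@

# Matches any single character from the Unicode blocks used in the question.
$japanese = '[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]'

# Select every text node and keep those whose text contains Japanese characters.
$japaneseText = $doc.SelectNodes('//text()') |
  Where-Object { [regex]::IsMatch($_.InnerText, $japanese) } |
  ForEach-Object { $_.InnerText }

$japaneseText
```

Because the parser hands back the complete text of each node, there is no need to model the surrounding > and </ delimiters in the regex at all.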

You can refine your regex as follows:

$pattern = '(?<=[^/]>)[^>\P{IsBasicLatin}]*[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+[^>\P{IsBasicLatin}]*(?=</)'

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# Outputs directly to the console for diagnostic purposes.
$text |
  Select-String -Pattern $pattern -AllMatches |
  ForEach-Object { $_.Matches.Value } 

Output:

Some text before 日本語 and some text after
Before text 漢字

For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
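For readers without access to that page, here is a commented sketch of the same regex, written with the inline (?x) option so that it can span several lines. The variable names $commentedPattern and $found are illustrative only, and the comments reflect my reading of the pattern rather than the linked explanation.

```powershell
# Same regex as above, spread over several lines via the inline (?x) option,
# which makes the engine ignore pattern whitespace and allows # comments.
$commentedPattern = @'
(?x)
(?<=[^/]>)                # preceded by ">", but not by "/>" (self-closing tag)
[^>\P{IsBasicLatin}]*     # zero or more Basic Latin (ASCII) characters other than ">"
[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+  # one or more Japanese characters
[^>\P{IsBasicLatin}]*     # zero or more Basic Latin characters other than ">"
(?=</)                    # followed by "</", the start of a closing tag
'@

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# A custom variable name is used so as not to overwrite the automatic $Matches variable.
$found = [regex]::Matches($text, $commentedPattern) | ForEach-Object { $_.Value }
$found
```

Note that the pattern expects a single contiguous run of Japanese characters per match; text in which Japanese and Latin characters alternate several times within one element would not be captured as a whole.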

mklement0
  • Thank you! Will look into the .NET XML Parser – Aziz Jun 11 '23 at 22:12
  • Glad to hear it, @Aziz. PowerShell's support goes beyond just the simple type name, `[xml]`: it provides _property-based_ access to the XML DOM; see [this answer](https://stackoverflow.com/a/75857414/45375) for an example – mklement0 Jun 11 '23 at 22:23