1

I want to remove all rows in a text file that contain non-English characters using PowerShell. Here is what I've tried so far:

Where-Object {( $_ -notlike '[\x00-\x7F]+' ) -or ( $_ -notlike '[\u4e00-\u9fff]')}

However, the lines with non-English characters (Japanese, Korean, and Russian) are still there and were not removed, such as the following:

多発性硬化
多発性硬化症
다발 경화증
다발성 경화증
タハツセイコウカショウ
Рассеянный склероз

Can someone point out what's wrong with my code? Thanks!

David

2 Answers

3

Try the following:

PS> 'english only', 'mixed 多発性硬化', '多発性硬化', 'mixed склероз', 'склероз'  | 
      Where-Object { $_ -cnotmatch '\P{IsBasicLatin}' }

english only
  • \p{IsBasicLatin} matches any ASCII-range character (any character in the 7-bit Unicode code-point range, 0x0 - 0x7f), and \P{IsBasicLatin} is its negation, i.e. matches any character outside that range.

  • -cnotmatch '\P{IsBasicLatin}' therefore only matches strings that contain no non-ASCII characters, in other words: strings that contain only ASCII-range characters.

    • Note (tip of the hat to js2010 for the pointer):
      • -cnotmatch, the case-sensitive variant of the case-insensitive -notmatch operator, is deliberately used to rule out false positives that would occur with case-insensitive matching, namely with the lowercase ASCII-range letters i and k.

      • The reason is that these characters are also considered the lowercase counterparts of non-ASCII-range characters, namely İ (LATIN CAPITAL LETTER I WITH DOT ABOVE, U+0130), as used in Turkic languages, and K (KELVIN SIGN, U+212A). Therefore, with case-insensitive matching via -match, i and k report $true for both \p{IsBasicLatin} (falling into the ASCII block) and \P{IsBasicLatin} (falling outside the ASCII block); that is, all of the following expressions return $true:

        # !! All return $true; use -cmatch for the expected behavior.
        'i' -match '\p{IsBasicLatin}'; 'i' -match '\P{IsBasicLatin}'
        'k' -match '\p{IsBasicLatin}'; 'k' -match '\P{IsBasicLatin}'
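Since the question is about rows in a text file, here is a minimal sketch of how the same filter might be applied to file content. The file names (terms.txt, terms.ascii.txt) and the sample lines are assumptions made up for illustration; the only part taken from the answer above is the -cnotmatch '\P{IsBasicLatin}' test.

```powershell
# Hypothetical file names for illustration; adjust to your environment.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'terms.txt'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'terms.ascii.txt'

# Write some sample input so that the sketch is self-contained.
'english only', '多発性硬化', 'mixed склероз', 'abc' |
  Set-Content -LiteralPath $inFile -Encoding UTF8

# Keep only the lines that contain no character outside the ASCII range.
$asciiLines = @(
  Get-Content -LiteralPath $inFile -Encoding UTF8 |
    Where-Object { $_ -cnotmatch '\P{IsBasicLatin}' }
)

# Save the filtered lines to a separate file and echo them.
$asciiLines | Set-Content -LiteralPath $outFile -Encoding UTF8
$asciiLines
```

With the sample input above, only the lines english only and abc should remain. Writing to a separate output file is a deliberate choice here, so that the original file stays untouched while you verify the result.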
        
mklement0
0

Here is a demo if you really want to use -notlike, which uses wildcards. It tries to exclude strings with characters in the range U+0000 - U+007F or U+4E00 - U+9FFF, but it ends up not working. Note that the script file must be saved in an encoding that supports the full range of Unicode code points, which for PowerShell 5 means UTF-8 with a BOM.

$mynull = [char]0x00

'多発性硬化',
'多発性硬化症',
'다발 경화증',
'다발성 경화증',
'タハツセイコウカショウ',
'Рассеянный склероз',
'abc' | where {  $_ -notlike "*[$mynull-⌂]*" -or $_ -notlike '*[一-鿿]*' } 
 

多発性硬化
多発性硬化症
다발 경화증
다발성 경화증
タハツセイコウカショウ
Рассеянный склероз
abc

An example that works. (I had previously converted $end to hex incorrectly.)

$beg = [char]0x420
$end = [char]0xff8a

$mystrings = '多発性硬化',
'多発性硬化症',
'다발 경화증',
'다발성 경화증',
'タハツセイコウカショウ',
'Рассеянный склероз',
'',  # 2 surrogate characters in range
'abc'

$mystrings | where { $_ -cnotlike "*[$beg-$end]*" }

#$mystrings | % { $ints = [int[]][char[]]$_; $ints} | sort
#1056-65418
 

abc
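
The range above (0x420 to 0xff8a) was derived from the sample strings, so a character outside it, such as é (U+00E9), would not be caught. As a rough sketch of a variant that does not depend on a hand-picked range, the code-point inspection from the commented-out line can be turned into a direct test. Test-AsciiOnly is a hypothetical helper name, and the café sample is an added assumption, not part of the original data.

```powershell
# Hypothetical helper: returns $true only if every UTF-16 code unit
# in the string is in the ASCII range (0x00 - 0x7F).
function Test-AsciiOnly {
    param([string] $Text)
    foreach ($ch in $Text.ToCharArray()) {
        if ([int] $ch -gt 0x7F) { return $false }
    }
    return $true
}

$samples = '多発性硬化', '다발 경화증', 'Рассеянный склероз', 'café', 'abc'

# Keep only the strings that consist entirely of ASCII-range characters.
$kept = @($samples | Where-Object { Test-AsciiOnly $_ })
$kept
```

Because the comparison is done on numeric code-unit values, case-insensitivity quirks do not come into play here.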
js2010