I am trying to find a way to do the following with a PowerShell script.

  1. For each line in text file, check if line contains non-ASCII characters
  2. If line contains non-ASCII characters, output to separate file
  3. If line does not contain non-ASCII characters, skip to next line

By non-ASCII characters, I'm referring to non keyboard characters, e.g. accented characters, characters from another language, etc.

Sample Data

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Full Name
 - Léna Rémi

Output Data

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Léna Rémi

I found the regex in other threads to remove non-ASCII characters but I couldn't seem to make it work.

Please help!

** EDIT ** Thanks everyone for the help! I have managed to do what I wanted with the below script.

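# Matches any character outside the 7-bit ASCII range (code points 0-127).
$nonASCII = "[^\x00-\x7F]"
# $source and $output hold the input and output file paths.
foreach ($line in [System.IO.File]::ReadLines($source)) {
    if ($line -cmatch $nonASCII) {
        $line | Out-File $output -Append
    }
}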
Arolix
  • What exactly do you mean by non-ASCII characters? Which encodings are you using? Can you add some sample data with desired outputs? – vonPryz May 14 '20 at 08:21
  • @vonPryz I've edited my main thread to further elaborate on what I hope to achieve. – Arolix May 14 '20 at 12:58
  • you can use a negated character class and test for that class. something like `-match '[^0-9a-z]'`. plus, there are supposed to be ways to specify unicode character classes. i can't recall how, tho ... [*blush*] – Lee_Dailey May 14 '20 at 13:06
  • Test with regex `^[\x20-\x7e]+$`? That should match characters from `space` to `tilde`. Don't use `[ -~]` instead, unless you want to give the next maintainer a lot of headaches. – vonPryz May 14 '20 at 13:08
  • @Lee_Dailey I would use -cnotmatch. There's some funny exceptions otherwise between the capital and small versions of characters. – js2010 May 14 '20 at 13:08
  • @js2010 - i have not run into that ... yet. thank you for the info! [*grin*] – Lee_Dailey May 14 '20 at 13:35
  • @Lee_Dailey Actually your test works, but the letter i passes as non-ascii in this test: `echo i I | where { $_ -match '[\u0080-\uffff]' }`. – js2010 May 14 '20 at 13:48

3 Answers

Define a character set that describes the ASCII characters from space onward (code points 32 through 127 == [\x20-\x7F]), then negate it with ^ to match any character outside that range!

Let's test it against my (non-ASCII) name:

PS C:\> 'Mathias R. Jessen' -cmatch '[^\x20-\x7F]'
False
PS C:\> 'Mathias Rørbo Jessen' -cmatch '[^\x20-\x7F]'
True

To filter a list of strings, simply use the -cmatch operator in filter mode:

$strings = 'குழந்தைகளுக்கான பெயர்கள்', 'Boring John Doe', 'Léna Rémi'

$nonASCIIstrings = @($strings) -cmatch '[^\x20-\x7F]'

Or if you want to filter along a pipeline, use Where-Object:

$strings |Where-Object {$_ -cmatch '[^\x20-\x7F]'}
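
Note that the class above starts at \x20, so it also flags ASCII control characters such as tab. If those should count as ASCII, widening the range to \x00-\x7F is one option. A minimal sketch of the difference ($tabbed is a made-up sample value, not from the question):

```powershell
# A tab (U+0009) is ASCII, but it lies outside \x20-\x7F.
$tabbed = "Full`tName"

$tabbed -cmatch '[^\x20-\x7F]'       # True: the tab is flagged
$tabbed -cmatch '[^\x00-\x7F]'       # False: only code points above 127 match
'Léna Rémi' -cmatch '[^\x00-\x7F]'   # True: é is U+00E9
```

Which range is right depends on whether tab-separated lines should end up in the output file.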
Mathias R. Jessen

The .NET regex engine supports a direct expression of the concept "non-ASCII character": \P{IsBasicLatin} (the inverse, i.e. "ASCII character", is \p{IsBasicLatin}):

' - 张伟',
' - குழந்தைகளுக்கான பெயர்கள்',
' - 日本人の氏名',
' - Full Name', 
' - Léna Rémi' -cmatch '\P{IsBasicLatin}'

IsBasicLatin is an example of a named (Unicode) block.

The above uses -cmatch, the case-sensitive variant of the regular-expression matching operator -match,[1] and outputs those input lines (array elements) that contain at least one non-ASCII-range character:

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Léna Rémi

For a streaming solution (processing lines read from a file one by one), you can combine -cmatch with the Where-Object cmdlet:

Get-Content in.txt | 
  Where-Object { $_ -cmatch '\P{IsBasicLatin}' } |
    Set-Content -Encoding Utf8 out.txt

Note that Get-Content is used to read the file line by line. [System.IO.File]::ReadLines("$pwd\in.txt") works too, but it is only necessary if there's a performance problem.
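
If performance does become a problem, the ReadLines variant could look roughly like the sketch below. The temp-directory file names and the three sample lines are made up so that the snippet runs on its own; substitute your real paths.

```powershell
# Hypothetical sample files in the temp directory, so the sketch is self-contained.
$inPath  = Join-Path ([System.IO.Path]::GetTempPath()) 'in.txt'
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'out.txt'
Set-Content -Encoding Utf8 -Path $inPath -Value 'Full Name', 'Léna Rémi', '张伟'

# ReadLines needs a full path, because .NET's working directory can
# differ from PowerShell's current location.
[System.IO.File]::ReadLines($inPath) |
  Where-Object { $_ -cmatch '\P{IsBasicLatin}' } |
    Set-Content -Encoding Utf8 -Path $outPath

# Show the lines that were kept.
Get-Content -Encoding Utf8 $outPath
```

The pipeline still processes one line at a time; only the reader has changed.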


[1] The reason is that with case-insensitive matching, the lowercase ASCII i and k characters are considered both inside and outside the ASCII block, i.e. 'i' -match '\P{IsBasicLatin}' and 'i' -match '\p{IsBasicLatin}' are both $true. For an explanation, see this answer. Tip of the hat to js2010.
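
A quick way to check this on your own machine is the sketch below. Treat the first two results as the footnote's claim rather than a guarantee for every PowerShell version; the case-sensitive results are the ones the answer relies on.

```powershell
# Case-insensitive: per the footnote, both of these reportedly return True.
'i' -match '\P{IsBasicLatin}'
'i' -match '\p{IsBasicLatin}'

# Case-sensitive: 'i' is only inside the ASCII block.
'i' -cmatch '\P{IsBasicLatin}'   # False
'i' -cmatch '\p{IsBasicLatin}'   # True
```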

mklement0

Here's a script I use to remove non-ASCII characters from an XML file. Maybe you can use it as a starting point. It removes every character that is neither between space and tilde in the ASCII table nor a tab. To me, ASCII is the range 0-127. Get-Content strips the carriage returns and linefeeds, so they never reach the regex.

(get-content $args[0]) -replace '[^ -~\t]' | set-content $args[0]
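
To see what the pattern does without overwriting a file, here is a small sketch on in-memory strings. The sample values are made up, and it uses -creplace (the case-sensitive form) rather than -replace as a precaution, given the case-insensitivity quirk mentioned in the comments; that swap is an assumption of this sketch, not part of the script above.

```powershell
# Made-up sample lines: one with accents and a tab, one plain, one fully non-ASCII.
$lines = "Léna`tRémi", 'Full Name', '日本人の氏名'

# Drop everything that is neither space..tilde nor a tab.
$cleaned = $lines -creplace '[^ -~\t]'

$cleaned[0]          # Lna<tab>Rmi: accented letters removed, tab kept
$cleaned[1]          # Full Name: unchanged
$cleaned[2].Length   # 0: nothing left of the last line
```

Keep in mind that this removes characters rather than filtering lines, so it solves a slightly different problem from the one in the question.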
js2010