PowerShell: find non-ASCII characters in a text file

Question

I am trying to find a way to do the following with a PowerShell script.

For each line in text file, check if line contains non-ASCII characters
If line contains non-ASCII characters, output to separate file
If line does not contain non-ASCII characters, skip to next line

By non-ASCII characters, I'm referring to non keyboard characters, e.g. accented characters, characters from another language, etc.

Sample Data

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Full Name
 - Léna Rémi

Output Data

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Léna Rémi

I found the regex in other threads to remove non-ASCII characters but I couldn't seem to make it work.

Please help!

** EDIT ** Thanks everyone for the help! I have managed to do what I wanted with the below script.

$nonASCII = '[^\x00-\x7F]'
foreach ($line in [System.IO.File]::ReadLines($source)) {
    # Keep only lines with at least one character outside the ASCII range
    if ($line -cmatch $nonASCII) {
        $line | Out-File $output -Append
    }
}

What do you exactly mean by non-ascii characters? Which encodings are you using? Can you add some sample data with desired outputs? — vonPryz, May 14 '20 at 08:21
@vonPryz I've edited my main thread to further elaborate on what I hope to achieve. — Arolix, May 14 '20 at 12:58
you can use a negated character class and test for that class. something like `-match '[^0-9a-z]'`. plus, there are supposed to be ways to specify unicode character classes. i can't recall how, tho ... [*blush*] — Lee_Dailey, May 14 '20 at 13:06
Test with regex `^[\x20-\x7e]+$`? That should match characters from `space` to `tilde`. Don't use `[ -~]` instead, unless you want to give the next maintainer a lot of headache. — vonPryz, May 14 '20 at 13:08
@Lee_Dailey I would use -cnotmatch. There's some funny exceptions otherwise between the capital and small versions of characters. — js2010, May 14 '20 at 13:08
@js2010 - i have not run into that ... yet. thank you for the info! [*grin*] — Lee_Dailey, May 14 '20 at 13:35
@Lee_Dailey Actually your test works, but the letter i passes as non-ascii in this test: `echo i I | where { $_ -match '[\u0080-\uffff]' }`. — js2010, May 14 '20 at 13:48

Mathias R. Jessen · Answer 1 · 2020-05-14T13:18:16.897

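Define a character set that describes the ASCII characters from space through DEL (code points 32 through 127 == [\x20-\x7F]), then negate it with ^ to match any character outside that range! Note that ASCII proper spans code points 0 through 127, so this pattern also flags control characters such as tab; use [^\x00-\x7F] to match strictly non-ASCII characters.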

Let's test it against my (non-ASCII) name:

PS C:\> 'Mathias R. Jessen' -cmatch '[^\x20-\x7F]'
False
PS C:\> 'Mathias Rørbo Jessen' -cmatch '[^\x20-\x7F]'
True

To filter a list of strings, simply use the -cmatch operator in filter mode:

$strings = 'குழந்தைகளுக்கான பெயர்கள்', 'Boring John Doe', 'Léna Rémi'

$nonASCIIstrings = @($strings) -cmatch '[^\x20-\x7F]'

Or if you want to filter along a pipeline, use Where-Object:

$strings |Where-Object {$_ -cmatch '[^\x20-\x7F]'}
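The difference between the two negated ranges only shows up when the input contains control characters. A minimal sketch, using hypothetical sample strings that are not from the question, where one entry contains a tab:

```powershell
# Hypothetical sample values, not taken from the original question.
$samples = 'plain text', "tab`tseparated", 'Léna Rémi'

# Space-through-DEL range negated: also flags the tab (code point 9).
$outsideSpaceToDel = @($samples) -cmatch '[^\x20-\x7F]'

# Full 0-127 range negated: flags only strings with non-ASCII characters.
$nonAscii = @($samples) -cmatch '[^\x00-\x7F]'

$outsideSpaceToDel   # the tab-separated string and 'Léna Rémi'
$nonAscii            # only 'Léna Rémi'
```

Which pattern fits depends on whether tabs and other control characters in the file should count as "non-ASCII" for your purposes.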

mklement0 · Answer 2 · 2022-12-25T23:27:54.683

The .NET regex engine supports a direct expression of the concept "non-ASCII character": \P{IsBasicLatin} (the inverse, i.e. "ASCII character", is \p{IsBasicLatin}):

' - 张伟',
' - குழந்தைகளுக்கான பெயர்கள்',
' - 日本人の氏名',
' - Full Name', 
' - Léna Rémi' -cmatch '\P{IsBasicLatin}'

IsBasicLatin is an example of a named (Unicode) block.

The above requires -cmatch, the case-sensitive variant of the regular-expression matching operator -match,[1] and outputs those input lines (array elements) that contain at least one non-ASCII-range character:

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Léna Rémi

For a streaming solution, i.e. processing lines read from a file one by one, you can combine -cmatch with the Where-Object cmdlet:

Get-Content in.txt | 
  Where-Object { $_ -cmatch '\P{IsBasicLatin}' } |
    Set-Content -Encoding Utf8 out.txt

Note that Get-Content is used to read the file line by line. While [System.IO.File]::ReadLines("$pwd\in.txt") works too, it is only necessary if there's a performance problem.
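If performance does become a concern, the ReadLines variant could be wrapped roughly as follows. This is a sketch under assumptions: the function name Select-NonAsciiLine is hypothetical (not a built-in cmdlet), and the input file is assumed to be UTF-8, which is what ReadLines expects when no BOM is present.

```powershell
# Hypothetical helper; the name is illustrative and not part of PowerShell.
function Select-NonAsciiLine {
    param(
        [Parameter(Mandatory)]
        [string] $LiteralPath
    )

    # .NET methods resolve relative paths against the process working
    # directory, not PowerShell's current location, so use a full path.
    $fullPath = Convert-Path -LiteralPath $LiteralPath

    # ReadLines enumerates the file lazily, one line at a time.
    foreach ($line in [System.IO.File]::ReadLines($fullPath)) {
        if ($line -cmatch '\P{IsBasicLatin}') {
            $line   # emit the matching line to the pipeline
        }
    }
}

# Usage with the file names from the example above:
# Select-NonAsciiLine in.txt | Set-Content -Encoding Utf8 out.txt
```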

[1] The reason is that with case-insensitive matching, the lowercase ASCII i and k characters are considered both inside and outside the ASCII block, i.e. 'i' -match '\P{IsBasicLatin}' and 'i' -match '\p{IsBasicLatin}' are both $true. For an explanation, see this answer. Tip of the hat to js2010.
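To check the footnote's claim on a given machine, a small comparison like the one below can help. It is only a sketch: the variable names are illustrative, and the case-insensitive result is what the footnote reports rather than something guaranteed across all .NET versions.

```powershell
# Case-sensitive: 'i' lies inside the Basic Latin block, so it is not flagged.
$caseSensitive = 'i' -cmatch '\P{IsBasicLatin}'

# Case-insensitive: the footnote reports $true here; verify on your own runtime.
$caseInsensitive = 'i' -match '\P{IsBasicLatin}'

# A genuinely non-ASCII character is flagged by the case-sensitive operator.
$accented = 'é' -cmatch '\P{IsBasicLatin}'

"cmatch on 'i': $caseSensitive"
"match  on 'i': $caseInsensitive"
"cmatch on 'é': $accented"
```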

score 2 · Answer 3 · answered May 14 '20 at 13:18

Here's a script I have to remove non-ASCII characters from an XML file. Maybe you can use it as a starting point. I'm removing characters that are not between space and tilde in the ASCII table, and also not tab. To me, ASCII is in the range 0-127. Get-Content strips the carriage returns and linefeeds when it splits the file into lines.

(get-content $args[0]) -replace '[^ -~\t]' | set-content $args[0]
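Following the comment above that recommends hex escapes over `[ -~]`, the same removal could be sketched with an explicit range and tried on in-memory strings before touching a file. The sample strings are hypothetical, and -creplace is used here so that case-insensitive matching cannot affect which characters are kept:

```powershell
# Hypothetical sample input, not from the original thread.
$lines = "Léna`tRémi", '张伟 Zhang Wei'

# Remove everything except printable ASCII (space through tilde) and tab.
# With no replacement operand, -creplace deletes each match.
$cleaned = $lines -creplace '[^\x20-\x7E\t]'

$cleaned
```

Once the result looks right, the same pattern can replace the one in the one-liner above.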
