I have a big file that contains some non-ASCII characters, and I need to find the records that contain them. Note: I am not able to open the file using Notepad++ etc.

I tried something like this from cmd: `findstr /R /N "[^\x00-\x7F]" Test.txt`

But this returns extra rows that don't contain any non-ASCII characters.

Example of the chars: �

Test.txt contains:

 �      
asdf
sdf asd
1231 sdfg dfg

Result:
1: ∩┐╜  ►←→    ☼    --Expected row
3:sdf asd           --Unexpected row
4:1231 sdfg dfg     --Unexpected row
  • What do you mean by "record", a line? Also, what's the character encoding used by the text file? If you use the right encoding, the characters might not actually be �. – Tom Blodget Mar 23 '17 at 16:52
  • Yes, I mean a line. I want to find those lines and remove them from the file. The encoding is UTF-8 w/o BOM. Also, I don't create these files; they are sent by another party, who won't change the encoding. – Adiya Buyantogtokh Mar 23 '17 at 16:59
  • Your approach is wrong. UTF-8 files contain two-byte characters, so you cannot simply scan for bytes greater than `\x7F`. Even if it worked using `findstr /V /R /C:"[\x80-\xFF]"` (you'd need to specify the characters literally, as `findstr` does not understand escaped hex codes), I am quite sure that it would not work due to [`findstr` bugs and limitations](http://stackoverflow.com/q/8844868)... – aschipfl Mar 23 '17 at 19:13
  • Install gVim and use your regex there. – Wiktor Stribiżew Mar 23 '17 at 20:14
  • @aschipfl - True that FINDSTR cannot. But by definition, any byte between 0x80-0xFF is non-ASCII. UTF-8 cannot encode a non-ASCII character without such a byte. So a line contains a non-ASCII character if and only if it contains at least one byte between 0x80-0xFF. – dbenham Mar 24 '17 at 05:21
  • You could use [JREPL.BAT](http://www.dostips.com/forum/viewtopic.php?t=6044) - `jrepl "[^\x00-\x7F]" "" /k 0 /f yourfile.txt` – dbenham Mar 24 '17 at 05:31

1 Answer

Try this from the command prompt:

set "F=1.txt" & echo var r=0,c=0,l,s=(new ActiveXObject("Scripting.FileSystemObject")).OpenTextFile(WScript.Arguments(0),1);while(!s.AtEndOfStream){++c;if(/[^^\r\n\x20-\x7f]/.test(l=s.ReadLine())){r=1;WScript.Echo(c+": "+l);}}s.Close();WScript.Quit(r);>"%TEMP%\1.js" & (call cscript /nologo "%TEMP%\1.js" "%F%") & del "%TEMP%\1.js" & set "F="

`set "F=1.txt"` sets the name of the file to test.

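`echo ... >"%TEMP%\1.js"` creates a JScript file that reads the file passed as its first argument line by line and prints, with its line number, every line that matches `/[^\r\n\x20-\x7f]/`. The `^^` in the command is just `^` escaped for cmd. Note that this pattern also flags tabs and other control characters, not only non-ASCII ones. The script exits with code 1 if at least one such line was found, and 0 otherwise.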

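`call cscript ...` launches the created script. The `call` makes cmd expand `%F%` a second time, which is needed because the variable is set earlier on the same line.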

`del "%TEMP%\1.js" & set "F="` is cleanup: it deletes the temporary script and clears the variable.

Dmitry Sokolov