Batch: searching for a Unicode string in a text file

Question

I am trying to search for a Unicode string in a text file. My program works fine when I search for plain ASCII characters, but I don't know how to do it with UTF-8 text.

This is what I am doing with my program:

    FOR %%a IN (\\%ip%\Print\*.txt) DO (findstr /c:"hallo" "%%a"...

This is what I want to do now:

    FOR %%a IN (\\%ip%\Print\*.txt) DO (findstr /c:"привет" "%%a"

Hope someone can help me :)

score 0 · Answer 1 · answered Sep 20 '17 at 03:23

FINDSTR does not properly read or search any form of Unicode. It works exclusively with single-byte ANSI (extended ASCII) encodings.

But it is possible to treat your UTF-8 file as extended ASCII and accomplish your search, though it will not print the result to the console properly.

The trick is to place your search string in another UTF-8 file. For this example let's say the search string is stored in "find.txt" with UTF-8 encoding (no BOM).

    for %%a in ("\\%ip%\Print\*.txt") do findstr /g:find.txt "%%a"
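One possible way to produce such a file from a batch script is to write the UTF-8 bytes directly, which avoids depending on an editor's save options or the console code page. This is only a sketch: it assumes PowerShell is available, and the byte list is the UTF-8 encoding of привет, so any other search string would need its own bytes.

```batch
@echo off
rem Sketch (assumes PowerShell is installed): create find.txt in the current
rem directory as UTF-8 without a BOM by writing the raw bytes of the search string.
rem The bytes below are the UTF-8 encoding of "привет" (two bytes per letter).
powershell -NoProfile -Command "[IO.File]::WriteAllBytes((Join-Path $PWD 'find.txt'), [byte[]](0xD0,0xBF,0xD1,0x80,0xD0,0xB8,0xD0,0xB2,0xD0,0xB5,0xD1,0x82))"
```

Because no BOM bytes (EF BB BF) are written, the search string starts directly with the first letter.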

FINDSTR will not understand the multi-byte Unicode code points; instead it will interpret each byte as a separate character. The correct matching lines will be found and printed, but any multi-byte code points in them will be displayed incorrectly on the console.

If you redirect the output to a file, then the resultant file will have the correct UTF-8 encoding.
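As a minimal sketch of that redirection, the whole loop can be wrapped in parentheses and sent to a single file. The name result.txt is hypothetical, and %ip% is assumed to be set earlier in the script, as in the question.

```batch
@echo off
rem Sketch: collect the matching lines in a file instead of the console.
rem find.txt holds the search string as UTF-8 without a BOM (see above);
rem result.txt is a hypothetical output name in the current directory.
rem Redirecting the whole block opens result.txt once for all searched files.
(
  for %%a in ("\\%ip%\Print\*.txt") do findstr /g:find.txt "%%a"
) >result.txt
```

Opening result.txt in an editor that reads UTF-8 should then show the matched lines with the original characters intact.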

Note that you must use the /G:file option. You cannot use the /C:"string" option because FINDSTR improperly interprets command line arguments that contain byte values above 0x80. See the section titled "Character limits for command line parameters - Extended ASCII transformation" at "What are the undocumented features and limitations of the Windows FINDSTR command?" for more information.

answered Sep 20 '17 at 03:23

dbenham

hey @dbenham, thanks for your answer. But I still have a problem: I want to rename the text file that contains the matching text, so I tried: if not errorlevel 1 (rename "%%a" "bye".txt). It doesn't rename the actual file but my helper file (find.txt) instead... any suggestions what I am doing wrong? – Sebastian Sep 21 '17 at 14:57
@Uli - No, I don't follow the logic of what you are trying to do. That sounds like a different question. – dbenham Sep 21 '17 at 15:44
@dbenham ok, well, the question is still about findstr and your solution: let's say I have 4 text files: findEng.txt, findRus.txt, Eng.txt and Rus.txt. The English files contain only normal characters (hello); the other ones contain UTF-8 (привет). When I run findstr /g:findEng.txt Eng.txt in my batch, it prints hello in my console... with findstr /g:findRus.txt Rus.txt, shouldn't the console print the equivalent ANSI characters? It is not doing that. In my findRus.txt file there are these ´╗┐ in front of my actual characters... is this the reason that it is not working? – Sebastian Sep 25 '17 at 06:05
@Uli - Read the 4th and 5th paragraphs of my answer carefully, and you should have your answer. Redirect the output of the findRus.txt search to a new file, and look at the result - it should be the correct result with UTF-8 encoding (but without any BOM, unless the original file had the BOM and you happened to match the first line in the original file) – dbenham Sep 25 '17 at 11:52
