My problem is that the regex can't find accented words, but my text
file contains a lot of them.
My command line is:
cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt
[...]
How can I fix it?
grep searches these files as a stream of bytes (8-bit characters), and those bytes also have to be valid in the encoding of your current locale settings.
It gets worse if your words.txt files are encoded in UTF-8, UTF-16, or UTF-32, or in ISO-8859-1 (Latin-1), because an accented letter such as é is then stored as a different byte sequence in each encoding.
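To make the byte-level view concrete, here is a minimal sketch (not part of the original commands) that counts the bytes of é in two encodings and shows what a byte-oriented match sees. It assumes a POSIX shell and a grep that treats each byte as one character under LC_ALL=C; the variable names are only illustrative.

```shell
#!/bin/sh
# é is two bytes in UTF-8 (0xC3 0xA9, octal 303 251)
# and one byte in ISO-8859-1 (0xE9, octal 351).
utf8_len=$(printf '\303\251' | wc -c | tr -d ' ')
latin1_len=$(printf '\351' | wc -c | tr -d ' ')
echo "UTF-8: $utf8_len byte(s), ISO-8859-1: $latin1_len byte(s)"

# In the C locale grep matches byte by byte, so the UTF-8 encoded é
# looks like a line of two characters ('^..$'), not one.
c_matches=$(printf '\303\251\n' | LC_ALL=C grep -c '^..$')
echo "lines matching '^..\$' in the C locale: $c_matches"
```

This is why a bracket expression like [éra] can silently fail when the locale and the file encoding disagree: the pattern and the data are no longer compared as the same sequence of characters.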
To handle all such encodings, use ugrep instead of grep; it processes UTF-encoded files and matches Unicode patterns. Note that ugrep uses extended regex syntax by default, so the braces are not escaped (add option -G to keep the basic grep syntax):
cat input/words.txt | ugrep '^[éra]{1,4}$' > output/words_era.txt
cat input/words.txt | ugrep '^[carroça]{1,7}$' > output/words_carroca.txt
If the input files are encoded in ISO-8859-1, use ugrep with option --encoding=ISO-8859-1 (early ugrep releases also accepted the short form -QISO-8859-1; in current releases -Q starts the interactive query mode instead). Either way, the ugrep output is always UTF-8.
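If installing ugrep is not an option, one alternative (an assumption on my part, not something the approach above depends on) is to re-encode the file to UTF-8 with iconv first and then run the original grep under a UTF-8 locale. The sketch below only demonstrates the conversion step; the sample file and its contents are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data: "era" and "é" stored as ISO-8859-1
# (é is the single byte 0xE9, octal 351).
tmpdir=$(mktemp -d)
printf 'era\n\351\n' > "$tmpdir/words_latin1.txt"

# Re-encode to UTF-8: -f names the source encoding, -t the target encoding.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/words_latin1.txt" > "$tmpdir/words_utf8.txt"

# The converted file is one byte longer because é became 0xC3 0xA9.
latin1_size=$(wc -c < "$tmpdir/words_latin1.txt" | tr -d ' ')
utf8_size=$(wc -c < "$tmpdir/words_utf8.txt" | tr -d ' ')
echo "ISO-8859-1: $latin1_size bytes, UTF-8: $utf8_size bytes"

rm -rf "$tmpdir"
```

After the conversion, the grep commands from the question should work as long as the shell runs in a UTF-8 locale (for example en_US.UTF-8, if that locale is installed on your system).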