grep/regex can't find accented word

Question

I'm trying to write a regex that picks out the words in a file whose letters all come from the letters of a given word.

My problem is that the regex can't find accented words, and my text file contains a lot of them.

My command line is:

cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

And the content of file is:

carroça
éra
éssa
roça
roco
rato
onça
orça
roca

How can I fix it?

What is the output of `locale`? What is the encoding of `input/words.txt`? — ephemient, Jan 19 '11 at 19:07
It works for me, but maybe the problem is with your syntax: square brackets are used to define groups of characters, so at least the second line is definitely wrong. Try: grep '^carroça\{1,3\}$' — UncleZeiv, Jan 19 '11 at 19:11
@UncleZeiv, I had put the regex wrong, now I edited with the correct. — GodFather, Jan 19 '11 at 19:15
@ephemient, the locale is: LANG="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_CTYPE="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_ALL= . The encoding of input/words is ISO-8859-1 — GodFather, Jan 19 '11 at 19:18
ok I see what you want to do, I was just saying that repeating characters in the regex doesn't look right, I would have put `[caroç]` but yours works as well. — UncleZeiv, Jan 19 '11 at 19:19
It may work but it's still "wrong" in the sense that the regex doesn't really say what you're trying to do. It *looks* like you're trying to match the word `carroça` but it *says* to match any sequence of 1 to 7 of the letters listed. Zeiv's shorter `[caroç]` is indeed better. Both will match `carroça` and will also match `roca` and `orça` etc. but will not match `éssa` or `éra`. I point this out only because it seems you *might* not be entirely clear on what the square brackets do in regex. — Stephen P, Jan 19 '11 at 19:30
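
To illustrate the point about square brackets made in the comments above, here is a minimal sketch. It assumes a small ASCII-only word list (not the asker's file) and runs grep with `LC_ALL=C`, so encoding plays no part. Repeating a letter inside the brackets changes nothing: both patterns accept any word of 1 to 7 letters drawn from c, a, r and o.

```shell
# Sample words (ASCII only, so the result does not depend on any encoding).
words='carroca
roca
roco
rato
orca
essa'

# Bracket expression with repeated letters, as written in the question.
with_repeats=$(printf '%s\n' "$words" | LC_ALL=C grep '^[carroca]\{1,7\}$')

# The same set with each letter listed once.
without_repeats=$(printf '%s\n' "$words" | LC_ALL=C grep '^[caro]\{1,7\}$')

# Both variables hold the same four words; "rato" and "essa" are rejected
# because they contain letters outside the set.
printf '%s\n' "$with_repeats"
```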

score 12 · Accepted Answer · answered Jan 19 '11 at 19:26

If your file is encoded in ISO-8859-1 but your system locale is UTF-8, this will not work.

Either convert the file to UTF-8 or change your system locale to ISO-8859-1.

# convert from ISO-8859-1 to the environmental locale before grepping
# output will be in the current locale
$ iconv -f 8859_1 input/words.txt | grep ...

# run grep with an ISO-8859-1 locale
# output will be in ISO-8859-1 encoding
$ cat input/words.txt | env LC_ALL=en_US grep ...
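
A rough sketch of why the mismatch matters, assuming `od` and `iconv` are available. The helper name `hex_bytes` is made up for this illustration. The letter é is a two-byte sequence in UTF-8 but a single byte in ISO-8859-1, so a pattern typed in a UTF-8 terminal does not contain the byte that is actually in the file.

```shell
# Hypothetical helper: print the bytes read from stdin as space-separated hex.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# "éra" written as explicit UTF-8 bytes (octal escapes), so the result does
# not depend on how this script itself is saved.
utf8_word=$(printf '\303\251ra' | hex_bytes)

# The same word after conversion to ISO-8859-1.
latin1_word=$(printf '\303\251ra' | iconv -f UTF-8 -t ISO-8859-1 | hex_bytes)

echo "UTF-8:      $utf8_word"
echo "ISO-8859-1: $latin1_word"
```

The first line shows four bytes and the second shows three, which is the difference that `iconv` removes before grep sees the data.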

answered Jan 19 '11 at 19:26

ephemient


Dude, the first option "iconv" works. Thanks. The output now is carroça roça roco orça roca car raa – GodFather Jan 19 '11 at 19:39

score 2 · Answer 2 · edited Mar 20 '17 at 10:18

I found a related question here that seems to work.

So if you try something like:

cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt

Does that produce what you expect?

edited Mar 20 '17 at 10:18

Community


answered Jan 19 '11 at 19:18

dule


unfortunately no, the output is the same. – GodFather Jan 19 '11 at 19:22
forgot to escape the \, so they weren't showing up in the post – dule Jan 19 '11 at 19:27
in these cases just add some space at the front and the line will format as code, which is more readable and doesn't need escapes. I've done this for you here – UncleZeiv Jan 20 '11 at 09:58

score 1 · Answer 3 · answered Jan 19 '11 at 21:51

Assuming everything is UTF-8, I’d usually just use something like

perl -CSAD -Mutf8 -nle 'print if /^carroça{1,3}$/' filenames

because then I know what it’s doing.

answered Jan 19 '11 at 21:51

tchrist


The comments (eventually) make it clear that not everything is UTF-8, though. – ephemient Jan 19 '11 at 22:33
@ephemient Encoding tribulations seem to be endless, don’t they? – tchrist Jan 19 '11 at 22:51

score 0 · Answer 4 · edited Jan 20 '11 at 09:58

Try as @dule said, but with LANG=en_US.iso88591:

cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt
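
Whether a locale such as `en_US.iso88591` exists depends on the machine, so it may be worth checking first. This is only a sketch: the helper name `pick_latin1_locale` is invented here, and it recognises ISO-8859-1 locales purely by the spelling of their names in `locale -a` output, which varies between systems.

```shell
# Hypothetical helper: read a locale list (one name per line, as printed by
# `locale -a`) on stdin and print the first name ending in an ISO-8859-1
# codeset suffix, if there is one.
pick_latin1_locale() {
    LC_ALL=C grep -i -E '\.iso-?8859-?1$' | head -n 1
}

latin1_locale=$(locale -a 2>/dev/null | pick_latin1_locale)

if [ -n "$latin1_locale" ]; then
    echo "ISO-8859-1 locale available: $latin1_locale"
else
    echo "no ISO-8859-1 locale found by name; converting with iconv is the fallback"
fi
```

If a name is printed, it can be used in place of `en_US.iso88591` in the command above.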

edited Jan 20 '11 at 09:58

answered Jan 19 '11 at 19:24

UncleZeiv


@ephemient: I found it using `locale -a` and this was tested on my machine and it works, after reproducing the same situation as GodFather's. – UncleZeiv Jan 19 '11 at 19:35
Depends on how the system was set up (possibly influenced by `/etc/locale.gen`) but having named ISO-8859-1 locales is not common in Linux distributions anymore. – ephemient Jan 19 '11 at 19:56
@ephemient: I see; indeed, I'm working on a very old Linux distribution – UncleZeiv Jan 20 '11 at 09:59

score 0 · Answer 5 · answered Jan 13 '20 at 22:02

My problem is, the regex can't find accented words, but in my text file there are a lot of accented words.

My command line is:
cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

[...]
How can I fix it?

Grep interprets both the pattern and the file contents according to your current locale settings. If the file is stored in a different encoding, an accented letter in the file is a different byte sequence from the same letter in the pattern, so it cannot match.

It gets worse if your words.txt files are encoded in UTF-8, UTF-16, UTF-32, or ISO-8859-1 (Latin-1) and that encoding is not the one your locale expects.

To handle all such encodings, use ugrep instead of grep to process files encoded in UTF and to match Unicode patterns. Note that ugrep uses extended regex syntax by default, so the braces are written without backslashes:

cat input/words.txt | ugrep '^[éra]{1,4}$' > output/words_era.txt
cat input/words.txt | ugrep '^[carroça]{1,7}$' > output/words_carroca.txt

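This produces output encoded in UTF-8. If the input files are encoded in ISO-8859-1, tell ugrep so with the option --encoding=ISO-8859-1 (the early releases available when this was written also accepted the short form -QISO-8859-1). Either way, the ugrep output is always UTF-8.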
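
Before picking an encoding option, it can help to test whether the file is valid UTF-8 at all. A minimal sketch, assuming an `iconv` that exits with a non-zero status on invalid input (glibc and GNU libiconv do). The helper name `is_valid_utf8` is made up for this example.

```shell
# Hypothetical helper: succeed only if stdin decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "éra" as ISO-8859-1 bytes: the lone byte 0xE9 is not valid UTF-8.
if printf '\351ra\n' | is_valid_utf8; then
    latin1_check=valid
else
    latin1_check=invalid
fi

# "éra" as UTF-8 bytes.
if printf '\303\251ra\n' | is_valid_utf8; then
    utf8_check=valid
else
    utf8_check=invalid
fi

echo "ISO-8859-1 sample: $latin1_check"
echo "UTF-8 sample:      $utf8_check"
```

On a real file this would be `is_valid_utf8 < input/words.txt`. A failure shows the file is not UTF-8; a success does not prove it is, since plain ASCII passes too.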
