7

I'm trying to build a regex that picks out the words in a file whose letters all come from the letters of a given word.

My problem is that the regex can't find accented words, and my text file contains a lot of them.

My command line is:

cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

And the content of file is:

carroça
éra
éssa
roça
roco
rato
onça
orça
roca

How can I fix it?

tchrist
GodFather
  • What is the output of `locale`? What is the encoding of `input/words.txt`? – ephemient Jan 19 '11 at 19:07
  • It works for me, but maybe the problem is with your syntax: square brackets are used to define groups of characters, so at least the second line is definitely wrong. Try: grep '^carroça\{1,3\}$' – UncleZeiv Jan 19 '11 at 19:11
  • @UncleZeiv, I had written the regex wrong; I have now edited the question with the correct one. – GodFather Jan 19 '11 at 19:15
  • @ephemient, the locale is: LANG="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_CTYPE="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_ALL= . The encoding of input/words is ISO-8859-1 – GodFather Jan 19 '11 at 19:18
  • OK, I see what you want to do. I was just saying that repeating characters in the regex doesn't look right; I would have put `[caroç]`, but yours works as well. – UncleZeiv Jan 19 '11 at 19:19
  • It may work but it's still "wrong" in the sense that the regex doesn't really say what you're trying to do. It *looks* like you're trying to match the word `carroça` but it *says* to match any sequence of 1 to 7 of the letters listed. Zeiv's shorter `[caroç]` is indeed better. Both will match `carroça` and will also match `roca` and `orça` etc. but will not match `éssa` or `éra`. I point this out only because it seems you *might* not be entirely clear on what the square brackets do in regex. – Stephen P Jan 19 '11 at 19:30
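
The point made in the comments above, that a bracket expression is a set of characters and repeating a letter changes nothing, can be checked with a small sketch. The word list here is a hypothetical ASCII-only sample (not the asker's file), and `LC_ALL=C` is used only so the result does not depend on the machine's locale:

```shell
# Hypothetical ASCII-only sample, so the result is the same in any locale.
words=$(printf '%s\n' carro roca roco rato)

# A bracket expression is a set: listing a letter twice adds nothing.
long=$(printf '%s\n' "$words" | LC_ALL=C grep '^[carroca]\{1,7\}$')
short=$(printf '%s\n' "$words" | LC_ALL=C grep '^[caro]\{1,7\}$')

# Print the selected words, then confirm both patterns agree.
printf '%s\n' "$long"
[ "$long" = "$short" ] && echo "both patterns select the same words"
```

Here `rato` is rejected by both patterns because `t` is not in the set, while the other three words consist only of listed letters.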

5 Answers

12

If your file is encoded in ISO-8859-1 but your system locale is UTF-8, this will not work.

Either convert the file to UTF-8 or change your system locale to ISO-8859-1.

# convert from ISO-8859-1 to the current locale's encoding before grepping
# output will be in the current locale
$ iconv -f ISO-8859-1 input/words.txt | grep ...

# run grep with an ISO-8859-1 locale
# output will be in ISO-8859-1 encoding
$ cat input/words.txt | env LC_ALL=en_US grep ...
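
As a rough illustration of the first option, the sketch below builds a small ISO-8859-1 file and shows what `iconv` does to its bytes. The scratch file and its contents are hypothetical stand-ins for `input/words.txt`; the octal escapes `\351` and `\347` are the ISO-8859-1 bytes for é and ç:

```shell
# Hypothetical scratch file holding "éra" and "roça" in ISO-8859-1
# (é is the byte \351, ç is the byte \347).
tmp=$(mktemp)
printf '\351ra\nro\347a\n' > "$tmp"

# Convert to UTF-8 and dump the result as hex; the input file is left untouched.
utf8_hex=$(iconv -f ISO-8859-1 -t UTF-8 "$tmp" | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"

rm -f "$tmp"
```

After conversion each accented letter occupies two bytes (`c3 a9` for é, `c3 a7` for ç), which is the form a grep running in a UTF-8 locale expects.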
ephemient
2

I found a related question here that seems to work.

So if you try something like:

cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt

Does that produce what you expect?
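
One caveat may apply here: in the C locale grep compares single bytes, so the é in the pattern has to be the same byte as the é in the file. A pattern typed in a UTF-8 terminal carries é as two bytes, which would not line up with an ISO-8859-1 file. The sketch below sidesteps that by building the pattern with `printf`; the sample file is hypothetical, and `LC_ALL` is used instead of `LANG` only because it takes precedence over any other locale variable that happens to be set:

```shell
# Hypothetical ISO-8859-1 sample: é is the single byte \351 (0xE9).
sample=$(mktemp)
printf '\351ra\nrato\nara\n' > "$sample"

# Build the pattern with the same byte, independent of the terminal's encoding.
pattern=$(printf '^[\351ra]\\{1,4\\}$')

# In the C locale grep works byte by byte; -c counts the matching lines.
matches=$(LC_ALL=C grep -c "$pattern" "$sample")
echo "$matches"

rm -f "$sample"
```

The first and third lines consist only of the bytes listed in the bracket expression, so they are counted; `rato` is not.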

Community
dule
1

Assuming everything is UTF-8, I’d usually just use something like

perl -Mutf8 -CSAD -nle 'print if /^carroça{1,3}$/' filenames

because then I know what it’s doing.

tchrist
0

Try as @dule said, but with LANG=en_US.iso88591:

cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt
UncleZeiv
  • @ephemient: I found it using `locale -a` and this was tested on my machine and it works, after reproducing the same situation as GodFather's. – UncleZeiv Jan 19 '11 at 19:35
  • Depends on how the system was set up (possibly influenced by `/etc/locale.gen`) but having named ISO-8859-1 locales is not common in Linux distributions anymore. – ephemient Jan 19 '11 at 19:56
  • @ephemient: I see; indeed, I'm working on a very old Linux distribution – UncleZeiv Jan 20 '11 at 09:59
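
Since the comments above note that named ISO-8859-1 locales are often not installed, it may be worth checking before relying on one. This is a minimal sketch built on `locale -a`; the name pattern is an assumption that covers spellings such as `en_US.iso88591` and `en_US.ISO-8859-1`, and it will miss locales that are ISO-8859-1 only implicitly (for example a bare `en_US`):

```shell
# List installed locales whose name ends in a common ISO-8859-1 spelling.
latin1_locales=$(locale -a 2>/dev/null | grep -i -E 'iso-?8859-?1$' || true)

if [ -n "$latin1_locales" ]; then
    status=available
    printf 'ISO-8859-1 locales found:\n%s\n' "$latin1_locales"
else
    status=missing
    echo "no named ISO-8859-1 locale found; converting the file with iconv may be easier"
fi
```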
0

Grep reads these files as a stream of bytes and interprets those bytes according to your current locale settings, so the encoding of the file has to agree with the locale.

It gets worse when words.txt is encoded in something the locale does not expect, whether that is UTF-8, UTF-16, UTF-32, or ISO-8859-1 (Latin-1).

To handle all such encodings, use ugrep instead of grep; it processes files in the UTF encodings and matches Unicode patterns:

cat input/words.txt | ugrep -G '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | ugrep -G '^[carroça]\{1,7\}$' > output/words_carroca.txt

This produces output encoded in UTF-8. If the input files are encoded in ISO-8859-1, run ugrep with the option --encoding=ISO-8859-1 (older releases spelled it -QISO-8859-1). The ugrep output is still UTF-8 in that case.

Dr. Alex RE