Find all accented words (diacriticals) using grep?

Question

I have a large list of words in a text file (one word per line). Some words have accented characters (diacriticals). How can I use grep to display only the lines that contain accented characters?

That returns this error: usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color=when] [--context[=num]] [--directories=action] [--label] [--line-buffered] [--null] [pattern] [file ...] — R OMS, Oct 18 '17 at 16:51
Looking at [this](https://stackoverflow.com/questions/20690499/concrete-javascript-regex-for-accented-characters-diacritics) answer, you might just match a range of unicode characters `[\u00C0-\u017F]` — Mako212, Oct 18 '17 at 16:52

score 3 · Answer 1 · answered May 07 '20 at 10:53

The best solution I have found, which answers the broader question "Which words are not pure ASCII?", is to use PCRE through grep's -P option:

grep -P "[\x7f-\xff]" filename

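This will find accented characters in UTF-8 and in ISO-8859-1 or ISO-8859-15 (Latin1, win1252, cp850) alike, since all of these encodings represent accented characters with bytes above \x7f.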
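
As a rough illustration of the command above, the sketch below builds a small sample word list and runs the byte-range search against it. The file name `words.txt` and its contents are made up for the example. `LC_ALL=C` is added on the assumption that the range should be compared byte by byte, and `-P` assumes a grep built with PCRE support (typically GNU grep); not every grep accepts that option.

```shell
# Build a small sample word list (hypothetical file name and contents).
# The octal escapes are the UTF-8 bytes of "é" (\303\251) and "ï" (\303\257).
printf 'cafe\ncaf\303\251\nnaive\nna\303\257ve\n' > words.txt

# Keep only the lines that contain at least one byte in the range 0x7f-0xff.
# LC_ALL=C makes grep treat the input as raw bytes rather than UTF-8 characters.
accented=$(LC_ALL=C grep -P '[\x7f-\xff]' words.txt)
printf '%s\n' "$accented"
```

With this sample, the two accented entries should be the only lines printed; the plain ASCII words are skipped because every one of their bytes is below \x7f.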

LSerni

A side note: in the ASCII table, A-Z and a-z are represented by \x41-\x5a and \x61-\x7a respectively. Thus `grep -Po "[^\x41-\x5a\x61-\x7a]"`, which matches every character except A-Za-z, also works. – Culip Aug 10 '20 at 18:57
Beware that this sometimes doesn't work with file arguments (even though it does when grep is reading from stdin). You may need to put `LC_ALL=C` in front to make it work: `LC_ALL=C grep -P "[\x7f-\xff]" filename` – mivk Nov 21 '21 at 18:56

score 1 · Answer 2 · answered Oct 18 '17 at 16:58

I have a solution: first strip the accents using "iconv", then "diff" the result against the original file and keep only the lines that come from the original:

iconv -f utf8 -t ascii//TRANSLIT text-file > noaccents-file
diff text-file noaccents-file | grep '^<'
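
To make the two steps concrete, here is a minimal sketch run against a made-up three-word list. The sample contents and the extra `sed` step that strips the diff marker are assumptions added for illustration. The exact character that `//TRANSLIT` substitutes can vary between iconv implementations and locales, but the approach only relies on the converted line differing from the original.

```shell
# Sample word list (hypothetical contents); \303\251 is "é" encoded in UTF-8.
printf 'cafe\ncaf\303\251\nnaive\n' > text-file

# Transliterate to ASCII; lines without accents pass through unchanged.
iconv -f UTF-8 -t ASCII//TRANSLIT text-file > noaccents-file

# diff prefixes lines that exist only in the original with "< ".
# Keep those lines and strip the marker to recover the accented words.
changed=$(diff text-file noaccents-file | grep '^<' | sed 's/^< //')
printf '%s\n' "$changed"
```

Only the accented entry should survive, because it is the only line that iconv had to change.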

R OMS
