I'm downloading BibTeX entries, but they often contain characters that do not show up in the output PDF, e.g. àèìòùáéíóúäëïöüâêîôûÿøÀÈÌÒÙÁÉÍÓÚÄËÏÖÜ. This list isn't comprehensive; there are other foreign letters that I cannot type.

I've tried things like

grep -nP '[^a-zA-Z0-9\/,=!@#$%^&*()_]' ~/Documents/Library.bib

but there has got to be an easier way than this.

How can I use grep or a Perl regex to find any character I cannot type on my keyboard (is that the same as anything outside ASCII?). For example, if an entry contains an "n" with an accent, is there some way I can detect it?

con
  • Information about the keyboard layout depends on your operating system. Do you use Linux, BSD, or a Mac? – Socowi Apr 05 '17 at 17:20
  • @Socowi this is Ubuntu Linux – con Apr 05 '17 at 17:20
  • Which encoding is the `Library.bib` file in? – Håkon Hægland Apr 05 '17 at 17:25
  • @HåkonHægland text/plain; charset=utf-8 – con Apr 05 '17 at 17:29
  • *"there are random characters which will not show up in the output PDF"* That should work. It's very rare to find a font that doesn't include the accented European characters. *"How can I grep or use perl regex for any character I cannot type on the keyboard"* I don't understand. Why do you want to be able to type them? They're already in the data. Surely you want to fix them not appearing in the PDF? Also, you can type any Unicode character in bash by Ctrl/Shift/U followed by the four-digit hex codepoint. – Borodin Apr 05 '17 at 17:33
  • That would depend on your keyboard. What characters can your keyboard type? – ikegami Apr 05 '17 at 17:34
  • If it just accepts 4 hex characters, it actually covers only 1/17th of Unicode Code Points. – ikegami Apr 05 '17 at 17:35
  • There is also a POSIX character class `[:ascii:]` that matches any character in the ASCII character set – Håkon Hægland Apr 05 '17 at 17:38
  • `[[:ascii:]]` is equivalent to `[\x00-\x7F]`, so it includes control chars. – ikegami Apr 05 '17 at 17:39
  • @HåkonHægland Thank you! That works perfectly. I'd accept it as the answer if you wrote it in the answer area: `grep -P [^[:ascii:]] ~/Documents/Library.bib` – con Apr 05 '17 at 17:46

1 Answer

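You can use the POSIX-style character class [:ascii:], which matches any character in the ASCII character set. It is a Perl/PCRE extension rather than one of the classes POSIX itself defines, which is why the commands below use grep's -P option. To print all lines with non-ASCII characters: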

grep -nP '[^[:ascii:]]' ~/Documents/Library.bib

or, to also highlight the non-ASCII characters:

grep --color=auto -nP '[^[:ascii:]]' ~/Documents/Library.bib
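
As noted in the comments, [[:ascii:]] is equivalent to [\x00-\x7F], so the same search can also be written with an explicit range. The following is only a minimal sketch, assuming GNU grep built with PCRE support (-P); the temporary file and its contents are hypothetical stand-ins for ~/Documents/Library.bib, and the variable names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample file standing in for ~/Documents/Library.bib.
bib=$(mktemp)
# \303\245 and \303\246 are the UTF-8 bytes of "å" and "æ".
printf 'author = {Hakon Haegland},\nauthor = {H\303\245kon H\303\246gland},\ntitle = {Plain ASCII},\n' > "$bib"

# Explicit-range form: match anything outside 0x00-0x7F.
range_hits=$(grep -nP '[^\x00-\x7F]' "$bib")

# Character-class form from the answer, kept for comparison.
class_hits=$(grep -nP '[^[:ascii:]]' "$bib")

# Print the matching lines, each prefixed with its line number.
printf '%s\n' "$range_hits"

rm -f "$bib"
```

Both forms should report the same lines. Either one also flags control characters as ASCII, so they will not be reported.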

See Character Classes and Bracket Expressions in the GNU grep manual and POSIX Bracket Expressions at Regular-Expression.info for more information.

See also: How do I grep for all non-ASCII characters in UNIX

Håkon Hægland