
I have a file containing filenames that look like this:
"aaa.ext"
"abc"
"a1a.ext"
"béa"
"pàt"
"ff#!"
"toto & #128;.pdf"
"..."

I need to extract the lines that contain both standard English alphanumeric characters (A-Z, a-z, 0-9, _ and .) and other characters.

For the example above, the output should be:
"béa" (contains é instead of e)
"pàt" (contains à instead of a)
"ff#!"
"toto & #128;.pdf"

Any ideas?

Thanks in advance

BNT

1 Answer

Try

LC_ALL=C.UTF-8 grep '[A-Za-z0-9_.]' yourFile |
LC_ALL=C.UTF-8 grep '[^A-Za-z0-9_.]'

which can also be written as

(export LC_ALL=C.UTF-8; grep -P '[\w.]' yourFile | grep -P '[^\w.]')

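Setting LC_ALL=C.UTF-8 ensures that A-Z matches only the standard English letters and not letters like é. Whether \w in the -P variant is limited to ASCII may depend on the grep version, so the explicit bracket form is the safer of the two.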
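As a quick check, here is a minimal sketch that runs the two-stage pipeline on the sample list from the question. The temporary file and the variable names `list` and `mixed` are only there for the demo, and the filenames are written without the surrounding quotes, which is an assumption about the real file.

```shell
# Sample list from the question, one filename per line (demo file only).
list=$(mktemp)
printf '%s\n' 'aaa.ext' 'abc' 'a1a.ext' 'béa' 'pàt' 'ff#!' 'toto & #128;.pdf' '...' > "$list"

# Stage 1 keeps lines with at least one allowed character;
# stage 2 keeps those that also contain a character outside the set.
mixed=$(LC_ALL=C.UTF-8 grep '[A-Za-z0-9_.]' "$list" |
        LC_ALL=C.UTF-8 grep '[^A-Za-z0-9_.]')
printf '%s\n' "$mixed"
rm -f "$list"
```

The line `...` is dropped by the second stage because it contains nothing outside the allowed set.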

Note: In Unicode, é can be encoded either as a single precomposed character or as an e followed by a combining ´. If your file contains the following two lines (without the comments)

é # single character
é # combination of "e" and "´"

then the command above will return only

é # combination of "e" and "´"

The problem is a bit exotic and shouldn't cause much trouble.
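To reproduce this, the sketch below writes both encodings as explicit bytes, so the result does not depend on how an editor stores é. The file and variable names (`sample`, `hits`) are made up for the demo.

```shell
# Line 1: precomposed é (U+00E9, bytes C3 A9).
# Line 2: plain "e" followed by the combining acute accent (U+0301, bytes CC 81).
sample=$(mktemp)
printf '\303\251\ne\314\201\n' > "$sample"

# Same two-stage filter as above; only line 2 contains an ASCII letter.
hits=$(LC_ALL=C.UTF-8 grep '[A-Za-z0-9_.]' "$sample" |
       LC_ALL=C.UTF-8 grep '[^A-Za-z0-9_.]')
printf '%s\n' "$hits"
rm -f "$sample"
```

Only the decomposed form is printed, because the first grep finds a plain `e` in it, while the precomposed é contains no character from the allowed set.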

Socowi
  • Hi Socowi, +1 for the quick response. The suggested command highlights the special characters; however, it does not seem to exclude the lines which don't contain any – BNT Mar 07 '17 at 14:35
  • @BNT Strange... I tested both commands for your example and got the desired results. Can you make another example in which a line without special characters is accepted? – Socowi Mar 07 '17 at 14:43
  • here are a few more examples
    12 - Mémo.pdf
    2016-04-25 오후 7.59.12.jpg
    20161109133127734.pdf
    ~9963007Opoto.pdf
    In the above example, lines 2 and 4 should be retrieved; lines 1 and 3 should not.
    Thanks again
    – BNT Mar 07 '17 at 15:50
  • @BNT Thanks for the additional example. For me the command accepts line 1, 2, and 4, which seems fine. Why should `12 - Mémo.pdf` be rejected? It contains the standard letters `12Mmo.pdf` *and* the other characters `-é` (note: space is also an "other character"). What is accepted on your system? – Socowi Mar 08 '17 at 19:14
  • Thanks for your feedback. My aim is to remove all non-English characters from the filenames, as they will be used on a wide variety of systems. The only acceptable characters are A-Z, a-z, 0-9, _, - and . (all others should be sorted out) – BNT Mar 10 '17 at 10:52
  • @BNT Um, that's a completely different question ([this answer](http://stackoverflow.com/q/3264915/6770384) may be of interest to you). Please close this question and ask a new one. – Socowi Mar 11 '17 at 10:09
  • I might have explained my request in a wrong way. The requirement is that in the list of filenames contained in a text file, I need to extract the lines containing characters that are not within the English Alphanumeric set + the _ (underscore), the - (dash) and the . (dot). The answer you mentioned includes all ASCII so does not limit to only alphanumeric... – BNT Mar 13 '17 at 14:17
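
For the requirement as worded in the last comment (list every line containing a character outside A-Z, a-z, 0-9, _, - and .), a single negated bracket expression may be enough. This is only a sketch of that reading, not something proposed in the thread; the file and variable names (`names`, `flagged`) are hypothetical, and the sample lines are the ones from the earlier comment.

```shell
# Sample filenames from the comments above (demo file only).
names=$(mktemp)
printf '%s\n' '12 - Mémo.pdf' '2016-04-25 오후 7.59.12.jpg' \
              '20161109133127734.pdf' '~9963007Opoto.pdf' > "$names"

# The trailing "-" inside the brackets is a literal dash, not a range.
flagged=$(LC_ALL=C.UTF-8 grep '[^A-Za-z0-9_.-]' "$names")
printf '%s\n' "$flagged"
rm -f "$names"
```

Under this reading the first line is listed as well, since both the space and é fall outside the allowed set; only `20161109133127734.pdf` passes as clean.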