How to split a file containing non-ascii characters into words, in bash?

Question

For example, I have a file with normal text, like:

"Word1 Kuͦn, buͤtten; word4:"

I want to get a file with 1 word per line, keeping the punctuation, and ordered:

,
:
;
Word1
Kuͦn
buͤtten
word4

The code I use:

grep -Eo '\w+|[^\w ]' input.txt | sort -f >> output.txt

This code works almost perfectly, except for one thing: it splits diacritical characters apart from the letters they belong to, as if they were separate words:

    ,
    :
    ;
    Word1
    Ku
    ͦ      
    n
    bu 
    ͤ   
    tten
    word4

The letters uͦ, uͤ and others with the same diacritics are not in the ASCII table. How can I split my file correctly without deleting or replacing these characters?

Edit:

locale output:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

Can you please update your question including the output of `locale`? — jaypal singh, Sep 17 '14 at 19:46

rici · Accepted Answer · 2014-09-18T20:20:45.093

Unfortunately, U+0366 (COMBINING LATIN SMALL LETTER O) is not an alphabetic character. It is a non-spacing mark, Unicode general category Mn, which generally maps to the POSIX ctype cntrl.

Roughly speaking, an alphabetic grapheme is an alphabetic character possibly followed by one or more combining characters. It's possible to write that as a regex pattern if you have a regex library which implements Unicode general categories. Gnu grep is usually compiled with an interface to the popular pcre (Perl-compatible regular expression) library, which has reasonably good Unicode support. So if you have Gnu grep, you're in luck.

To enable "perl-like" regular expressions, you need to invoke grep with the -P option (or use pcregrep, the standalone tool distributed with pcre; pgrep is an unrelated process-lookup utility). However, that is not quite enough because by default grep will use an 8-bit encoding even if the locale specifies a UTF-8 encoding. So you need to put the regex system into "UTF-8" mode in order to get it to recognize your character encoding.

Putting all that together, you might end up with something like the following:

grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]'

-P      patterns are "perl-compatible"
-o      output each substring matched

(*UTF8) If the pattern starts with exactly this sequence,
        pcre is put into UTF-8 mode.
\p{...} Select a character in a specified Unicode general category
\P{...} Select a character not in a specified Unicode general category
\p{L}   General category L: letters
\p{N}   General category N: numbers
\p{M}   General category M: combining marks
\p{P}   General category P: punctuation
\p{S}   General category S: symbols
\p{L}\p{M}*       A letter possibly followed by various combining marks
\p{L}\p{M}*|\p{N} ... or a number

More information on Unicode general categories and Unicode regular expression matching in general can be found in Unicode Technical Report 18 on regular expression matching. But beware that the syntax described in that TR is a recommendation and is not exactly implemented by most regex libraries. In particular, pcre does not support the useful notation \p{L|N} (letter or number). Instead, you need to use [\p{L}\p{N}].

Documentation about pcre is probably available on your system (man pcre); if not, have a link on me.

If you don't have Gnu grep or in the unlikely case that your version was compiled without pcre support, you might be able to use perl, python or other languages with regex capabilities. However, doing so is surprisingly difficult. After some experimentation, I found the following Perl incantation which seems to work:

perl -CIO -lne 'print $& while /(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]/g'

Here, -CIO tells Perl that standard input and standard output are in UTF-8, and -lne is a standard incantation which means "automatically output new**l**ines after a print; loop through every li**n**e of the input, **e**xecuting the following in the loop".

Wow, I would never come up with this... Thank you! Though I tried the command, and yields error: `>> grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]' sample_corpus.clean.txt >> usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color=when] [--context[=num]] [--directories=action] [--label] [--line-buffered]` — user3241376, Sep 17 '14 at 21:30
@user3241376: Then you probably don't have Gnu grep, or you have a very old version. What does `grep -V` tell you? — rici, Sep 17 '14 at 21:35
@user3241376 I suppose you have Mac OS X. According to http://stackoverflow.com/questions/16658333/grep-p-no-longer-works-how-can-i-rewrite-my-searches support for `grep -P` was removed last year with Mac OS X 10.8. That answer has a perl alternative. — rici, Sep 17 '14 at 21:46
yep, forgot to mention, I do have a Mac OS X. grep -V outputs `grep (BSD grep) 2.5.1-FreeBSD` I will check out this answer, thnx! — user3241376, Sep 17 '14 at 21:56
@user3241376: Put a Perl incantation in the answer. Hope it works for you. I couldn't figure out how to do it with Python and I gave up before I got too frustrated. If I were you, I'd just install Gnu grep. — rici, Sep 18 '14 at 20:22
thanks for trying! I first also tried to do it with Python! But then I remembered that I actually have Ubuntu via VirtualBox! So I put your command there and it worked miraculously :) Thank you again ! — user3241376, Sep 18 '14 at 22:12
