Let's say "textfile" contains the following:

lorem$ipsum-is9simply the.dummy text%of-printing

and that you want to print each word on a separate line. However, words should be defined not only by spaces, but by all non-alphanumeric characters. So the results should look like:

 lorem
 ipsum
 is9simply
 the
 dummy
 text
 of
 printing

How can I accomplish this using the Bash shell?



Some notes:

  • This is not a homework question.

  • The simpler case, where words are delimited only by whitespace, is easy. Just writing:

    for i in $(cat textfile); do echo "$i"; done
    

    will do the trick, and return:

     lorem$ipsum-is9simply
     the.dummy
     text%of-printing
    

    For splitting words on non-alphanumeric characters, I have seen solutions that use the IFS shell variable (links below), but I would like to avoid IFS for two reasons: 1) it would require (I think) setting IFS to a long list of non-alphanumeric characters; 2) I find it kind of ugly.

  • Here are the two related Q&As I found:
    How do I split a string on a delimiter in Bash?
    How to split a line into words separated by one or more spaces in bash?

Sv1
2 Answers


Use the tr command:

tr -cs 'a-zA-Z0-9' '\n' <textfile

The '-c' option takes the complement of the specified set, so every character that is not alphanumeric gets translated; the '-s' option squeezes each run of the replacement character into a single one; 'a-zA-Z0-9' is the set of alphanumeric characters (maybe add _ too?); and '\n' is the replacement character (newline). You could also use a character class, which is locale sensitive (and may include more characters than the list above):

tr -cs '[:alnum:]' '\n' <textfile
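
As a minimal sketch of how the command above might be used from a script, the following wraps it in a small function. The function name `split_words` and the temporary sample file are assumptions made for illustration, and the `grep -v '^$'` step is an addition: it drops the single empty line that `tr` would otherwise emit when the input begins with a separator.

```shell
# split_words is a hypothetical helper name, not part of the original answer.
# It prints every run of alphanumeric characters in the file "$1" on its own line.
split_words() {
  # -c: complement the set, so all non-alphanumeric characters are translated
  # -s: squeeze each run of resulting newlines into one
  # grep -v '^$' removes an empty first line caused by a leading separator
  tr -cs '[:alnum:]' '\n' < "$1" | grep -v '^$'
}

# Sample input taken from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' > "$sample"

split_words "$sample"

rm -f "$sample"
```

Because the function writes one word per line, its output can be piped straight into other line-oriented tools such as `sort` or `uniq -c`.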
Jonathan Leffler
  • Perfect, this is exactly what I was after. Thanks! (I'm sorry I don't have enough reputation to vote up your answer) – Sv1 Sep 24 '10 at 23:03
  • @Sv1: You will probably have a high reputation soon. I voted your question up because of how well you documented what you wanted and for all the research you had done on it. – grok12 Jun 26 '11 at 18:00
  • What if you have decimal numbers? – Leyu Mar 20 '12 at 06:49
  • @Leyu: add the extra characters to the set that is retained, with the hyphen last so that it is taken literally rather than as a range: `tr -cs '[:alnum:]+.-' '\n' < textfile`. Of course, that will allow through full stops and ellipsis and dashed lines, etc. But it will also allow through +1.23 and -1.24e-23, etc. – Jonathan Leffler Mar 20 '12 at 09:13
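$ awk -f splitter.awk < textfile

$ cat splitter.awk
{
  # Split the line on runs of non-alphanumeric characters.
  count0 = split($0, asplit, "[^a-zA-Z0-9]+")
  # A leading or trailing separator yields an empty element; skip those.
  for (i = 1; i <= count0; ++i) { if (asplit[i] != "") print asplit[i] }
}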
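
For a quick test without a separate script file, the same idea can be sketched as a shell function that passes the awk program inline. The name `split_words_awk` is hypothetical, and two details are assumptions added here rather than part of the answer: the `+` in the regular expression, so that a run of separators counts as one, and the check that skips empty elements.

```shell
# split_words_awk is a hypothetical helper name used only for this sketch.
split_words_awk() {
  awk '{
    # Split the current line on runs of non-alphanumeric characters.
    n = split($0, parts, "[^a-zA-Z0-9]+")
    for (i = 1; i <= n; ++i)
      # Skip empty elements produced by leading or trailing separators.
      if (parts[i] != "") print parts[i]
  }' "$1"
}

# Sample input based on the question, with some doubled separators added.
sample=$(mktemp)
printf '%s\n' 'lorem$ipsum--is9simply  the.dummy text%of-printing' > "$sample"

split_words_awk "$sample"

rm -f "$sample"
```

With the sample line above, this prints the same eight words as in the question, one per line, even though some separators are doubled.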
DigitalRoss
  • thanks Ross! this is pretty cool, I've been meaning to get into the awk-universe :) – Sv1 Sep 28 '10 at 05:40