4

Is there a shell script that runs on a Mac to generate a word list from a text file, listing the unique words? Even better if it could sort by frequency...

Sorry, I forgot to mention: I'd prefer a bash one, as I'm using a Mac now...

Oh, and my file is in French (basically I'm reading a novel and learning French, so I'm trying to generate a word list to help myself). I hope this is not a problem?

casperOne
athos

3 Answers

3

If I understood you correctly, you need something like this:

cat <filename> | sed -e 's/ /\n/g' | sort | uniq -c
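
As the comments on this answer point out, the `sed` that ships with a Mac may not turn `\n` in the replacement into a newline. Here is a minimal sketch of the same idea using `tr` instead, assuming words are separated by single or repeated spaces. The `count_words` function name and the sample file are hypothetical, made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: count space-separated words without relying on sed's \n.
count_words() {
  # tr turns every space into a newline; grep drops the empty
  # lines that repeated spaces would otherwise leave behind.
  tr ' ' '\n' < "$1" | grep -v '^$' | sort | uniq -c
}

# Hypothetical sample input, just for trying the function out.
sample=$(mktemp /tmp/words.XXXXXX)
printf 'le chat et  le chien\n' > "$sample"

count_words "$sample"
```

Each output line is a count followed by a word. Punctuation stays attached to the words here, since only spaces are treated as separators.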
kofemann
  • You can probably dig into `sed`'s regex documentation to make the script a little more comprehensive, but that's how I'd do it, except you have to do hackery to actually get a newline, a la http://stackoverflow.com/a/7567839/4203 – Hank Gay Apr 30 '12 at 15:38
  • @athos See my comment. This is because you're on a Mac (I am, too), and I'm pretty sure Mac doesn't ship GNU `sed` (it's BSD-based). – Hank Gay Apr 30 '12 at 15:41
  • @athos It's easy enough to work around. In fact, if you mix-and-match the two answers already posted (to use `tr` instead of `sed`), you get `$ cat $YOUR_FILE | tr ' ' '\n' | sort | uniq -c`, and I think that gets you exactly what you want. – Hank Gay Apr 30 '12 at 15:46
2

This command will do it:

cat file.txt | tr "\"' " '\n' | sort -u

`sort -u` lists each unique word once, but it reportedly does not behave as expected on a Mac. In that case, use `sort | uniq -c` instead, which also prints a count next to each word. (Thanks to Hank Gay.)

cat file.txt | tr "\"' " '\n' | sort | uniq -c
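
To see the difference between the two endings of the pipeline, here is a small sketch that runs both on the same input. The function names `unique_only` and `with_counts` and the sample file are hypothetical:

```shell
#!/bin/sh
# Hypothetical sample input: one word per line, with a repeat.
sample=$(mktemp /tmp/words.XXXXXX)
printf 'un\ndeux\nun\n' > "$sample"

# sort -u: each distinct line once, no counts.
unique_only() { sort -u "$1"; }

# sort | uniq -c: each distinct line once, prefixed by how often it occurred.
with_counts() { sort "$1" | uniq -c; }

unique_only "$sample"
with_counts "$sample"
```

`uniq` only collapses adjacent duplicates, which is why the `sort` has to come first.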
Community
Shiplu Mokaddim
  • I think `sort | uniq -c` is a better option for finishing it off, because `sort -u` (on a Mac) will not show the words, just the counts. – Hank Gay Apr 30 '12 at 15:44
  • It works! But it needs a few modifications: 1. no word count, 2. not writing directly to an output file, 3. need to replace not only ( ) but also ('). But thanks, this is the way to go! – athos Apr 30 '12 at 15:49
  • How could I replace the double and single quotes with a newline? – athos Apr 30 '12 at 15:54
  • Thank you guys. I used a slightly different way: tr -cs "[:alpha:]" "\n" < file.txt | sort | uniq -c – athos Apr 30 '12 at 16:05
  • @HankGay: what version of Mac OS X? On my machine (OS 10.7.4), `/usr/bin/sort` is from GNU coreutils 5.93, and `-u` produces words, not counts. – chepner Jun 13 '12 at 15:39
  • @chepner I'm on Snow Leopard at the moment, although my `sort` version appears to match yours. Very strange. – Hank Gay Jun 13 '12 at 18:42
1

Just answering my own question to jot down the final version I'm using:

tr -cs "[:alpha:]" "\n" < FileIn.txt | sort | uniq -c | awk '{print $2","$1}' >> FileOut.csv

Some notes:

  • tr can do the replacement directly.
  • Since I'm interested in creating a word list for my French vocabulary, I used [:alpha:].
  • awk is used to insert a comma, so that the output is a CSV file, which is easier for me to upload...
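
The question also asked about sorting by frequency, which the pipeline above does not do. One possible extension is sketched below: it lowercases the words so that "Le" and "le" are counted together, then adds a numeric reverse sort on the counts. The `word_freq` function name and the sample file are hypothetical, and how accented letters are handled by `[:alpha:]` and the case conversion depends on the `tr` implementation and the locale, so the sample sticks to unaccented words:

```shell
#!/bin/sh
# Hypothetical helper: print "word,count" lines, most frequent word first.
word_freq() {
  tr -cs '[:alpha:]' '\n' < "$1" |   # every run of non-letters becomes one newline
    tr '[:upper:]' '[:lower:]' |     # fold case so "Le" and "le" match
    grep -v '^$' |                   # drop an empty first line, if any
    sort | uniq -c |                 # count each distinct word
    sort -rn |                       # highest count first
    awk '{print $2","$1}'            # reorder to word,count for CSV
}

# Hypothetical sample input, just for trying the function out.
sample=$(mktemp /tmp/words.XXXXXX)
printf 'Le chat. Le chien et le chat!\n' > "$sample"

word_freq "$sample"
```

Redirect the output with `> FileOut.csv` to get a file as in the original command. Words with equal counts may come out in a different relative order depending on the `sort` implementation.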

Thanks again to everyone for helping me.

Sorry I didn't make it clear at the beginning that I'm using a Mac and expected a bash script.

Cheers.

ThiefMaster
athos