I am new to grep and awk, and I would like the "frequency.txt" output file to contain tab-separated values. (The script looks at a large corpus and outputs each individual word and how many times it is used in the corpus; I modified it for the Khmer language.) I've looked around (e.g. "grep a tab in UNIX"), but I can't find an example that makes sense to me for this bash script (I'm too much of a newbie).

I am using this bash script in cygwin:

#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' \
    -e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
    -e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
    -e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
    -e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
  tr [:upper:] [:lower:] | \
  sort | \
  uniq -c | \
  sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'

Awk is printing with a comma, but that is only on-screen. How can I place a tab (a comma would work as well), between the frequency and the term?

Here's a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):

ព្រះ​វិញ្ញាណ​នឹង​ប្រពន្ធ​ថ្មោង​ថ្មី​ពោល​ថា អញ្ជើញ​មក ហើយ​អ្នក​ណា​ដែល​ឮ​ក៏​ថា អញ្ជើញ​មក​ដែរ អ្នក​ណា​ដែល​ស្រេក នោះ​មាន​តែ​មក ហើយ​អ្នក​ណា​ដែល​ចង់​បាន មាន​តែ​យក​ទឹក​ជីវិត​នោះ​ចុះ ឥត​ចេញ​ថ្លៃ​ទេ។

Here is an example output of frequency.txt as it is now (frequency and then term):

25605 នឹង
25043 ជា
22004 បាន
20515 នោះ

I want the output frequency.txt to look like this (where TAB is an actual tab character):

25605TABនឹង
25043TABជា
22004TABបាន
20515TABនោះ

Thanks for your help!

Nathan
  • It would be better if you provided a sample of what the corpus file `dictionary.txt` looks like, because I'm nearly certain you could replace your entire script with a single call to `awk`, i.e. there would be no use of `tr`, `sort`, `uniq`, `sed`, or `grep`. – SiegeX Feb 01 '11 at 00:33
  • I added a sample of the dictionary text file in the original question - thanks! – Nathan Feb 01 '11 at 00:43
  • @Nathan What encoding is that dictionary file using? It looks like trash on my screen: see http://i.imgur.com/Ao82s.png – SiegeX Feb 01 '11 at 00:48
  • @SiegeX it is in UTF-8 - Khmer Unicode isn't supported by a ton of stuff yet. – Nathan Feb 01 '11 at 00:53
  • @Nathan is `dictionary.txt` just a bunch of words, potentially more than one on a line, separated by white space? Or is it a list of words, one word per line? – SiegeX Feb 01 '11 at 00:54
  • @SiegeX it has both, a bunch of words and sometimes one word per line. Basically it is just a ton of Khmer writing pasted into one document (some from books, some from dictionaries etc.). It also has some English mixed in with it, hence the regex to remove English characters and some punctuation (both Khmer and English). – Nathan Feb 01 '11 at 00:59

3 Answers

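You should be able to replace most of the lengthy sed command with this (the two tr calls need to be piped together; note that GNU tr works on single bytes, so if your tr is not multibyte-aware, leave the non-ASCII characters to sed):

tr -d 'a-zA-Z0-9«»:;.,()?។”“|០១២៣៤៥៦៧៨៩-' |
tr '\t' ' '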

Comments:

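  • 's/​/ /g' - if the pattern between the first two slashes is really empty, sed re-uses the previous regular expression, which was [a-zA-Z]; those characters were already deleted, so this would be a no-op. If it actually contains an invisible character (such as a zero-width space used as the word separator), the substitution is needed and has to be kept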
  • 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' - the pipe characters don't delimit alternatives inside square brackets, they are literal (and more than one is redundant). A cleaned-up version would be 's/[«»:;.,()?។”“|-]//g' (leaving one pipe in case you really want to delete them, and putting the hyphen last so that it is literal instead of forming a range such as )-?)
  • 's/ /\n/g' - earlier, you replaced tabs with spaces, now you're replacing the spaces with newlines
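To make the bracket-expression points above concrete, here is a minimal sketch on ASCII-only sample text. The function name `strip_punct` and the sample strings are made up for illustration and are not part of the original script.

```shell
# strip_punct is a hypothetical helper, not part of the original script.
# Inside [...] the pipe is an ordinary character, and the hyphen is placed
# last so that it is literal instead of forming a range.
strip_punct() {
  sed 's/[;:.,()?|-]//g'
}

printf 'a|b, (c-d)?\n' | strip_punct    # prints: ab cd

# With the hyphen in the middle, ")-?" becomes a range that (in the C locale)
# also covers the digits, so this prints an empty line.
printf '2011-02-01\n' | LC_ALL=C sed 's/[()-?]//g'
```

The second command is the reason the hyphen is better kept at the very start or end of the bracket expression.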

You should be able to have the tabs you want by inserting this in your pipeline right after the uniq:

sed 's/^ *\([0-9]\+\) /\1\t/'
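As a rough sketch of where that sed call sits in the pipeline: this assumes GNU sed (which is what Cygwin ships), since `\+` and `\t` are GNU extensions. The function name `tally` and the sample words are placeholders, not part of the original script.

```shell
# tally is a hypothetical helper: it reads one word per line on stdin and
# prints COUNT<TAB>WORD, most frequent first.
tally() {
  sort |
    uniq -c |
    sed 's/^ *\([0-9]\+\) /\1\t/' |   # "      3 b" becomes "3<TAB>b"
    sort -rn
}

printf 'b\na\nb\nc\nb\na\n' | tally
```

In the real script the output of the last `sort -rn` would be redirected to frequency.txt as before.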

If you want the AWK command to output a tab:

awk 'BEGIN{OFS="\t"} {print $2, $1}'
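For the awk route, a small sketch that turns the "count term" lines from the question into tab-separated lines. The function name `to_tsv` is made up for illustration, and it prints the count first (`$1, $2`), which is the order the question asks for; swap the fields to get the term first.

```shell
# to_tsv is a hypothetical helper: it re-emits "COUNT WORD" lines as
# COUNT<TAB>WORD. awk's default field splitting ignores the leading
# spaces that uniq -c adds.
to_tsv() {
  awk 'BEGIN { OFS = "\t" } { print $1, $2 }'
}

printf '  25605 នឹង\n  25043 ជា\n' | to_tsv
```

Note that OFS is only used when the fields in `print` are separated by commas; `print $1 $2` would concatenate them with nothing in between.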
Dennis Williamson
  • Thank you Dennis. I am having trouble with adding tabs as you said, when I add sed 's/^ *\([0-9]\+\) /\1\t/' after the uniq the script stops at the end and never populates my frequency.txt. Am I understanding you correctly that I just add sed 's/^ *\([0-9]\+\) /\1\t/' one line below the uniq? – Nathan Feb 01 '11 at 04:00
  • @Nathan: Yes, you need to add the necessary pipe character(s). Just as you have `uniq -c | \` now, you would need `sed ... | \` (actually the line-continuation backslashes aren't needed since the pipe does line-continuation for you). – Dennis Williamson Feb 01 '11 at 04:08
  • Thanks Dennis, I've never worked with a bash script, so I was unfamiliar with the syntax. Thanks for taking time to help me on this! – Nathan Feb 01 '11 at 05:15

What about writing the awk output to a file with ">"?

Ratinho
  • Yes, that did work, but it is not ideal because now there is no on-screen status of anything going on (apart from my CPU monitor, I wouldn't know it was working). Is there another way? Thank you for this - at least it is possible. – Nathan Feb 01 '11 at 00:45

The following script should get you where you need to go. The pipe to tee will let you see output on the screen while at the same time writing the output to ./outfile

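#!/bin/sh

# Strip Latin letters, digits and punctuation line by line (newlines are kept
# so that words on adjacent lines are not joined together).
sed 's/[a-zA-Z0-9។០១២៣៤៥៦៧៨៩«»:;.,()?”“-]//g' < dictionary.txt |
gawk '{$0=toupper($0);for(i=1;i<=NF;i++)a[$i]++}
   END{for(item in a)printf "%d\t%s\n", a[item], item}' |
tee ./outfile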
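To see the tee behaviour in isolation, here is a minimal sketch on placeholder input. The function name `count_words` is made up, plain `awk` is used instead of `gawk`, and a temporary file stands in for ./outfile; none of these come from the original script.

```shell
# count_words is a hypothetical helper: it counts whitespace-separated words
# and prints COUNT<TAB>WORD lines.
count_words() {
  awk '{ for (i = 1; i <= NF; i++) seen[$i]++ }
       END { for (w in seen) printf "%d\t%s\n", seen[w], w }' |
    sort -rn   # for-in order is unspecified in awk, so sort explicitly
}

out=$(mktemp)   # stand-in for ./outfile
# tee copies its input both to the screen and to the file.
printf 'b a b\nc b a\n' | count_words | tee "$out"
```

Words are only split on spaces, tabs and newlines here, so a corpus that separates words with some other character would still need that character converted to a space first.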
SiegeX