I am new to grep and awk, and I would like to create tab separated values in the "frequency.txt" file output (this script looks at a large corpus and then outputs each individual word and how many times it is used in the corpus - I modified it for the Khmer language). I've looked around ( grep a tab in UNIX ), but I can't seem to find an example that makes sense to me for this bash script (I'm too much of a newbee).
I am using this bash script in cygwin:
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's// /g' -e 's/\t/ /g' \
-e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
-e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
-e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
-e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
tr [:upper:] [:lower:] | \
sort | \
uniq -c | \
sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
Awk is printing with a comma, but that is only on-screen. How can I place a tab (a comma would work as well), between the frequency and the term?
Here's a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):
ព្រះវិញ្ញាណនឹងប្រពន្ធថ្មោងថ្មីពោលថា អញ្ជើញមក ហើយអ្នកណាដែលឮក៏ថា អញ្ជើញមកដែរ អ្នកណាដែលស្រេក នោះមានតែមក ហើយអ្នកណាដែលចង់បាន មានតែយកទឹកជីវិតនោះចុះ ឥតចេញថ្លៃទេ។
Here is an example output of frequency.txt as it is now (frequency and then term):
25605 នឹង 25043 ជា 22004 បាន 20515 នោះ
I want the output frequency.txt to look like this (where TAB is an actual tab character):
25605TABនឹង 25043TABជា 22004TABបាន 20515TABនោះ
Thanks for your help!