How can I detect double letters in AWK? (telly -> tely in AWK) backreferences in the match term?

Question

I'm trying to turn telly into tely.

I've tried

awk 'BEGIN {f="telly" ;print gensub(/(.)\\1/,"\\1","g",f)}'

and

awk 'BEGIN {f="telly" ;print gensub(/(.)\1/,"\\1","g",f)}'

but I still get telly.

I'm pretty sure I can do this (backreferences in the match expression) in sed, and probably in Perl too. But I'm writing functions in awk because it makes processing multi-column data simpler than hacking out the columns in sed.

For example, I am running several different processes on a lexicon I'm working with.

Here is an example of some failed output. The third column for connoisseur should not have a double s or n.

otto ottô ottô o-tt--ô ottô 11025
hindu hindü hindö hind--ü hndü 11250
wearily weárílý weérélê weáríl--ý wrlý 11251
nora nørá nøré nør--á nrá 11252
formulate før#mûlâtè fømûlât før#mûlât--è fr#mltè 11253
embryo embrýô embrêô e-mbr--ýô embrýô 11254
stylish stŷliŝħ stîliŝ stŷliŝ--ħ stlŝħ 11255
eruption ėrupţìòn irupŝn ė-rupţìòn ėrpţn 11256
authoritarian auπħorítã#rïán auπoréte#rêén au-πħorítã#rïán auπrt#rn 11258
untouched untóùĉħèð untéĉt u-ntóùĉħèð untĉð 11425
penry penrý penrê penr--ý pnrý 11625
maze mâzè mâz mâz--è mzè 11725
forge før#ĝè føj før#ĝ--è fr#ĝè 11825
ferrari fèŕrārï fŕrārê fèŕrār--ï frrï 12511
assailant ássâìlánt éssâlént á-ssâìlánt ásslnt 25011
corrosive còŕr0ôsivè cŕôsiv còŕr0ôsiv--è cr0svè 25111
daimler dâìmlèŕ dâmlŕ dâìml--èŕ dmlèŕ 25311
connoisseur connoíssèùŕ connoéssŕ connoíss--èùŕ cnnssèùŕ 25511
airframe ãìŕfrâmè eŕfrâm ãìŕ-frâm--è ãìŕfrmè 25911
ampersand ampèŕsand ampŕsand a-mpèŕsand ampsnd 62511

The input is three or four columns per line, and I want to process it field by field rather than line by line, hence the use of awk.

Just for info, here is a tiny snippet of the input:

,"accepted","acçeptėd","1118"
,"ellis","ellis","7111"
,"woollen","wōòllén","11111"
,"hurricane","hurrícânè","11113"
,"fuelled","fûéllèd","11114"
,"groom","gröòm","11132"
,"preferring","prėfèŕriñg0","11134"
,"uttered","uttèŕèd","11138"
,"surrendered","sùŕr0endèŕèd","11141"
,"differentiate","différenţïâtè","11145"
,"exceeding","ėxc0êèdiñg0","11146"
,"groove","gröòvè","11148"
,"floppy","floppý","11163"
,"butterflies","buttèŕflîèś","11165"
,"ee","êè","11167"
,"cartoon","cār#töòn","11170"
,"slapped","slappèð","11172"
,"scattering","scattériñg0","11178"
,"jubilee","jübílêè","11179"
,"buzzing","buzziñg0","16111"
,"whipping","wħippiñg0","19111"
,"missus","missμś","21111"
,"corrosive","còŕr0ôsivè","25111"
,"alluring","állūriñg0","31110"
,"confidentially","confídenţìállý","34111"
,"antenna","antenná","35111"
,"whoosh","wħöòŝħ","41114"
,"fattened","fatténèd","49111"
,"cobble","cobblè","61116"

Here are the final lines of the awk file I'm using. It applies functions directly to the fields, which is why I am using awk. The third column in the output comes from a disambiguate function. I had put a gensub in that function to try to 'singl-ify' the double letters.

some code with functions in . . .

BEGIN {FS= "\"" }
{print $2,$4,disambiguate($4),isolate_terminal_vowels($4),devowelCentre(isolate_terminal_vowels($4)),$6}

thx

The fourth bird · Answer 1 · 2021-11-22T20:50:36.153

3

According to this page, awk does not support backreferences in the regexp.

To leave the whitespace characters alone (if those are the field separators), you can use sed and match a non-whitespace character followed by a backreference to it:

sed -E 's/([^[:space:]])\1/\1/g' file

edited Nov 22 '21 at 20:50

answered Nov 22 '21 at 20:31

The fourth bird


Yes, I think I tried doing similar stuff to this with sed some time back but I was creating massively long regex-s to isolate the columns that I wanted to deal with and this was clumsy. – Tobe Nov 22 '21 at 21:16
Excellent. However `aaa` is replaced by `aa`. If characters can be repeated more than once I would suggest `sed -E ':a;s/([^[:space:]])\1/\1/g;ta'` that condenses any repetition to only one occurrence. With GNU sed: `sed -E ':a;s/(\S)\1/\1/g;ta'`. – Renaud Pacalet Nov 23 '21 at 06:23
@RenaudPacalet Thank you for your comment, you could indeed solve it like that. I think repeating the backreference 1+ times can also work to leave only a single occurrence right? `sed -E 's/([^[:space:]])\1+/\1/g' file` – The fourth bird Nov 23 '21 at 08:54
@Thefourthbird Absolutely. Moreover, I would not be surprised if your solution was more efficient than my label/loop. – Renaud Pacalet Nov 23 '21 at 08:59
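
To make the difference between the two variants discussed in these comments concrete, here is a small sketch. The wrapper names `squeeze_pair` and `squeeze_run` are made up for illustration, and it assumes a sed that accepts backreferences with `-E` (GNU and BSD sed do):

```shell
# Hypothetical wrappers around the two sed commands from this thread.
# squeeze_pair: collapses each pair of identical characters to one
squeeze_pair() { sed -E 's/([^[:space:]])\1/\1/g'; }
# squeeze_run: collapses a run of any length to one character
squeeze_run()  { sed -E 's/([^[:space:]])\1+/\1/g'; }

printf 'telly aaa\n' | squeeze_pair   # prints: tely aa
printf 'telly aaa\n' | squeeze_run    # prints: tely a
```

For the double letters in the question both behave the same; they only differ once a character repeats three or more times.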

Ed Morton · Answer 2 · 2021-11-23T11:02:39.483

Awk does not support backreferences in a regexp. Here's how you'd do what you want in awk (updated based on the new input and additional information provided):

$ cat tst.awk
function compress(oldStr,       newStr,lgth,charPos,char,seen,regexp,string) {
    newStr = oldStr
    lgth = length(oldStr)
    for (charPos=1; charPos<lgth; charPos++) {
        char = substr(oldStr,charPos,1)
        # for letters only: if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
        if ( !seen[char]++ ) {
            regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
            string = ( char == "&" ? "\\" : "" ) char
            gsub(regexp,string,newStr)
        }
    }
    return newStr
}

BEGIN { FS=OFS="\"" }
{
    for (i=2; i<NF; i+=2) {
        $i = compress($i)
    }
    print
}

$ awk -f tst.awk file
,"acepted","acçeptėd","18"
,"elis","elis","71"
,"wolen","wōòlén","1"
,"huricane","hurícânè","13"
,"fueled","fûélèd","14"
,"grom","gröòm","132"
,"prefering","prėfèŕriñg0","134"
,"utered","utèŕèd","138"
,"surendered","sùŕr0endèŕèd","141"
,"diferentiate","diférenţïâtè","145"
,"exceding","ėxc0êèdiñg0","146"
,"grove","gröòvè","148"
,"flopy","flopý","163"
,"buterflies","butèŕflîèś","165"
,"e","êè","167"
,"carton","cār#töòn","170"
,"slaped","slapèð","172"
,"scatering","scatériñg0","178"
,"jubile","jübílêè","179"
,"buzing","buziñg0","161"
,"whiping","wħipiñg0","191"
,"misus","misμś","21"
,"corosive","còŕr0ôsivè","251"
,"aluring","álūriñg0","310"
,"confidentialy","confídenţìálý","341"
,"antena","antená","351"
,"whosh","wħöòŝħ","414"
,"fatened","faténèd","491"
,"coble","coblè","616"

Original answer:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            if ( !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}

$ awk -f tst.awk file
satelites satélîtès satélîts satélîtès stlts 1257
marginaly mār#ĝínálý mājénélê mār#ĝínál-ý mr#ĝnlý 1252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl-ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 1257

See https://stackoverflow.com/a/29626460/1745001 for why I special-case \ and ^ when making each char literal while creating the regexp and why I escape & for the replacement string before calling gsub().

Note that I'm doing the above field by field because you specifically said "I want to process it field by field rather than line by line" - it'd obviously be briefer and more efficient to do it a whole line at a time.

If you truly only want to operate on letters (not numbers or punctuation) then change this:

if ( !seen[char]++ ) {

to this:

if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {

For example note how the numbers and dashes aren't compressed in the output below:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}

satelites satélîtès satélîts satélîtès stlts 11257
marginaly mār#ĝínálý mājénélê mār#ĝínál--ý mr#ĝnlý 12252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl--ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 12557

Thanks. Does this work for all single back reference groups not just the double letter case? — Tobe, Nov 23 '21 at 17:19
I'm sorry, I don't know what `all single back reference groups` means. It'll work for any repetitions of any character, and I showed how to make it work only for letters if you like. — Ed Morton, Nov 24 '21 at 13:08

score 1 · Answer 3 · answered Nov 22 '21 at 20:24

What version of awk are you using? Some versions don't support backreferences.

You could consider using Perl with the -p (read a file and print $_ after each line) and -F (split each line into the @F list) flags to get awk-like behavior:

# cat test.txt
fo foo ffooo
bar baar bbaaarrrr
# perl -pF=' ' -e 'for (@F) {s/(.)\1/$1/g}; $_ = join(" ", @F)' test.txt
fo fo foo
bar bar baarr

plentyofcoffee

FYI no awk version supports backreferences in the regexp. – Ed Morton Nov 22 '21 at 20:27
thanks maybe Perl would be more suitable. the -F looks functionally quite similar to what I was hoping to benefit from when using awk. – Tobe Nov 22 '21 at 22:22

score 1 · Answer 4 · answered Nov 23 '21 at 18:38

I'm trying to turn telly into tely.

This sounds like a task for tr -s. Combining it with a few other tools lets you apply it to a selected column. Let the content of file.txt be

,"fritz","fritz","123","456"
,"otto","otto","789","123"
,"sssnake","sssnake","456","789"

and the aim is to remove repeated letters from the second word on each line. Then

cut -d '"' -f1-3 file.txt > file1.txt
cut -d '"' -f4 file.txt > file2.txt
cut -d '"' -f5- file.txt > file3.txt
tr -s '[:alpha:]' < file2.txt | paste -d '"' file1.txt - file3.txt > finalfile.txt

produces finalfile.txt

,"fritz","fritz","123","456"
,"otto","oto","789","123"
,"sssnake","snake","456","789"

Explanation: first cut the input into three files: one with the column of interest (file2.txt) and two others with the columns before it (file1.txt) and after it (file3.txt). Then apply tr's squeeze option (-s) with the set of letters ([:alpha:]), and use paste to glue together the unchanged file1.txt, the altered file2.txt (fed in as standard input, hence the -) and the unchanged file3.txt. This way you do not have to write your own function for it in AWK, but you do need somewhere to store the (temporary) files.
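
To see what the squeeze step does on its own, here is a tiny sketch. `squeeze_letters` is just an illustrative wrapper name:

```shell
# Hypothetical wrapper: squeeze runs of the same letter, leave other characters alone.
squeeze_letters() { tr -s '[:alpha:]'; }

printf 'sssnake otto 1123\n' | squeeze_letters   # prints: snake oto 1123
```

The digits stay as they are because they are not in [:alpha:]. One caveat worth testing on real data: some tr implementations work on bytes rather than characters, so accented letters may not be squeezed the way plain ASCII letters are.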

score 0 · Answer 5 · answered Nov 23 '21 at 17:02

Could one do it with a recursive function?

compress.awk

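# str: string to be compressed; prev_let_1: previous first letter.
# let_1 and remndr are declared as extra parameters so they are local to
# each call; as globals they would be overwritten by the recursive calls.
function compress(str, prev_let_1,    let_1, remndr) {
  let_1  = substr(str,1,1)   # first letter of str
  remndr = substr(str,2)     # remainder

  if (remndr == "") {
    if (prev_let_1 == let_1) let_1 = ""
    return let_1
  }

  if (prev_let_1 == let_1) return compress(remndr, let_1)

  return let_1 compress(remndr, let_1)
}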

BEGIN {FS= "\"";  }
{print $2,compress($2,""), $4, compress($4),$6}

with the following compress.data

,"application","applícâţìòn","1000"
,"officers","offíçèŕś","1000"
,"route","röùte--HETERONYM--#rõùte","1000"
,"routexx","röùte","1000"
,"routeyy","#rõùte","1000"
,"wind","wind--HETERONYM--wînd","1000"
,"winds","windś--HETERONYM--wîndś","1000"
,"windsxx","windś","1000"
,"windsyy","wîndś","1000"
,"windxx","wind","1000"
,"windyy","wînd","1000"
,"degree","dėgrêè","1005"
,"effective","ėffectivè","1008"
,"scottish","scottiŝħ","1009"
,"comfortless","cőmfòŕtless","10000"
,"cupped","cuppèd","10000"
,"footmen","fōòtmen","10000"
,"hatless","hatléss","10000"
,"lullaby","lullabŷ","10000"
,"marry","marrý","10000"
,"nearness","nėàŕness","10000"
,"ridinghood","rîdiñg0hōòd","10000"
,"unhappiest","unhappïést","10000"
,"uninterrupted","unintérruptėd","10000"
,"wonderfuller","wőndèŕfullèŕ","10000"
,"woodmouse","wōòdmõùsè","10000"
,"pulls","pūllś","10005"
,"wellington","welliñg0tón","10007"
,"sufferer","sufférèŕ","10012"
,"communion","cómmûnĩón","10016"
,"loneliness","lônèlïnéss","10017"
,"wallet","wållét","10022"
,"unmarried","unmarrìêd","10032"
,"pill","pill","10038"
,"shoots","ŝħöòts","10039"
,"sierra","sïerrá","10048"
,"critically","criticàllý","10050"
,"puzzle","puzzlè","10065"
,"fatty","fattý","10076"
,"finn","finn","10083"
,"exceeds","ėxc0êèdś","10086"
,"undertook","undèŕtōòk","10089"
,"laterally","latèrállý","11000"
,"gillian","ĝillïán","11008"
,"unaffected","unáffectėd","11009"
,"corrugated","cør#rugâtėd","21002"
,"maggots","maggóts","24100"
,"buffs","buffs","30100"
,"massaging","mássāg2iñg0","31003"
,"stott","stott","32100"
,"beretta","bérettá","43100"
,"installer","instållèŕ","51001"
,"willowy","willówý","51004"
,"sisterhood","sistèŕhōòd","51008"
,"snuggling","snuggliñg0","61000"
,"impermissible","impèŕmissiblè","71005"
,"imbedded","imbeddėd","72100"
,"haswell","haswell","81008"
,"netter","nettèŕ","91003"
,"cc","www","5000"
,"ccitric","wwwhisper","5000"

running

cat compress.data | awk -f compress.awk > compress.out

I get the following output compress.out

application aplication applícâţìòn aplícâţìòn 1000
officers oficers offíçèŕś ofíçèŕś 1000
route route röùte--HETERONYM--#rõùte röùte-HETERONYM-#rõùte 1000
routexx routex röùte röùte 1000
routeyy routey #rõùte #rõùte 1000
wind wind wind--HETERONYM--wînd wind-HETERONYM-wînd 1000
winds winds windś--HETERONYM--wîndś windś-HETERONYM-wîndś 1000
windsxx windsx windś windś 1000
windsyy windsy wîndś wîndś 1000
windxx windx wind wind 1000
windyy windy wînd wînd 1000
degree degre dėgrêè dėgrêè 1005
effective efective ėffectivè ėfectivè 1008
scottish scotish scottiŝħ scotiŝħ 1009
comfortless comfortles cőmfòŕtless cőmfòŕtles 10000
cupped cuped cuppèd cupèd 10000
footmen fotmen fōòtmen fōòtmen 10000
hatless hatles hatléss hatlés 10000
lullaby lulaby lullabŷ lulabŷ 10000
marry mary marrý marý 10000
nearness nearnes nėàŕness nėàŕnes 10000
ridinghood ridinghod rîdiñg0hōòd rîdiñg0hōòd 10000
unhappiest unhapiest unhappïést unhapïést 10000
uninterrupted uninterupted unintérruptėd unintéruptėd 10000
wonderfuller wonderfuler wőndèŕfullèŕ wőndèŕfulèŕ 10000
woodmouse wodmouse wōòdmõùsè wōòdmõùsè 10000
pulls puls pūllś pūlś 10005
wellington welington welliñg0tón weliñg0tón 10007
sufferer suferer sufférèŕ suférèŕ 10012
communion comunion cómmûnĩón cómûnĩón 10016
loneliness lonelines lônèlïnéss lônèlïnés 10017
wallet walet wållét wålét 10022
unmarried unmaried unmarrìêd unmarìêd 10032
pill pil pill pil 10038
shoots shots ŝħöòts ŝħöòts 10039
sierra siera sïerrá sïerá 10048
critically criticaly criticàllý criticàlý 10050
puzzle puzle puzzlè puzlè 10065
fatty faty fattý fatý 10076
finn fin finn fin 10083
exceeds exceds ėxc0êèdś ėxc0êèdś 10086
undertook undertok undèŕtōòk undèŕtōòk 10089
laterally lateraly latèrállý latèrálý 11000
gillian gilian ĝillïán ĝilïán 11008
unaffected unafected unáffectėd unáfectėd 11009
corrugated corugated cør#rugâtėd cør#rugâtėd 21002
maggots magots maggóts magóts 24100
buffs bufs buffs bufs 30100
massaging masaging mássāg2iñg0 másāg2iñg0 31003
stott stot stott stot 32100
beretta bereta bérettá béretá 43100
installer instaler instållèŕ instålèŕ 51001
willowy wilowy willówý wilówý 51004
sisterhood sisterhod sistèŕhōòd sistèŕhōòd 51008
snuggling snugling snuggliñg0 snugliñg0 61000
impermissible impermisible impèŕmissiblè impèŕmisiblè 71005
imbedded imbeded imbeddėd imbedėd 72100
haswell haswel haswell haswel 81008
netter neter nettèŕ netèŕ 91003
cc c www w 5000
ccitric citric wwwhisper whisper 5000
