OK, so I am working on an assignment for a course in my linguistics BA where we use Python to process texts. This is what I needed to do:
Create a script that counts trigram frequencies
- Do not add dummy tokens
- Lowercase every token and concatenate trigram units with an underscore
- What are the missing values in the output box?
- Bonus: Try to solve the task by storing trigrams as tuples in the dictionary
This is how I solved most of it:
lyrics = "Do you remember 21st night of September ? Love was changing the mind of pretenders While chasing the clouds away Our hearts were ringing In the key that our souls were singing As we danced in the night Remember how the stars stole the night away yeah yeah yeah Hey hey hey Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya never was a cloudy day Ba duda ba duda ba duda badu Ba duda badu ba duda badu Ba duda badu ba duda yeah My thoughts are with you Holding hands with your heart to see you Only blue talk and love Remember how we knew love was here to stay Now December Found the love we shared in September Only blue talk and love Remember the true love we share today Hey hey hey Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya never was a cloudy day There was a Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya golden dreams were shiny days Now our bell was ringing aha Our souls was singing Do you remember every cloudy day yau There was a Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya never was a cloudy day There was a Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya golden dreams were shiny days Ba de ya de ya de ya Ba de ya de ya de ya Ba de ya de ya de ya de ya Ba de ya de ya de ya Ba de ya de ya de ya Ba de ya de ya de ya de ya"
lyric = lyrics.lower()
listText = lyric.split(" ")
freq = {}
while len(listText) > 2:
    trigram = (listText[0], listText[1], listText[2])
    if trigram in freq.keys():
        freq[trigram] += 1
    else:
        freq[trigram] = 1
    listText.pop(0)
sorted_data = sorted(freq.items(), key=lambda x: x[1], reverse=True)
for entry in sorted_data:
    print(str(entry[0])+"\t"+str(entry[1]))
The only part I am missing is concatenating the trigram units with an underscore. The output is supposed to be each concatenated trigram followed by its frequency, but right now each line prints the tuple itself, e.g. ('ba', 'de', 'ya'), followed by the count. The teacher said it can be solved very easily, but I can't for the life of me figure out how, which is funny, because everything else here was (relatively) quick and easy.
I have tried many things, but I can't get the underscore-joined output to work.
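To make the target concrete, here is a minimal sketch of the output format I am after, run on a shortened sample instead of the full lyrics. It assumes that `str.join` is an acceptable way to glue the tuple elements together; the `sample` string and the `count_trigrams` helper are made up for illustration and are not part of the assignment. It also uses `split()` rather than `split(" ")`, which only matters if the text contains repeated spaces.

```python
def count_trigrams(text):
    """Count lowercased trigrams, stored as tuples, without dummy tokens."""
    tokens = text.lower().split()
    freq = {}
    # Slide a window of three tokens over the list by index,
    # so the token list itself is left unchanged.
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        freq[trigram] = freq.get(trigram, 0) + 1
    return freq


sample = "Ba de ya de ya de ya"  # hypothetical short sample, not the full lyrics
freq = count_trigrams(sample)

# Sort by frequency, highest first, then join each tuple with underscores.
lines = []
for trigram, count in sorted(freq.items(), key=lambda x: x[1], reverse=True):
    lines.append("_".join(trigram) + "\t" + str(count))
print("\n".join(lines))
```

If that is right, then the only change my script above would need is replacing `str(entry[0])` with `"_".join(entry[0])` in the final `print` line, since every element of the tuple is already a string.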