
I have a file that contains a few thousand unique words/terms. It looks like this:

high school teacher
high school student
library
pencil stand
college professor
college graduate

I need the list of all repeated patterns, so in this case I would need the following as the result:

high
school
high school
college

Is there a way to achieve this in Unix/Vim?

Additional elaboration on requirement:

Q. Do the repeats have to be on a single line, or can they be split over several lines?

  • Ideally, each pattern should be on a new line.

Q. Are the patterns all word sequences (one or more words)?

  • Yes, they are all word sequences.

Q. Does spacing matter within a line? Capitalization? Punctuation?

  • Spaces and punctuation all count as part of the pattern. We can ignore capitalisation.

i.e.

  • School == School != school
  • this pat.tern == this pat.tern != this pattern
Ed Morton
Josh Kurien
  • Again - please [edit] your question to show your attempts to solve the problem yourself so we can best help you with that. – Ed Morton Jun 24 '21 at 23:20
  • @EdMorton I'm actually completely clueless on where to even start with trying to solve it. There was no good solution for finding repeated patterns within a line. All I could do was to make sure that each line is individually unique – Josh Kurien Jun 24 '21 at 23:25
  • Why do you need `high` and `high school` as repeated patterns but not `school`? Do the repeats have to be on a single line, or can they be split over several lines? Are the patterns all word sequences (one or more words)? What's the definition of "word"? Does spacing matter within a line? Capitalization? Punctuation? – Jonathan Leffler Jun 24 '21 at 23:27
  • @JonathanLeffler Thanks for the feedback, have edited the question to better elaborate on the requirements (also, you are right.. `school` is a repeated pattern too) – Josh Kurien Jun 24 '21 at 23:38
  • I think `awk` is probably the best tool for the job, unless you move to Python or Perl or one of the other scripting languages. – Jonathan Leffler Jun 24 '21 at 23:41
  • Making spaces part of the "pattern" makes this more challenging. So you're saying that you might have `foo bar` (one blank between the words) and `foo  bar` (2 blanks between) in your input and those should not be counted as duplicates of each other, right? Please include a case like that in your sample input/output as that's a big one for anyone to test with. – Ed Morton Jun 24 '21 at 23:48
  • if `school teacher` existed on a separate line should it be included in the output too because it was also part of the `high school teacher` line? Include that in your example too, along with any other non-obvious cases you can think of. – Ed Morton Jun 24 '21 at 23:59

1 Answer


This works for me (script placed in a file `script.awk`):

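# Main block: runs once per input line.
{
    for (i = 1; i <= NF; i++)
    {
        # Count the single word at field i ...
        count[$i]++
        sequence = $i
        for (j = i + 1; j <= NF; j++)
        {
            # ... then extend it one word at a time, counting each longer sequence.
            sequence = sequence " " $j
            count[sequence]++
        }
    }
}
# After all input: print every sequence seen more than once.
END {
    for (i in count)
    {
        if (count[i] > 1)
            print i
    }
}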

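The main block, which runs for every input line, builds each contiguous word sequence on the line and increments a count for it. The END block then loops over the counted sequences and prints those with a count of more than one (that is, the sequences that were repeated). The iteration order of `for (i in count)` is unspecified in awk, so the output is piped through `sort` to make it stable.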
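To make the per-line enumeration concrete, here is a minimal sketch that runs only the nested loops (printing instead of counting) on a single sample line. The variable name `sequences` and the inline, stripped-down awk program are assumptions made for illustration; they are not part of the script above.

```shell
# Enumerate every contiguous word sequence on one line.
# This mirrors the loops in script.awk but prints each sequence
# instead of counting it.
sequences=$(printf 'high school teacher\n' | awk '
{
    for (i = 1; i <= NF; i++) {
        sequence = $i
        print sequence                  # the single word at field i
        for (j = i + 1; j <= NF; j++) {
            sequence = sequence " " $j  # extend by one more word
            print sequence
        }
    }
}')
printf '%s\n' "$sequences"
```

For the three-word line this prints six sequences: `high`, `high school`, `high school teacher`, `school`, `school teacher` and `teacher`. A line of n words yields n(n+1)/2 sequences, so very long lines cost proportionally more memory in the `count` array.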

Given the (extended) data file (called `data`):

high school teacher
high school student
library
pencil stand
college professor
college graduate
coelacanths are ancient fish
coelacanths are ancient but still alive
coelacanths are ancient and long lived
coelacanths are ancient and can live to be 100 years old
coelacanths are ancient living fossils
coelacanths can live to be ancient
coelacanths are long-lived
coelacanths are slow to mature
coelacanths are denizens of the deep sea
coelacanths can be found off Africa and Indonesia

The output of `awk -f script.awk data | sort` is:

ancient
ancient and
and
are
are ancient
are ancient and
be
can
can live
can live to
can live to be
coelacanths
coelacanths are
coelacanths are ancient
coelacanths are ancient and
coelacanths can
college
high
high school
live
live to
live to be
school
to
to be

The data deliberately includes some longer repeated sequences of up to four words; longer word sequences would be tracked just as effectively.
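One caveat relates to the question's requirement that spacing is part of the pattern: awk's default field splitting treats any run of blanks as one separator, so `foo bar` and `foo  bar` would be counted as the same sequence. The following is a sketch of one possible variation, not the method used above. It assumes words are separated only by space characters (not tabs) and sets the field separator to the regular expression `[ ]`, so that each space splits a field and extra spaces show up as empty fields. The sample input and the name `result` are hypothetical.

```shell
# Variation that keeps the exact number of spaces between words.
# With -F"[ ]" every single space is a separator, so "foo  bar"
# splits into "foo", "" and "bar"; rejoining with " " restores
# both spaces.
result=$(printf 'foo bar\nfoo bar\nfoo  bar\n' | awk -F'[ ]' '
{
    for (i = 1; i <= NF; i++) {
        if ($i == "") continue                 # a sequence must start on a word
        sequence = $i
        count[sequence]++
        for (j = i + 1; j <= NF; j++) {
            sequence = sequence " " $j
            if ($j != "") count[sequence]++    # and must end on a word
        }
    }
}
END {
    for (s in count)
        if (count[s] > 1) print s
}' | LC_ALL=C sort)
printf '%s\n' "$result"
```

With this input, `foo bar` (one space) appears twice and is reported, while `foo  bar` (two spaces) appears once and is not. The single words `foo` and `bar` are reported as well.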

Jonathan Leffler
  • And if you do not know what a coelacanth is by now, here is the [wikipage](https://en.wikipedia.org/wiki/Coelacanth)! – kvantour Jun 25 '21 at 06:40
  • @kvantour — Thanks! Also, a Wired spin on the latest news about [coelacanths](https://www.wired.com/story/the-coelacanth-may-live-for-a-century-thats-not-great-news/). And from [The Guardian](https://www.theguardian.com/environment/2021/jun/18/mysterious-coelacanth-fish-can-live-for-100-years-study) — there were a number of other stories in the press about this in the last week or two. – Jonathan Leffler Jun 25 '21 at 14:17