Find common patterns across strings and group them based on the pattern

Question

How to automatically extract the common characters or common string from a set of input strings? is there an algorithm that does this?

I am trying to figure out how to parse 1000 input strings and automatically create groups of string based on the largest matching patterns.

Is there a library in ruby which does this?

Sample Input

What is your name?
Who wrote this book?
Your name starts with ABC
Is this book good?
Why is your name so long?
Have you read this book?



Expected Output.

your name
——————
What is your name?
Your name starts with ABC
Why is your name so long?

this book
————
Who wrote this book?
Have you read this book?
Is this book good?

Edited to clarify and fixed an error based on luqui's comment.

The case doesn't matter.

You need to specify what a "matching pattern" is. Is it the number of words in common, the size of substrings in common, or what? Does case or punctuation matter? Please clarify by editing your question. — Cary Swoveland, Jun 12 '15 at 18:29
This is an interesting algorithmic challenge; I'm not aware of any standard solutions. First one that comes to mind: make a dictionary of all substrings to their counts. So after the second line we'd have the dictionary `{ "W" => 2, "Wh" => 2, "Wha" => 1, "Who" => 1, "What " => 1, "Who w" => 1, "hat" => 1, "ho " => 1 }` and so on. Go through the list of strings counting up the substrings. Then at the end look for the largest values above a certain length (obviously the single characters will have the highest counts so you need some extra constraints) — luqui, Jun 12 '15 at 18:29
BTW surely "this book" would be the matched pattern for the second set. — luqui, Jun 12 '15 at 18:33

score -2 · Answer 1 · answered Jun 12 '15 at 18:24

-2

You can use core Ruby library:

["your name", "book"].map do |substring|
  [substring, text.lines.map(&:downcase).select { |line| line[substring] }]
end.to_h

# => {
#      "your name" => ["What is your name?", "Your name starts with ABC", ...],
#      "book" => [...]
#    }

answered Jun 12 '15 at 18:24

Andrew Kozin

1,012
9
17

2

I think OP wants to automatically identify "your name" and "book" – luqui Jun 12 '15 at 18:26
1

I don't think the OP wants to specify specific words or strings to match; rather, it is to group strings on some measure of commonality. – Cary Swoveland Jun 12 '15 at 18:27
@Andrew - I want to automatically identify common strings. I don't know "your name" and "book" beforehand – jumpa Jun 12 '15 at 21:27
I see. Then what you strategy would be? For example: 1) take all the existing words sequences (maybe excluding the first word in every line) 2) compact them and receive a dictionary of phrases 3) run my algorithm for all the dictionary (this would be quite slow ~ L^2 at the worst case) 4) sort them descending by the quantity of lines for every phrase 5) either take first x phrases, or take phrases with more than y lines I think this is a story about the algorithm you need, not about the selection of a library – Andrew Kozin Jun 12 '15 at 22:22

Find common patterns across strings and group them based on the pattern

1 Answers1

Linked