1

I have implemented a custom ruby method which groups similar text using loops,

 array = ["South East Queensland", "Wide Bay Burnett", "Margaret River", "Port Pirie", "Gippsland", "Elizabeth", "Barossa"]
similarity_group = []
similarity_percentage = 60.0

array.each do |first_text|
  results.each do |second_text|
    result = first_text.upcase.similar(second_text.upcase)
     if result >= similarity_percentage
      ...
      ...
      ...
    end  
  end
end

consider the above implementation for 2000 element,then to group them it will cost 4000000 iteration because, each element will check with each other.

is there any performant solution or gems or library like grouping a bulk array based on their similarity.

(I need to use the same array element for similarity check)

sample expectation : [array1].similarity([array1])

Ferdy
  • 666
  • 7
  • 25
  • Where does `similar` come from? Is it from a gem? Which one? – Eric Duminil Jan 30 '17 at 13:53
  • Note : You use `first_text` twice. I suppose similarity will always be 1 :) One obvious optimization would be to check 2 strings only if `string1 >= string2`. It cuts the iterations in half. – Eric Duminil Jan 30 '17 at 13:56

2 Answers2

4

I think you're looking for clustering based on string distance (e.g. Levensthein). From your code, it looks like similar is already implemented, so it must be clear which distance you're considering.

For clustering, these two gems could help.

The problem is hard though, so your O(n**2) is actually not so bad in comparison. You could avoid comparing string pairs twice by just checking that string1 > string2 before calculating the distance. You'd need (2000*1999)/2 iterations instead of 2000**2.

Community
  • 1
  • 1
Eric Duminil
  • 52,989
  • 9
  • 71
  • 124
  • Thanks @Eric Duminil, yes similar is a gem which helps to find similarity percentage between two string. and sorry it's a typo. I updated my question. – Ferdy Jan 30 '17 at 17:54
1

I have a matrix solution I used in the past to compare ebooks, since these are quite large and take time to process I solved that in the first place with a matrix, later on I used an index on a value I derived from the books which was as accurate and much faster. I could search the matrix solution for you but I have another simple routine based on levenshtein which looks much as what you want to accomplish, creating a similarity array (a hash here).

I publish the working version of my script as in my scriptlibrary so you'll probably want to adapt some of it. I used it as an answer previously on this question. You could eg build an array in the same way.

require 'levenshtein'

MAX_DISTANCE, COMPENSATION = 3, 5

strings = [
    "Hazard Const. Company",
    "hazard construction company",
    "PETERSON-CHASE GENERAL ENGINEERING CONSTRUCTION INC",
    "peterson-chase general  engineering construction inc",
    "TRAFFIC DEVELOPMENT SERVICES ",
    "traffic development services"
]

result = {}
strings.each do |s|
    s.downcase!
  similar = result.keys.select { |key| Levenshtein.distance(key, s) < MAX_DISTANCE+(s.length/COMPENSATION) }
  if similar.any?
    result[similar.first] += 1
  else
    result.merge!({s => 1})
  end
end

p result
# {"hazard const. company"=>2, "peterson-chase general engineering construction inc"=>2, "traffic development services "=>2}
Community
  • 1
  • 1
peter
  • 41,770
  • 5
  • 64
  • 108