1

My table looks something like this:

| id (int) | sentence (varchar) |

I want to find all rows that are almost the same except for one particular word. Eg:

| 230 | test |
| 321 | test sth |
...
| 329 | is (sth) it?
| 923 | is it?

The word that can be different is sth in this case. Ideally I could use some sort of "array" with the list of words that can be different.

Is this something I could do purely in SQL?

Cornwell
  • 3,304
  • 7
  • 51
  • 84
  • 1
    full text search and sort by relevance? basic sql is totally crappy for "fuzzy" string matching like this. – Marc B Apr 10 '14 at 15:22
  • FYI: There is a link to the official [MySQL regex documentaiton](https://dev.mysql.com/doc/refman/5.1/en/regexp.html) in the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), listed under "General Documentation > Official documentation for specific flavors > Databases". – aliteralmind Apr 10 '14 at 15:33
  • Does "the same except for one particular word" mean "the particular word can be present or absent" or does it mean "the particular word might be different / misspelled / replaced with any other word but the rest of the sentence stays the same" ? – TessellatingHeckler Apr 10 '14 at 16:26

2 Answers2

0

Just an untested quick shot, sorry, but I think you could do something like

SELECT * FROM table GROUP BY REPLACE(text, 'sth', '')
Jens
  • 25,229
  • 9
  • 75
  • 117
  • Assuming, of course, that you want to remove *every* occurrence of `'sth'` from *every* record. Plus your select list should probably aggregate over `id`, e.g. `GROUP_CONCAT(id)`. – eggyal Apr 10 '14 at 15:28
  • This query should not modify anything. What do you mean by "remove"? – Jens Apr 10 '14 at 17:45
0

You can use SOUNDEX. So with the examples that you gave, these queries:

SELECT SOUNDEX('test')
SELECT SOUNDEX('test sth')
SELECT SOUNDEX('is (sth) it?')
SELECT SOUNDEX('is it?')

return these results:

T230
T230
I200
I200

That means that the first two and the second two sound like each other. What I can't be sure of is how well this will work with your actual data, you're just going to have to try it.

Mike Perrenoud
  • 66,820
  • 29
  • 157
  • 232