Group by similar words

Question

Is there any way to group a table by a text field, having in count that this text field is not always exactly the same?

Example:

select city_hotel, count(city_hotel)
from hotels, temp_grid
where st_intersects(hotels.geom, temp_grid.geom)
and potential=1
and part=4
group by city_hotel
order by (city_hotel) desc

The output I get is the expected, for example, City name and count:

    "Vassiliki ";1
    "Vassiliki";1
    "Vassilias, Skiathos";1
    "Vassilias";5
    "Vasilikí";25
    "Vasiliki";23
    "Vasilias";1

But I'd want to group more this field, and get only one "Vasiliki" (or an array with all, this is not a problem) and a count of all the cells containing something similar between them.

I do not know if could this be possible. Maybe some function to text analysis or something similar?

Depends on the contents of the text field. How do you identify the "clusters", i.e. those different values that belong together? — dhke, Aug 25 '16 at 14:31
That is what I was expecting Postgres to do. find the pattern to match them, and concatenate. Would it be possible or is science fiction? — Pablo Pardo, Aug 25 '16 at 14:36
I don't have a Postgres instance handy, but I see that it supports soundex: https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html. Does that help? — arensb, Aug 25 '16 at 14:37
@ppardoz It's probably not science fiction, but it must likely also won't happen automatically. Remember, to the database those names are just a jumble of characters, it's only us that gives those things a meaning. I, for example, myself wouldn't be able to tell that `Vassiliki` and `Vassilias` should be grouped together. But you might want to lookup *stemming* in the context of information retrieval, which might help you, here. — dhke, Aug 25 '16 at 14:41
It's for sure not sci-fi. I have done it couple years ago with levenstein function from (mentioned before) fuzzystrmatch extension. All you need to do is to write your own, very simple though, function which allows you to enter percent of similarity as a parameter and at the end you'll have a very elegant and simple query. — Borys, Aug 25 '16 at 15:06
by the way, why the first record is twice on list? "Vassiliki ";1 and the next one is duplicated: "Vasilikí";25 "Vasiliki";23 — Borys, Aug 25 '16 at 15:09
That's because there is a space on the first one that is not in the second one. The Data is not being cleaned before it's being inserted into the database. This should be (if PHP just $var = trim($var);) — TolMera, Aug 25 '16 at 15:33

score 1 · Answer 1 · edited May 23 '17 at 12:01

SELECT COUNT(*), `etc` FROM table GROUP BY textfield LIKE '%sili%'; 

// The '%' is a SQL wildcard, which matches as many of any character as required.

You could do something like the above, choosing a word for the 'like' that best fits the spellings that your users have used.

Something that can help with that would be to do a

SELECT COUNT(*), textfield FROM table GROUP BY textfield ORDER BY textfield;

And selecting the most 'average' spelling for your words. Otherwise you're starting to get into a bit of language processing, and for that you will want to write some code outside of SQL.

This would be something like https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

To find word's that are the same within an arbitrary margin of error.

There is a MySQL implementation here that you should be able to transpose as needed https://stackoverflow.com/a/6392380/1287480

(credit https://stackoverflow.com/a/3515291/1287480)

.

(Personal thoughts on the topic)

You Really Really want to think about limiting the input from users that can give you this issue in the first place. It's far far better to give the users a list of places to select from, than it is to push potentially 'dirty' information into your database. That eventually always winds up with you trying to clean the information at a later time. A problem that has kept many people employed for many years.

We are working in limiting the input from users, but at the moment this is the data that I need to work with. Is a third party data, so I can not do anything more than try to clean to work with it. — Pablo Pardo, Aug 26 '16 at 06:44
That's understandable. I'm going to leave my comment there for future visitors to this question who may not be in the same position as you. Did this answer satisfy your needs? — TolMera, Aug 27 '16 at 16:47

Group by similar words

1 Answers1