0

in my DB i have a table with brand names, and I need to clean it up since i have some duplicates like

Yadda

Yadda & Co

Yadda Engineering

which are the same brand but have been entered by mistake multiple times despite the column is unique

how to approach this? can this be done in mysql query only?

thanks

Community
  • 1
  • 1
Francesco
  • 24,839
  • 29
  • 105
  • 152
  • depends how bizarro the entries are that supposedly match. – Drew Oct 16 '16 at 07:09
  • And you have companies "American General" and "American Standard" and you somehow want to magically handle all this right. Good luck with that one. – Drew Oct 16 '16 at 07:19
  • the best solution to this problem is to avoid the duplication of brand names. U can use select option in html to list the possible brands. (A bit difficult to find and list all brands) – jophab Oct 16 '16 at 07:25

3 Answers3

1

My offer first create FullText index for search column. After with full text score you can determine similarity of strings. Delete them unless which have greater score and greate than threshold.

DELETE FROM table_name 
WHERE
    id <> (SELECT id FROM table_name ORDER BY MATCH (title) AGAINST ('Yada' IN NATURAL LANGUAGE MODE) DESC LIMIT 0,1) AND
    MATCH (title) AGAINST ('Yada' IN NATURAL LANGUAGE MODE) < threshold

Also instead off Fulltext you can use levenshtein function in this answer (with same logic sql query) how to compute similarity between two strings in MYSQL

Community
  • 1
  • 1
Ramin Darvishov
  • 1,043
  • 1
  • 15
  • 30
0

You can use substr

assuming that your rows begin all with the words 'Yadda' you can

 select * from my_table 
 where substr(my_column, 1, length('Yadda')) = 'Yadda');

Then if you need delete the improper rows you can (assuming that you want save the rows with 'Yadda & Co' you could

delete from my_table
where substr(substr(my_column, 1, length('Yadda')) = 'Yadda')
and my_column <> 'Yadda & Co';

or for American

delete from my_table
where substr(substr(my_column, 1, length('American')) = 'American')
and my_column not in  ('American Standard', 'American Airlines');
ScaisEdge
  • 131,976
  • 10
  • 91
  • 107
0

SELECT Name FROM Table WHERE Name REGEXP '^[Y].*$'

In this REGEXP stands for regular expression

and

this is for T-SQL

SELECT Name FROM Table WHERE Name LIKE '[Y]%'