0

I am taking the MySQL class by Duke on Coursera. In week two there is reference to messy data and I figured I would ask my question here. The scope of my question would be how to compare an entry in a row of table that would already match an instance except that it was entered with a hyphen, i.e. "Golden Retriever Mix" is the same instance as "Golden Retriever- Mix". And when I run a DISTINCT SELECT statement I do not want it to pull both results. The catch is, we cannot just remove all hyphens from the column fields because we still want them for instance for entry of "Golden Retriever-Airedale Terrier Mix". How would a query look for doing this. The example code that pulls in both "Golden Retriever Mix" and "Golden Retriever- Mix" is below.

SELECT DISTINCT breed,
TRIM(LEADING '-' FROM breed)
FROM dogs
ORDER BY (LEADING '-' FROM breed) LIMIT 1000, 1000;

I am thinking I need and IF/THEN statement that says IF(REPLACE(breed,'-','') = breed) THEN DELETE breed;

Obviously this is not correct syntax which is what I am looking for.

Brian Wiley
  • 485
  • 2
  • 11
  • 21
  • Where you've got more than one that are the same (after replacing hyphens) - do you mind which variant you keep? – Bridge May 10 '17 at 07:47

2 Answers2

0

You're almost there, all you need to do is get rid of the plain breed column in your select clause and change TRIM() with REPLACE()

SELECT  DISTINCT REPLACE(breed, '-', ' ')
FROM    dogs

TRIM(LEADING...) would remove the hyphens at the beginning of the string, but what you want to show is the distinct values of breed considering hyphens as spaces.

Edit

I was assuming the two strings were "Golden Retriever Mix" and "Golden Retriever-Mix", but if there's actually a space after the hyphen ("Golden Retriever- Mix"), you can use REPLACE(breed, '-', '') instead

Edit 2

After the clarification in your comment, I think what you need is a GROUP BY clause

SELECT  MIN(breed)
FROM    dogs
GROUP BY REPLACE(breed, '-', ' ')

Any string with an hypen will be considered higher in value than the same string with a space instead, so when there are both this query will return the one with the space. If there's only one instead, it will be returned as is

Stefano Zanini
  • 5,876
  • 2
  • 13
  • 33
0

I think what you are looking for is the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance). This one calculates the difference between words e.g. comparison of "Test" and "Test1" would result to 1 because there is one letter more. You could use the suggested procedures from How to add levenshtein function in mysql? or Levenshtein: MySQL + PHP

This will not only bring up all entries having a leading "-" it even includes the ones with misspelling. You can filter your result data by the calculated distance then.

If you do not want this one because of performance issues you can still use TRIM or REPLACE to filter your symbol and compare it with the other string.

Community
  • 1
  • 1