1

I am trying to find records in my database (SQL Server) that have values in common. I am grouping them and counting them as below:

select count(ItemNum) as RecordCount, Vendor, PartNumber 
from Products
where (RecordCount > 2)
group by Vendor, PartNumber
order by RecordCount desc, Vendor, PartNumber

I want the result table to have all existing combinations of Vendor and PartNumber and look like RecordCount, Vendor, PartNumber

This works great, but sometimes there are minor differences in the values and I want to group them together despite those differences. Specifically I want to ignore whitespace and non-alphanumeric characters.

For example, PartNumber = "p120" should match with "p1_20" and "p1 20".

How can I do this?

GreySage
  • 1,153
  • 19
  • 39
  • which database are you using? – nbk Mar 29 '22 at 18:11
  • @nbk Microsoft SQL server. – GreySage Mar 29 '22 at 18:18
  • sql server is missng such functionality tried with a function like https://stackoverflow.com/questions/21378193/regex-pattern-inside-sql-replace-function – nbk Mar 29 '22 at 18:46
  • 1
    If your part number is a string, how can you be sure `PL0-21` is the same as `PL02-1`? (I don't think you have a query problem to solve, I think you have a data problem to solve.) – Aaron Bertrand Mar 29 '22 at 20:16
  • @AaronBertrand The point of the query is to solve the data problem, by showing a user the potential duplicates and allowing them to resolve them. So, I have both problems :) – GreySage Mar 31 '22 at 15:44

2 Answers2

2

Try to use string functions to remove whitespaces. Here is list below:

LTRIM RTRIM REPLACE

These functions will remove / replace extra chars. For example:

-- PartNumber = " 120_1 "
SELECT REPLACE(LTRIM(RTRIM(PartNumber)), '_', '') FROM Products
-- PartNumber = "1201"

I created a table with test data:

INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '1025');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (14, 'VendorName', '1_ 025');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '102 5');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '102_5');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', ' 1025 ');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '102 5');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (223, 'OtherVendorName', '    9_02');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (102, 'OtherVendorName', '902');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (255, 'OtherVendorName', '  902 ');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (31, 'OtherVendorName', '902_');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (33, 'OtherVendorName', '9 02');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (55, 'OtherVendorName', '  902  ');
    

Complete query with data from the previous code block.

SELECT COUNT(ItemNum) AS RecordCount, Vendor, REPLACE(REPLACE(LTRIM(RTRIM(PartNumber)), '_', ''), ' ', '') AS PartNumber
FROM Products
GROUP BY Vendor, REPLACE(REPLACE(LTRIM(RTRIM(PartNumber)), '_', ''), ' ', '')
HAVING COUNT(ItemNum) > 2
ORDER BY COUNT(ItemNum) DESC, 2, 3

Returns

RecordCount Vendor PartNumber
6 OtherVendorName 902
6 VendorName 1025
L0ndl3m
  • 141
  • 5
  • Please don't use images for data, especially images which can't be seen without clicking a link. The editor allows tabular data as I have done for you. – Dale K Mar 29 '22 at 19:37
  • 1
    Thank you for the note. I will do as you advised. – L0ndl3m Mar 29 '22 at 19:53
  • I suppose I can use replace for each unwanted character, but I imagine if will be slow. – GreySage Mar 31 '22 at 15:51
-1

Assuming you are evaluating only numbers in those part numbers and letters are ignored...

SELECT COUNT(p2.PartNumber) RecordCount, p1.Vendor, p1.PartNumber 
FROM Products p1
LEFT JOIN Products p2 ON LEFT(SUBSTRING(p1.PartNumber, PATINDEX('%[0-9]%', p1.PartNumber), 4000),PATINDEX('%[^0-9]%', SUBSTRING(p1.PartNumber, PATINDEX('%[0-9]%', p1.PartNumber), 4000)))
= LEFT(SUBSTRING(p1.PartNumber, PATINDEX('%[0-9]%', p2.PartNumber), 4000),PATINDEX('%[^0-9]%', SUBSTRING(p2.PartNumber, PATINDEX('%[0-9]%', p2.PartNumber), 4000)))
GROUP BY p1.Vendor, p1.PartNumber,
HAVING COUNT(p2.PartNumber) > 2
ORDER BY COUNT(p2.PartNumber) DESC, Vendor, PartNumber

I just set the substring with a max for nvarchar since I don't know your data types.

GlockByte
  • 92
  • 5