1

I am managing a large databse. I am trying to Join a table. but the data in the Colums dont actualy match. One has dashes and the other has space. Such GPD 142 pol (Partnumber)in the Company table and GPD-142-pol (PartNumber)in the Customer table.

My query is written like this:

SELECT *
  FROM CompanyPartsList
  JOIN SalesReport
  On FordPartsList.[Company Part Number] = SalesReport.[Customer Part #]

I tries something like this

SELECT *
  FROM CompanyPartsList
  JOIN SalesReport
  On FordPartsList.[Company Part Number] Like SalesReport.[Customer Part #]

Any help would be appreciated.

ChrisPasa
  • 113
  • 5
  • 15
  • does it have only '-' or any other characters in between?? – Sushil Jun 12 '15 at 17:28
  • doing any operation in a join besides equals, or an operation on a joined column, if there is allot of data, will cause a huge performance problem. its probably best you create a field or trigger to create a column that is 1 to 1 – Michael Rudner Evanchik Jun 12 '15 at 17:33
  • Aside from solving the querying problem, I believe the business must look at improving quality of data – zedfoxus Jun 12 '15 at 17:36

5 Answers5

3

again doing this will be very slow , a solution would be a trigger to create the correct formatted column on either side

SELECT *
FROM CompanyPartsList
JOIN SalesReport
On FordPartsList.[Company Part Number] = Replace(SalesReport.[Customer Part #],'-',' ')
Rahul Tripathi
  • 168,305
  • 31
  • 280
  • 331
1

Try replacing characters that can cause the values to be different.

SELECT
    *
FROM
    CompanyPartsList cpl,
    SalesReport sr
WHERE
    REPLACE(REPLACE(cpl.[Company Part Number],'-',''),' ','') = REPLACE(REPLACE(sr.[Customer Part #],'-',''),' ','')
JamieD77
  • 13,796
  • 1
  • 17
  • 27
0

At the end of the day, for the JOIN to work, you've got to have matching values. So your only option is to find a way to make them equal.

Given your examples, I'd experiment to see if there's a way that you could normalize the values to a standard. For example, you could try to remove all the spaces and hyphens on both sides by using REPLACE, and see if that does it for you.

If you're able to get matches that way, you've got two choices. You could always do that normalization on-the-fly when you do the JOIN, but that would probably be performance prohibitive. Or you could add another column to each table, which you update at the same time you update the real part#, setting it to that value with the spaces, hyphens, etc, removed.

Aside from that headache, your potential problem is the chance of clashes. What happens if you've got part#s "123-4B" and "12-34B"? If you use my proposal, those two products would appear the same.

Chris Wuestefeld
  • 3,266
  • 2
  • 23
  • 23
  • It will be slow, but you should be able to JOIN on a calculated field using REPLACE to normalize one table's column to match the other table. – Tingo Jun 12 '15 at 17:38
  • Thats a great answer but on my parts the - got replaced with a space on an import. But ill keep that in mind. – ChrisPasa Jun 12 '15 at 17:44
0

Create a function as mentioned in question 1007697 and modify it gently to strip off anything but alphabets and numbers

Create Function [dbo].[RemoveNonAlphaNumericCharacters](@Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin

    Declare @KeepValues as varchar(50)
    Set @KeepValues = '%[^a-z0-9]%'
    While PatIndex(@KeepValues, @Temp) > 0
        Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

    Return @Temp
End

Then you could compare data, but it may be slow on a large table:

SELECT *
  FROM CompanyPartsList
  JOIN SalesReport
  On RemoveNonAlphaNumericCharacters(FordPartsList.[Company Part Number]) 
   = RemoveNonAlphaNumericCharacters(SalesReport.[Customer Part #])
Community
  • 1
  • 1
zedfoxus
  • 35,121
  • 5
  • 64
  • 63
  • this is also an option, again only one side needs the format, doing on on both would be twice the time. but we dont know his data for sure. this would make sure on both sides but god awful slow. a trigger on insert to create the -'s instead of spaces would be a good idea, on top of the initial update statement for the current data – Michael Rudner Evanchik Jun 12 '15 at 17:42
  • @MichaelRudnerEvanchik - if one side is "ABC--123" and other side is "ABC-123", applying format only on one side will not help. It will be really slow. We are assuming that `-` and spaces are the only differences. Regardless, I hope that they can cleanse the data and keep it clean and have a better future – zedfoxus Jun 12 '15 at 17:48
  • i did say in the middle we dont really know his data. but he said left side has - right side has spaces where the hyphens would be only difference. calling a udf and regex and patindex a little overkill, but i copied your code down if it makes you feel better for other uses (not in a join) – Michael Rudner Evanchik Jun 12 '15 at 17:51
  • Using a table valued function instead of a loop inside a scalar function would be much faster here too. I will post a version of it as another answer here for the sake of completeness. – Sean Lange Jun 12 '15 at 18:35
0

If you end up wanting to use a function I would recommend using an inline table valued function instead of a scalar function that has a while loop inside. The performance of that scalar function is going to degrade quickly as the table gets larger. Here is an example of using an inline table valued function and a tally table so the replacement is set based.

If this was my code I would prefer to use the REPLACE option if that is a possibility.

CREATE FUNCTION [dbo].[StripNonAlphaNumeric_itvf]
(
    @OriginalText VARCHAR(8000)
) RETURNS TABLE WITH SCHEMABINDING AS
RETURN

WITH
    E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    Tally(N) AS 
    (
        SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )

select STUFF(
(    
    SELECT SUBSTRING(@OriginalText, t.N, 1)
    FROM tally t
    WHERE 
    (
        ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND  57 --numbers 0 - 9
        OR
        ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 65 AND  90 --UPPERCASE letters
        OR
        ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 97 AND  122 --LOWERCASE letters
    )
    AND n <= len(@OriginalText)
    FOR XML PATH('')
), 1 ,0 , '') AS CleanedText 
Sean Lange
  • 33,028
  • 3
  • 25
  • 40