0

I have a column like this :

Codes
--------------------------------------------------
3/1151---------366-500-2570533-1
9/6809---------------------368-510-1872009-1  
1-260752-305-154----------------154-200-260752-1--------154-800-13557-1
2397/35425---------------------------------377-500-3224575-1
17059----------------377-500-3263429-1
126/42906---------------------377-500-3264375-1
2269/2340-------------------------377-500-3065828-1
2267/767---------377-500-1452908-4
2395/118593---------377-500-3284699-1
2395/136547---------377-500-3303413-1
92/10260---------------------------377-500-1636038-1
2345-2064---------377-500-3318493-1
365-2290--------377-500-3278261-12
365-7212--------377-500-2587120-1

How can I extract codes with this format:

3digit-3digit-5to7digit-1to2digit
xxx-xxx-xxxxxx-xx

The result must be :

Codes
--------------------------------------------------
366-500-2570533-1
368-510-1872009-1  
154-200-260752-1 , 154-800-13557-1     -- have 2 code template
377-500-3224575-1
377-500-3263429-1
377-500-3264375-1
377-500-3065828-1
377-500-1452908-4
377-500-3284699-1
377-500-3303413-1
377-500-1636038-1
377-500-3318493-1
377-500-3278261-12
377-500-2587120-1
------------------------------------

This problem is completely tired of me.

Thanks for reading about my problem

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • Regex is possibly one tool which could be used here. But as SQL Server has weak regex support, you might have a hard time using regex. Is there any chance you could use some other tool? – Tim Biegeleisen Apr 28 '19 at 09:02
  • yes but i do not know how to create this template using regex – dasdsadasdsad asdasdas Apr 28 '19 at 09:04
  • What have you tried so far? Really you should be storing this data in separate columns & rows and then you wouldn't have this problem. – Thom A Apr 28 '19 at 09:40
  • It doesn't help, as well, that your delimiter (which I assumed was `'---------'`) isn't of a consistent length and also appears in your data. – Thom A Apr 28 '19 at 10:32

2 Answers2

1

This is really ugly, really really ugly. I don't for one second suggest doing this in your RDBMS, and really I suggest you fix your data. You should not be storing "delimited" (I use that word loosely to describe your data) data in your tables you should be storing in in separate columns and rows. In this case, the first "code" should be in one column, with a one to many relationship with another table with the codes you're trying to extract.

As you haven't tagged or mentioned your Version of SQL Server I've used the latest SQL Server syntax. STRING_SPLIT is available in SQL Server 2016+ and STRING_AGG in 2017+. If you aren't using those versions you will need to replace those functions with a suitable alternative (I suggest delimitedsplit8k(_lead) and FOR XML PATH respectively).


Anyway, what this does. Firstly we need to fix that data to something more useable, so I change the double hyphens (--) to a Pipe (|), as that doesn't seem to appear in your data. Then then use that pipe to split your data into parts (individual codes).

Because your delimiter is inconsistent (it isn't a consistent width) this leaves some codes with a leading hyphen, so I have to then get rid of that. Then I use my answer from your other question to split the code further into it's components, and reverse the WHERE; previously the answer was looking for "bad" rows, where as now we want "good" rows.

Then after all of that, it's as "simple" as using STRING_AGG to delimit the "good" rows:

SELECT STRING_AGG(ca.Code,',') AS Codes
FROM (VALUES('3/1151---------366-500-2570533-1'),
            ('9/6809---------------------368-510-1872009-1'),
            ('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
            ('2397/35425---------------------------------377-500-3224575-1'),
            ('17059----------------377-500-3263429-1'),
            ('126/42906---------------------377-500-3264375-1'),
            ('2269/2340-------------------------377-500-3065828-1'),
            ('2267/767---------377-500-1452908-4'),
            ('2395/118593---------377-500-3284699-1'),
            ('2395/136547---------377-500-3303413-1'),
            ('92/10260---------------------------377-500-1636038-1'),
            ('2345-2064---------377-500-3318493-1'),
            ('365-2290--------377-500-3278261-12'),
            ('365-7212--------377-500-2587120-1')) V(Codes)
    CROSS APPLY (VALUES(REPLACE(V.Codes,'--','|'))) D(DelimitedCodes)
    CROSS APPLY STRING_SPLIT(D.DelimitedCodes,'|') SS
    CROSS APPLY (VALUES(CASE LEFT(SS.[value],1) WHEN '-' THEN STUFF(SS.[value],1,1,'') ELSE SS.[value] END)) ca(Code)
    CROSS APPLY (VALUES(PARSENAME(REPLACE(ca.Code,'-','.'),4),
                        PARSENAME(REPLACE(ca.Code,'-','.'),3),
                        PARSENAME(REPLACE(ca.Code,'-','.'),2),
                        PARSENAME(REPLACE(ca.Code,'-','.'),1))) PN(P1, P2, P3, P4)
WHERE LEN(PN.P1) = 3
   AND LEN(PN.P2) = 3
   AND LEN(PN.P3) BETWEEN 5 AND 7
   AND LEN(PN.P4) BETWEEN 1 AND 2
   AND ca.Code NOT LIKE '%[^0-9\-]%' ESCAPE '\'
GROUP BY V.Codes;

db<>fiddle

Thom A
  • 88,727
  • 11
  • 45
  • 75
0

You have several problems here:

  • Splitting your longer strings into the codes you want.
  • Dealing with the fact that your separator for the longer strings is the same as your separator for the shorter ones.
  • Finding the patterns that you want.

The last is perhaps the simplest, because you can use brute force to solve that.

Here is a solution that extracts the values you want:

with t as (
      select v.*
      from (values ('3/1151---------366-500-2570533-1'), 
                   ('9/6809---------------------368-510-1872009-1'), 
                   ('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
                   ('2397/35425---------------------------------377-500-3224575-1')
           ) v(str)
     )
select t.*, ss.value
from t cross apply
     (values (replace(replace(replace(replace(replace(t.str, '--', '><'), '<>', ''), '><', '|'), '|-', '|'), '-|', '|'))
     ) v(str_sep) cross apply
     string_split(v.str_sep, '|') ss
where ss.value like '%-%-%-%' and
      ss.value not like '%-%-%-%-%' and
      (ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]'
      );

Here is a db<>fiddle.

I would strongly encourage you to find some way of doing this string parsing anywhere other than SQL.

The key to this working is getting the long string of hyphens down to a single delimiter. SQL Server does not offer regular expressions for the hyphens (as some other databases do and as is available in other programming languages). In Python, for instance, this would be much simpler.

The strange values statement with a zillion replaces is handling the repeated delimiters, replacing them with a single pipe delimiter.

Note: This uses string_split() as a convenience. It was introduced in SQL Server 2017. For earlier versions, there are plenty of examples of string splitting functions on the web.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Judging by the Fiddle, Gordon, this doesn't appear to give the results the OP is after. The OP wants the value `'154-200-260752-1 , 154-800-13557-1'`but they are in 2 different rows in your solution. Also, the other 3 rows aren't returned in your data set. If we take this [fiddle](https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=be4b374fbbb37160e4e26ef3f79f30b6) we can see the missing rows don't appear to be being split up correctly. – Thom A Apr 28 '19 at 11:16
  • @Larnu . . . That is a really good catch. It was a very small mistake (I removed the delimiter in the `|-` and `-|` in addition to removing the hyphen). It is now fixed. I have no interest in putting the values back into a single row. That seems like a really bad idea. – Gordon Linoff Apr 28 '19 at 12:37