MySQL Field Data Parsing

Question

Started here, I was asked to create a new post in order to flesh it out which I'll try to do here but I'm afraid it's rather wordy. I created a test table with what seems to be a good cross-section of the data that needs to be parsed. Using this query from here to which I added some additional fields that are needed, this is the query now.

SELECT ID, PartNo,
  GROUP_CONCAT(
    REPLACE(t.part, '-', CONCAT(' ', LEFT(t.part, 2)))
    ORDER BY t.ModelList
    SEPARATOR ' '                                      
  ) Models
FROM (
  SELECT t.ID, t.PartNo, t.Models, p.ModelList,
    REPLACE(REPLACE(
      SUBSTRING_INDEX(t.Models, ';', p.ModelList),
      SUBSTRING_INDEX(t.Models, ';', p.ModelList - 1),
      ''
    ), ';', '') part 
  FROM parts_listing t CROSS JOIN (
    SELECT 1 ModelList UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
    SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL
    SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
  ) p 
  WHERE REPLACE(REPLACE(Models, '-', ''), ';', '') REGEXP '^[0-9]*$'
) t
WHERE t.part <> ''
GROUP BY t.ID, t.PartNo, t.Models
ORDER BY t.PartNo, t.Models

For the most basic parts, the Models data looks like this:

2206-13-26-33; 2302-06-13-32-33

. . . and this is the part that I am working on now. The Models column also contains ALL 23RD etc that will also need to be parsed somehow but for now I am concerned only with the data as shown above and I need to parse it to something like:

2206 2213 2226 2233 2302 2306 2313 2332 2333

The query above, however, seems to be repeating the first bit so it comes out as:

2206 2213 2226 2233 2202 2206 2213 2232 2233

It gives the result in one chunk but it would also be okay if each result was on it's own line as ultimately it will be dumped into another table and that is the way it will be by then.

It also gives too few results. The actual table contains over 15,000 rows but this is fetching only a small fraction and in the data sample below, it should fetch about 20 rows but actually fetches only 6 although the fiddle, which has a very limited data set, does indeed give the proper results. Running the the fiddle query against my own data fetches only 555 rows.

A cross-section sample of data can be created with this:

DROP TABLE IF EXISTS `parts_listing`;
CREATE TABLE IF NOT EXISTS `parts_listing` (
  `ID` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `GroupNumber` varchar(20) DEFAULT NULL,
  `PartNo` varchar(30) DEFAULT NULL,
  `Models` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=34 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;

INSERT INTO `parts_listing` (`ID`, `GroupNumber`, `PartNo`, `Models`) VALUES
    (1, '0.000', '391906', '2201-02-11-20-22-32-40; 2301'),
    (2, '0.000', '391907', '2206-26-33'),
    (3, '0.000', '399391', '2213'),
    (4, '0.000', '415209', '2302-06-32-33'),
    (5, '0.000', '415245', '2313'),
    (6, '0.000', '437273', 'ALL 24TH; 25TH; 26TH'),
    (7, '0.000', '436803', '2631 (BODY 2678); 5431 (BODY 5478)'),
    (8, '0.000', '448611', 'ALL 54TH'),
    (9, '0.0015', '372174', 'ALL 22ND; 23RD'),
    (10, '0.0015', '399324', '2201-02-11-20-22-32-40; 2301'),
    (11, '0.0015', '399324', '2206-13-26-33; 2302-06-13-32-33'),
    (12, '0.0015', '372146', 'ALL 22ND; 23RD'),
    (13, '0.0015', 'G179359', '2201-02-11-20-22-32-40; 2301'),
    (14, '0.0015', 'G179559', '2206-13-26-33; 2302-06-13-32-33'),
    (15, '0.0015', '422024', '2201-02-11-20-22-32-40; 2301'),
    (16, '0.0015', '422024', '2206-13-26-33; 2302-06-13-32-33'),
    (17, '0.0016', '378664', 'ALL 22ND; 23RD'),
    (18, '0.0016', 'G178532', 'ALL 22ND; 23RD'),
    (19, '0.0016', 'G123197', 'ALL 22ND; 23RD'),
    (20, '30.793', '448421', 'ALL 54TH'),
    (21, '30.7932', '422762', '2462-65-92-95-98; 2562-65-92-95; 2662-65-92-95'),
    (22, '30.7932', '448902', '5462-65-67-82-85-92-95-97'),
    (23, '30.7938', '399941', 'ALL 22ND; 23RD'),
    (24, '30.7938', '425809', 'ALL 24TH; 25TH; 26TH'),
    (25, '30.7938', '447799', 'ALL 54TH'),
    (26, '30.79383', '399927', 'ALL 22ND; 23RD; 24TH; 25TH; 26TH'),
    (27, '30.79396', 'PA437550', '2562-65-77-79; 2677-79'),
    (28, '30.79398', '422470', '2452-72; 2552-72'),
    (29, '30.79398', '422471', '2452-72; 2552-72'),
    (30, '30.79398', '422553', 'ALL 24TH; 25TH'),
    (31, '30.8007', '425517', '2413-52-62-65-67-72-92-95-98; 2552-62-65-72-77-92-95; 2650-51-52-62-65-72-77-92-95-97; 5450-51-52-62-65-67-72-77-82-85-92-95-97'),
    (32, '30.8007', '425920', 'RHD, 2462-65-92-95; 2562-65-92-95; 2662-65-92-95-97; 5462-65-82-85-92-95-97'),
    (33, '30.8044', 'G129662', '2293');

nbk · Answer 1 · 2019-09-10T08:58:12.467

You can try this. I changed the parts where You split the Models column. I split them into different rows, and let the code from forpas go over it and join them again together. rows with text in it aren't touched, because you didn't specify nothing.

select  Id,GroupNumber,PartNo,
  group_concat(
    replace(t.part, '-', concat(' ', left(t.part, 2)))
    order by t.partno, t.part ASC
    separator ' '                                      
  ) Models
from
(SELECT
  Id,GroupNumber,PartNo,
  TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(models, ';', n.digit+1), ';', -1)) part
FROM
  ( SElECT Id,GroupNumber,PartNo,
   REPLACE(REPLACE(REPLACE(Models, ')', ''), 'RHD,', ''), '(BODY', ';') Models
   FROM
   parts_listing ) p_l
  INNER JOIN
  (SELECT 0 digit UNION ALL select 1 union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9) n
  ON LENGTH(REPLACE(Models, ';' , '')) <= LENGTH(Models)-n.digit
  Where REPLACE(REPLACE(REPLACE(Models, '-', ''), ' ', ''), ';', '') REGEXP '^[0-9]+$'
ORDER BY
  Id,
  n.digit,part) t
   Group by  Id,GroupNumber,PartNo
   ;

Gets you

Id  GroupNumber     PartNo  Models
1   0.000   391906  2201 2202 2211 2220 2222 2232 2240 2301
2   0.000   391907  2206 2226 2233
3   0.000   399391  2213
4   0.000   415209  2302 2306 2332 2333
5   0.000   415245  2313
7   0.000   436803  2631 2678 5431 5478
10  0.0015  399324  2201 2202 2211 2220 2222 2232 2240 2301
11  0.0015  399324  2206 2213 2226 2233 2302 2306 2313 2332 2333
13  0.0015  G179359     2201 2202 2211 2220 2222 2232 2240 2301
14  0.0015  G179559     2206 2213 2226 2233 2302 2306 2313 2332 2333
15  0.0015  422024  2201 2202 2211 2220 2222 2232 2240 2301
16  0.0015  422024  2206 2213 2226 2233 2302 2306 2313 2332 2333
21  30.7932     422762  2462 2465 2492 2495 2498 2562 2565 2592 2595 2662 2665 2692 2695
22  30.7932     448902  5462 5465 5467 5482 5485 5492 5495 5497
27  30.79396    PA437550    2562 2565 2577 2579 2677 2679
28  30.79398    422470  2452 2472 2552 2572
29  30.79398    422471  2452 2472 2552 2572
31  30.8007     425517  2413 2452 2462 2465 2467 2472 2492 2495 2498 2552 2562 2565 2572 2577 2592 2595 2650 2651 2652 2662 2665 2672 2677 2692 2695 2697 5450 5451 5452 5462 5465 5467 5472 5477 5482 5485 5492 5495 5497
32  30.8007     425920  2462 2465 2492 2495 2562 2565 2592 2595 2662 2665 2692 2695 2697 5462 5465 5482 5485 5492 5495 5497
33  30.8044     G129662     2293

Dbfiddle example

Thank you. It seems to be working after a fashion but you'll note that some of the results have only three digit numbers when all must be four digit. Also, note that there are some rows that contain both ordinal numbers and dash-separated values and those dash-separated values need to be parsed, ignoring the other pieces. I have to somehow parse the ordinal numbers too but that will probably have to be for a different post as it requires a database lookup. For the current code, only the four digit numbers should be in the results and rows that have only ordinals shouldn't show up at all. — DonP, Sep 09 '19 at 01:34
so i changed it. You other rows, as i said in the inner select you have all data in a singkle row i.e ALL 54TH thtat you can detect and replace it. I don'T know what — nbk, Sep 09 '19 at 10:48
Thank you, that appears to be pulling out the proper information but it's coming out in reverse order with the higher numbers first. Any way to keep them in the order that they are as shown in my question above? Also, it should be returning only 20 rows (if I counted correctly) but it is pulling up all. It does not need to show any rows that do not contain the four-digit numbers and I'm not sure how to do that. — DonP, Sep 09 '19 at 18:01
So ichanged ia one more time., now it is sorted and everything without numbers out — nbk, Sep 09 '19 at 19:18
Much better! However, it is pulling up only 18 rows when it should be pulling up 20 so it's still missing ID 7 and 32. On 7, it needs to parse the numbers outside the parenthesis and on 32 it needs to parse the entries, ignoring the RHD, at the beginning. — DonP, Sep 10 '19 at 00:43
As you can see, i removed or changed(See replacement of (Body to ;) the unwanted parts, so far that works because there are very simple pattern to eliminate.or repplace — nbk, Sep 10 '19 at 09:01
Thank you and this seems to work too as does the other answer by someone else. Yours gives 20 rows which is correct while the other splits the groups to 39 rows which is also correct. — DonP, Sep 10 '19 at 16:05

Andrei Odegov · Accepted Answer · 2019-09-12T10:03:35.227

1

I think you need a more sophisticated way to verify that the Models column values are correct due to spaces after semicolons. In addition, if you need to process the Models column values that contain more than ten parts, you will need a longer sequence of numbers. I use the INFORMATION_SCHEMA COLUMNS table for this. You can also create an auxiliary table for this.

SELECT
  CONCAT(
    IF(N.n = 0, '', LEFT(t.part, 2)),
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(t.part, '-', N.n + 1), '-', -1
    )
  ) AS part, t.isHD, t.ID, t.GroupNumber, t.PartNo, t.Models
FROM Numbers AS N CROSS JOIN (
  SELECT
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(t.part, ' ', N.n + 1), ' ', -1
    ) AS part, t.isHD, t.ID, t.GroupNumber, t.PartNo, t.Models
  FROM Numbers AS N CROSS JOIN (
    SELECT
      SUBSTRING_INDEX(
        SUBSTRING_INDEX(t.part, ';', N.n + 1), ';', -1
      ) AS part, t.isHD, t.ID, t.GroupNumber, t.PartNo, t.Models
    FROM Numbers AS N CROSS JOIN (
      SELECT
        REPLACE(
          SUBSTRING_INDEX(
            SUBSTRING_INDEX(t.part, 'LHD', N.n + 1), 'LHD', -1
          ), ',', ''
        ) AS part, IF(N.n > 0, 1, t.isHD) AS isHD,
        t.ID, t.GroupNumber, t.PartNo, t.Models
      FROM Numbers AS N CROSS JOIN (
        SELECT
          SUBSTRING_INDEX(
            SUBSTRING_INDEX(p.Models, 'RHD', N.n + 1), 'RHD', -1
          ) AS part, IF(N.n > 0, 2, 0) AS isHD,
          p.ID, p.GroupNumber, p.PartNo, p.Models
        FROM Numbers AS N CROSS JOIN parts_listing AS p
        WHERE p.Models REGEXP '[[:<:]]([LR]HD,? +)?[0-9]+(-[0-9]+)*[[:>:]]' AND
              N.n <= (LENGTH(p.Models) - LENGTH(REPLACE(p.Models, 'RHD', ''))) / 3
      ) AS t WHERE N.n <= (LENGTH(t.part) - LENGTH(REPLACE(t.part, 'LHD', ''))) / 3
    ) AS t WHERE N.n <= LENGTH(t.part) - LENGTH(REPLACE(t.part, ';', ''))
  ) AS t WHERE N.n <= LENGTH(t.part) - LENGTH(REPLACE(t.part, ' ', ''))
) AS t WHERE t.part REGEXP '^[0-9]+(-[0-9]+)*$' AND
             N.n <= LENGTH(t.part) - LENGTH(REPLACE(t.part, '-', ''));

Partial output (29 rows):

+------+------+----+-------------+---------+--------------------------------------+
| part | isHD | ID | GroupNumber | PartNo  |                Models                |
+------+------+----+-------------+---------+--------------------------------------+
| 2201 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2202 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2211 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2220 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2222 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2232 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2240 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2301 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2213 |    1 |  3 | 0.000       | 399391  | LHD 2213                             |
| 2313 |    2 |  5 | 0.000       | 415245  | RHD, 2313                            |
| 2631 |    0 |  7 | 0.000       | 436803  | 2631 (BODY 2678); 5431 (BODY 5478)   |
| 5431 |    0 |  7 | 0.000       | 436803  | 2631 (BODY 2678); 5431 (BODY 5478)   |
| 2206 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2213 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2226 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2233 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2302 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2306 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2313 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2332 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2333 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2201 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2202 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2211 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2220 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2222 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2232 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2240 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2301 |    2 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
+------+------+----+-------------+---------+--------------------------------------+

Demo on DB Fiddle.

edited Sep 12 '19 at 10:03

answered Sep 09 '19 at 07:10

Andrei Odegov

2,925
2
15
21

This is good but I'm not sure if it's pulling up everything. It appears that each section is on its own line in the results which is fine but, if I counted properly, there should be about 39 rows in the output, not 33. Maybe it is missing the entries in 2631 (BODY 2678); 5431 (BODY 5478) as it should, for now, ignore anything within parenthesis which I forgot to say in the original question. Also, doing error checking is both good and bad. If it filters any errors, that's good but it also keeps me from seeing and repairing the errors which is not good! – DonP Sep 09 '19 at 18:57
Incidentally, there are no Models entries with more than nine parts and, as far as I can tell, none will be longer than seven. – DonP Sep 09 '19 at 19:00
Note also that these do not necessarily start with a four-digit number. There may be some that have ALL 23RD; 2401-02; ALL 25TH for example and there are some with RHD, or LHD, (with commas). – DonP Sep 09 '19 at 19:06
1

I updated my answer by adding another layer to split the `Models` column values by spaces. – Andrei Odegov Sep 10 '19 at 11:02
Great and when split this way, 39 is the exact number of rows that there should be so it appears that nothing has been missed. Can we put each digit on a single row? – DonP Sep 10 '19 at 16:01
On the RHD, (and some my have LHD, and it isn't necessarily at the beginning of the row so it would be only on all the groupings following), is it possible, rather than remove it as I asked, to instead put it into a new column? I was going to post this in another question but maybe doing it here is better. – DonP Sep 10 '19 at 16:39
1

My answer has been updated. In it I use an auxiliary table of numbers. But I do not understand what to do with the RHD (and the LHD). Can you give an example? – Andrei Odegov Sep 10 '19 at 19:04
That looks like it will work perfectly! I'll have to change it a bit to match my column names but your test output is perfect as far as I can tell. On the other issue, I want to add a varchar column to the table (isRHD) and simply put "RHD" into it on any entries where it applies. For example, entries like this RHD, 2462-65-92-95; 2562-65-92-95 with have RHD (or LHD) for all of them while entries like this 2462-65-92-95; RHD, 2562-65-92-95 will have it only on those it precedes. The data itself is unclear so I am supposing this is what was meant when it was created originally in the 1950s. – DonP Sep 11 '19 at 02:14
Is it necessary to have a temporary table Numbers or can it be done without it? – DonP Sep 11 '19 at 02:15
1

This table is created in lines 3 through 10 of the SQL Schema panel. Using this table makes the query faster, concise, and readable. In the latest version of my answer, I also tried to process the LHD and the RHD. – Andrei Odegov Sep 11 '19 at 13:36
That's great, thank you! Doing the parsing in PHP was taking many hours to run but this does it in less than a minute. However, the RHD is getting only the first group so for entries like RHD, 2462-65-92-95; 2562-65-92-95; 2662-65-92-95-97; 5462-65-82-85-92-95-97, only those beginning with 24 have it marked. Also, RHD, isn't always at the beginning - it can be before any grouping and will apply to it any any after. Since RHD, (or LHD,) always in in a comma, maybe they can be found as a four character string ending in a comma. – DonP Sep 11 '19 at 15:29
1

Is the RHD/LHD always at the beginning of a line? – Andrei Odegov Sep 11 '19 at 16:06
No, I keep saying that it is not always at the beginning but it does always have a comma after it. – DonP Sep 11 '19 at 16:14
On the RHD/LHD issue, did you tweak it? It seems to be finding those now that are not at the beginning but it is not getting all of them. For example, RHD, 2462-65-92-95; 2562-65-92-95; 2662-65-92-95-97; 5462-65-82-85-92-95-97 should have RHD for all the segments clear to the end of the line but instead it stops at the end of the 24 grouping. Other than that, it seems to work remarkably well and is quite fast. Incidentally, I renamed Numbers to parts_temp as my server can't have upper case table names and I rearranged the output to suit the table it gets dumped into. – DonP Sep 12 '19 at 05:45
The next attempt of my answer is very similar to a nested doll :) I also rewrote the regular expression on line 33. You can get its explanation on [RegEx101](https://regex101.com/r/CGLGtI/1). – Andrei Odegov Sep 12 '19 at 09:10
Yes, much like a nested doll but this one works. Thank you! I'll post a new question for the ALL and BODY/BODIES as those will have to get parsed too somehow. On ALL, it's not so much a parsing as it will be a lookup to another table and I doubt that can be done in this same query. – DonP Sep 12 '19 at 16:39
Can you migrate to MySQL 8.0? It has recursive CTEs and much more powerful support for regular expressions, which makes it easier to solve your problem, in my opinion. – Andrei Odegov Sep 12 '19 at 17:30
Unfortunately I cannot migrate as my server is with a hosting company and my account is what they called "shared" so such wide-spread changes are made only rarely. They only recently upgraded PHP and even then it's still one of the 5.X versions although I believe I now have the option to switch to 7.X if I want. – DonP Sep 12 '19 at 19:11

MySQL Field Data Parsing

2 Answers2

Linked