0

On a digitized version of a 1950s automotive parts book, the model entries were designed to be read by human eyes so to save space they were abbreviated. Of course, this doesn't work with my search engine and I know very little about regular expressions and even less about REGEXP so I hope someone here can assist. I did look through the MySQL manual but I'm afraid there was nothing like this. There are other variations of the data too but I want to start with this bit. I am currently doing the parsing using a PHP script that takes hours to run and, even though it needs to be run only when there is updated data, it takes far too long.

Here is the data in a single row field, separated by semicolons (NOTE: these are on one line in the original but I added returns here to make it more easily readable):

2452-62-65-67-72-92-95-98; 
2552-62-65-72-77-92-95; 
2650-51-52-62-65-72-77-92-95-97; 
5450-51-52-62-65-67-72-77-82-85-92-95-97

... and those that need parsing will always have the dashes but I need it to somehow parse them to look like this. It needs to ignore any text since some have things like ALL 24TH; 25TH; 26TH which post separately:

2452 2462 2465 2467 2472 2492 2495 2498; 
2552 2562 2565 2572 2577 2592 2595; 
2650 2651 2652 2662 2665 2672 2677 2692 2695 2697; 
5450 5451 5452 5462 5465 5467 5472 5477 5482 5485 5492 5495 5497

This simple query will pull up a four digit number but I don't how to proceed.

SELECT
    Models 
FROM
    parts_listing
WHERE
     Models REGEXP '^[0-9]{4}$'
ORDER BY Models;

Here's a shorter example of a basic row, or grouping of models for a given part:

2206-13-26-33; 2302-06-13-32-33

... but the row may also contain other info (that I'll need to eventually parse too) as shown in these examples. For the first and second we need to parse out the ALL and the ordinal and for the third we need to parse out anything between parenthesis. The fourth is an example that begins and ends with ALL values but has some dash-separated values in between.

ALL 24TH; 25TH; 26TH
2262-65-70-71-72-75-76-77-79-80-82-86-92-93-95; ALL 23RD
5401-11-31 (BODIES 5467-77-97)
ALL 22ND; 2301-02-13-32; ALL 24TH; 25TH; 26TH; 54TH

There also some with LHD (or sometimes RHD) and either at the beginning or in some cases in other places in the row and always has a comma after it:

LHD, 2401 (BODIES 2462-65-92-95-98); 2501; 2601-33; 5400-33

It's possible there are other variations too but for now it's only the basic model information that I'm after.

DonP
  • 725
  • 1
  • 8
  • 27

1 Answers1

0

This will work if the row has up to 9 parts.
If there is a case for more parts you can extend the subquery to include more numers than 9:

select  
  group_concat(
    replace(t.part, '-', concat(' ', left(t.part, 2)))
    order by t.partno
    separator ' '                                      
  ) Models
from (
  select t.Models, p.partno,
    replace(replace(
      substring_index(t.Models, ';', p.partno),
      substring_index(t.Models, ';', p.partno - 1),
      ''
    ), ';', '') part 
  from parts_listing t cross join (
    select 1 partno union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9
  ) p 
  where replace(replace(Models, '-', ''), ';', '') regexp'^[0-9]*$'
) t
where t.part <> ''
group by t.Models
order by t.Models 

See the demo.

forpas
  • 160,666
  • 10
  • 38
  • 76
  • That is wonderful and nearly perfect! I didn't want to overly complicate the original question but how do I make it not show ordinal numbers such as ALL 24TH; 25TH; 26TH;, things in parenthesis such as (BODIES 2222-15), or LHD, and RHD, (both have commas after them)? I'll need to parse these too but differently. – DonP Sep 07 '19 at 16:57
  • 1
    The condition: `WHERE Models REGEXP '^[0-9]{4}'` removes any row that does not start with 4 digits. So what is the problem? – forpas Sep 07 '19 at 17:00
  • Some of the rows come out like this: 2631 (BODY 2678); 5431 (BODY 5478) or this: 2401 2402 2406 2413; ALL 25TH; ALL 26TH; ALL 54TH – DonP Sep 07 '19 at 17:11
  • 1
    Try this in the where clause: WHERE replace(replace(Models, '-', ''), ';', '') REGEXP '^[0-9]*$' – forpas Sep 07 '19 at 17:27
  • Yes, that seems to work well! (I had said that there was a syntax error but that was my fault as I copied and pasted the WHERE when it was already there). Also, based on your earlier comment, not all rows begin with 4 digits, in case that matters. There may be some, for example, that begin with LHD, or RHD, or something like ALL 23RD;. This data has so many variations that it's about impossible to find them all for a single question! – DonP Sep 07 '19 at 17:38
  • Yes, my mistake as I had pasted in the WHERE too when it was already there. However, it seems to give only the first group now so anything after ; is gone. – DonP Sep 07 '19 at 17:45
  • 1
    What do you mean first group. You did not mention anything about groups. – forpas Sep 07 '19 at 17:46
  • I meant the sections between parenthesis. It's getting only 2452 2462 2465 2467 2472 2492 2495 2498 in my original example and missing all the rest after the first semicolon. – DonP Sep 07 '19 at 17:50
  • 1
    It seems I missed this part: *these are on one line in the original but I added returns here to make it more easily readable* of your question. I thought that these were different rows in the table. Now for your problem: you need a split string functionality which MySql does not provide. If you can change the design. Delimited strings can only cause problems. Read this: https://stackoverflow.com/questions/1690017/database-design-is-it-bad-to-keep-delimited-strings-in-a-database – forpas Sep 07 '19 at 18:01
  • I cannot change the design as it was OCRed from an original 1954 document and needs to be like it is to maintain the appearance of the original when presented on-screen. However, your original query was pulling up results that were almost exactly what I needed but REGEXP should be able filter out anything between parenthesis or any ordinals. I suppose it won't hurt leaving those things there but it makes the data cumbersome. – DonP Sep 07 '19 at 18:08
  • 1
    My query works for only the 1st part before the 1st semicolon. After that it adds the first 2 digits from the 1st part to all other parts which is not what you need, right? See the demo: https://www.db-fiddle.com/f/h7ZJ6nqRGRQbRMTusHYVcr/2 As I said MySql does not provide a split functionality so that you can split the parts and take the first 2 digits from each part – forpas Sep 07 '19 at 18:13
  • You're right, I hadn't noticed that it was doing that: 2206 2213 2226 2233; 2302 2206 2213 2232 2233 – DonP Sep 07 '19 at 18:31
  • MySQL doesn't have SPLIT but couldn't this be done somehow using SUBSTRING or SUBSTRING_INDEX? I've used these before for splitting data but not sure how to apply it here. Otherwise it would work if each number could be parsed out individually, each in it's own row into another table such as parts_parsed (along with a couple other fields such as ID to relate each to a specific entry) – DonP Sep 08 '19 at 05:53
  • Thank you. The fiddle seems to work perfectly but here it just repeats: 2359 2375 2379 2385 2359 2375 2379 2385. PartNo is actually an existing column in the table (and a value, along with ID, that I would like to pull up too) so could that be the problem? – DonP Sep 08 '19 at 14:29
  • I don't know what would be the problem. You can change the name of `partno` in my query to something else so it does not conflict with the other column's name. You have to adjust this code that works with this sample data to your actual data. – forpas Sep 08 '19 at 14:34
  • I had tried renaming partno to ModelList but no difference. This seems so close so no idea what the difference might be or where to add additional columns to the query. Where would I add additional columns? – DonP Sep 08 '19 at 14:40
  • This code is about the question and the sample data as you posted them. If you have a new requirement for a new query you can post a new question. – forpas Sep 08 '19 at 14:43
  • Will do, thank you but one more comment. One thing I just noticed is that it should be pulling up over 15,000 rows but it is getting only 555. If I add in the other columns that I want, I get 5,521 rows which is still far short. Something is limiting it and I'm not sure what. – DonP Sep 08 '19 at 16:29
  • This code works for the question and sample data you posted. If you have anew requirement post a new question. – forpas Sep 08 '19 at 16:39