0

I'm importing data where groups of rows need to be given an id but there is nothing unique and common to them in the incoming data. What there is is a known indicator of the first row of a group and that the data is in order so we can step through row by row setting an id and then increment that id whenever this indicator is found. I've done this however it's incredibly slow, so is there a better way to do this in mysql or am i better off perhaps pre-processing the text data going line by line to add the id.

Example of data coming in, I need to increment an id whenever we see "NEW"

id,linetype,number,text
1,NEW,1234,sometext
2,CONTINUE,2412,anytext
3,CONTINUE,1,hello
4,NEW,2333,bla bla
5,CONTINUE,333,hello
6,NEW,1234,anything

So i'll end up with

id,linetype,number,text,group_id
1,NEW,1234,sometext,1
2,CONTINUE,2412,anytext,1
3,CONTINUE,1,hello,1
4,NEW,2333,bla bla,2
5,CONTINUE,333,hello,2
6,NEW,1234,anything,3

I've tried a stored procedure where i go row by row updating as i go, but it's super slow.

select count(*) from mytable into n;
set i=1;
while i<=n do
  select linetype into l_linetype from mytable where id = i;
  if l_linetype = "NEW" then
    set l_id = l_id + 1;
  end if;
  update mytable set group_id = l_id where id = i;
end while;

No errors, it's just something that i could go line by line reading and writing the text file and do in a second while in mysql it's taking 100 seconds, it'd be nice if there was a way within mysql to do this reasonably fast so separate pre-processing was not needed.

kensands
  • 11
  • 3
  • What is your MySQL server version ? Please run `SELECT Version();` query and let us know the result of the same. – Madhur Bhaiya Oct 05 '19 at 12:10
  • Not writing this as answer, because I would need to write a long disclaimer. But here is something that might work for you: https://www.db-fiddle.com/f/f7aXfRkvhbjcg9Jbp4X1e4/0 – Paul Spiegel Oct 05 '19 at 13:11

2 Answers2

0

In absence of MySQL 8+ (non availability of Windowing functions), you can use a Correlated Subquery instead:

EDIT: As pointed out by @Paul in comments,

SELECT t1.*, 
       (SELECT COUNT(*) 
        FROM your_table t2
        WHERE t2.id <= t1.id 
          AND t2.linetype = 'NEW'
       ) group_id
FROM your_table t1

Above query can be more performant, if we define the following composite index (linetype, id). The order of columns is important, because we have a Range condition on id.

Previously:

SELECT t1.*, 
       (SELECT SUM(t2.linetype = 'NEW') 
        FROM your_table t2
        WHERE t2.id <= t1.id
       ) group_id
FROM your_table t1

Above query requires indexing on id.


Another approach using User-defined Variables (Session variables) would be:

SELECT 
  t1.*, 
  @g := IF(t1.linetype = 'NEW', @g + 1, @g) AS group_id
FROM your_table t1 
CROSS JOIN (SELECT @g := 0) vars 
ORDER BY t1.id

It is like a looping technique, where we use Session Variables whose previous value is accessible during next row's calculation during SELECT. So, we initialize the variable @g to 0, and then compute it row by row. If we can encounter a row with NEW linetype, we increment it, else use the previous row's value. You can also check https://stackoverflow.com/a/53465139/2469308 for more discussion and caveats to take care of while using this approach.

Madhur Bhaiya
  • 28,155
  • 10
  • 49
  • 57
  • Why not `SELECT COUNT(*) FROM your_table t2 WHERE t2.id <= t1.id AND t2.linetype = 'NEW'` in the first query? – Paul Spiegel Oct 05 '19 at 12:53
  • I tried the first and while it works and is neat, it's even slower than the stored procedure and I would assume it gets exponentially slower as more rows are added. I'm testing at the moment with around 10'000 rows from a text file of about 2.5MB. my program that reads line by line, analyses the line and writes out a new text file with the extra field does it in about 0.2 seconds, in mysql it was 37 seconds. – kensands Oct 05 '19 at 13:01
  • @PaulSpiegel your option also works; Infact, I will edit and add your consideration, because it can take advantage of composite index on `linetype, id` – Madhur Bhaiya Oct 05 '19 at 13:01
  • @kensands please check the edited answer; you will also need to define the composite index. But my second query will always be faster than the first query because of no subqueries. – Madhur Bhaiya Oct 05 '19 at 13:05
  • The second one does the job great, any chance you could explain it a little? – kensands Oct 05 '19 at 13:05
  • @kensands Window functions were introduced in MySQL 8.0.2 patch only. Read: https://stackoverflow.com/q/206062/2469308 MySQL has been incrementally adding a lot of mega features in different patches of MySQL8. Keep upgrading it to get the new goodies :-) – Madhur Bhaiya Oct 05 '19 at 13:56
0

For MySql 8.0+ you can use SUM() window function:

select *,
  sum(linetype = 'NEW') over (order by id) group_id
from tablename  

See the demo.
For previous versions you can simulate this functionality with the use of a variable:

set @group_id := 0;
select *,
  @group_id := @group_id + (linetype = 'NEW') group_id
from tablename 
order by id

See the demo.

Results:

| id  | linetype | number | text     | group_id |
| --- | -------- | ------ | -------- | -------- |
| 1   | NEW      | 1234   | sometext | 1        |
| 2   | CONTINUE | 2412   | anytext  | 1        |
| 3   | CONTINUE | 1      | hello    | 1        |
| 4   | NEW      | 2333   | bla bla  | 2        |
| 5   | CONTINUE | 333    | hello    | 2        |
| 6   | NEW      | 1234   | anything | 3        |
forpas
  • 160,666
  • 10
  • 38
  • 76
  • I tried this, my mysql is 8.0.0 and it didn't like the code, I wonder if it was a bit broken in early v8. – kensands Oct 05 '19 at 13:20
  • the second one though as in the answer above works great – kensands Oct 05 '19 at 13:21
  • You can check the fiddle for the 1st query that it works. If your version is 8.0+ you should use window functions. – forpas Oct 05 '19 at 13:23
  • yeah I saw that, I can see the code is good, but it fails in on my 8.0.0 I'm going to try updating to the current release of 8 to see if it is indeed an issue on my version. – kensands Oct 05 '19 at 13:26
  • okay after the painful update to 8.0.17 (had to uninstall everything and start from scratch!) I can confirm as expected this works great too. so in-case anyone comes across this it doesn't work on early 8.0 versions. – kensands Oct 05 '19 at 17:00