1

I have two tables:

  1. "Sessions" - it have int key identity, "session_id" - varchar, "device_category" - varchar and some other colums. There are 149239 rows.

  2. Session_events" - it have int key identity, "session_id" - uniqueidentifier and some other fields. There are 3140768 rows there.

This tables has been imported from not relational database - Cassandra, so I not created any connections in MS SQL Server designer. But real connection between Sessions and Session_events on column session_id is Many-To-Many

Now I want to delete all web-sessions that was not take place on Personal Computer "device_category". So I run request Delete * FROM sessions where device_category != "PC" that was fast. Now I want to to delete all not PC sessions from Session_events table. So I run request

Delete FROM session_events where session_id Not In (SELECT distinct session_id FROM sessions)

That request is currently running for more then 24 hour and I don't know how long it can take...

(I have 16 GB ram and Intel Xenon).

I know that Left Join can be faster but 20% is not interesting. Do you see the way to finish my task much faster?

----
CREATE TABLE [dbo].[session_events](
    [key] [bigint] IDENTITY(1,1) NOT NULL,
    [session_id] [uniqueidentifier](max) NULL,
    [visitor_id] [uniqueidentifier] NULL,
    [shipping_method] [varchar](max) NULL,
    [shipping_price] [varchar](max) NULL,
    [site_id] [int] NULL,
    [stream_data_chunk] [varbinary](max) NULL,
    [total] [varchar](max) NULL,
    [total_inc_tax] [varchar](max) NULL,
    [tracker_ver] [varchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

CREATE TABLE [dbo].[visitor_sessions](
    [key] [int] IDENTITY(1,1) NOT NULL,
    [visitor_id] [varchar](max) NULL,
    [created] [varchar](max) NULL,
    [session_id] [varchar](max) NULL
)

 CONSTRAINT [PK_visitor_sessions4] PRIMARY KEY CLUSTERED 
(
    [key] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO
Brans Ds
  • 4,039
  • 35
  • 64
  • Do you have indexes in the tables? SHow us the `CREATE TABLE` scripts and the indexes on both tables. – ypercubeᵀᴹ Apr 14 '15 at 07:55
  • I don't have create table script.. I created in each table two bigint autoincrimented Identity column wich are Indexable – Brans Ds Apr 14 '15 at 07:59
  • You can find the CREATE TABLE with SSMS (I suppose you can connect to the database with SSMS?) – ypercubeᵀᴹ Apr 14 '15 at 08:00
  • 1
    I see no `session_id` in table `session_events`. Are you sure this is the actual query you run? – ypercubeᵀᴹ Apr 14 '15 at 08:08
  • I removed some colums there there are a lot of colums.. I have updated and add session_id – Brans Ds Apr 14 '15 at 08:12
  • One remark: Your session_id column in session is `NULL`able. If you indeed have NULLs in your session table, then using `NOT IN` in your `DELETE` is *not* going to work as expected. See e.g. http://stackoverflow.com/questions/129077/not-in-clause-and-null-values – Frank Schmitt Apr 14 '15 at 08:17
  • Well, the reason for the slowness is first, (and most probably) the lack of indexes on the 2 `session_id` columns and/or second, another query is blocking your query. Investigate first if it's a blocking issue (unlikely I think considering it's running for 24 hours) – ypercubeᵀᴹ Apr 14 '15 at 08:17
  • Then, you'll have 2 choices. Either kill the query, wait for the rollback to finish (which may take another 24 hours!) and then add the 2 indexes and retry. Or wait for the query to finish. – ypercubeᵀᴹ Apr 14 '15 at 08:18
  • Oh, I also see that the 2 columns are of different type! One is `varchar` and the other is `uniqueidentifier` ... You'll have to fix that, too. – ypercubeᵀᴹ Apr 14 '15 at 08:23

5 Answers5

4

Deleting a huge bunch of data at once means that the DB engine has to do all of that in a single transaction. This means a huge overhead when you don't actually need that (e.g. you don't need to rollback the whole operation, or you don't care about consistency - you just want to delete everything, if it fails in the middle, you'll just run the query again to delete the rest).

For your case, you could try deleting in batches. For example:

delete top 1000 from session_events where session_id Not In (SELECT distinct session_id FROM sessions)

Repeat until the table is empty.

Also, you have started from the wrong point. You might have been better off creating a foreign key between the two first, and using "on delete cascade". That would automatically delete all the session_events that no longer have a valid session. If you can start over, it might be significantly faster. No promises, though :D

Luaan
  • 62,244
  • 7
  • 97
  • 116
0

Why not use a left join? other alternative is to use EXISTS instead of IN:

DELETE FROM Session_events
WHERE NOT EXISTS(
    SELECT 1
    FROM Session 
    WHERE Session.Session_Id = Session_events.Session_Id
)
Zohar Peled
  • 79,642
  • 10
  • 69
  • 121
  • As I understand that can potentialy give me up to 30% of performance gain, this isn't essential, the task all the same will be lasts in the days.. or weeks.. isn't it? – Brans Ds Apr 14 '15 at 07:47
  • @BransDs I see. Luaan had a better answer then me anyway (and got my vote up). after reading it I remember having kind of the same problem a few years ago. deleting in chunks of 10,000 records was extremely faster. – Zohar Peled Apr 14 '15 at 08:24
0

Sometimes the problem by delete is it is waiting to acquire the lock for all relevant rows. Try to delete in a loop.

DECLARE @MyCursor CURSOR;
DECLARE @MyField YourFieldDataType;//replace with the data type of session_id
BEGIN
    SET @MyCursor = CURSOR FOR
    select session_id from session_events minus select session_id from sessions
    OPEN @MyCursor 
    FETCH NEXT FROM @MyCursor 
    INTO @MyField
    WHILE @@FETCH_STATUS = 0
    BEGIN
        delete session_events where session_id = @MyField
        FETCH NEXT FROM @MyCursor 
        INTO @MyField 
    END;
    CLOSE @MyCursor ;
    DEALLOCATE @MyCursor;
END;

you can also try to rewrite not in to an in:

delete from session_events where session_id in (select session_id from session_events minus select session_id from sessions)
Pavel Gatnar
  • 3,987
  • 2
  • 19
  • 29
  • The OP's engine is MS SQL. Which one is this? The syntax is pretty nifty :)) – Luaan Apr 14 '15 at 07:55
  • @Luaan rewritten to T-SQL (pretty complicated compared to PL/SQL :-) – Pavel Gatnar Apr 14 '15 at 08:07
  • Yeah, cursors are very mouthy in T-SQL. Of course, you're not supposed to use them all that much, but... :D PL/SQL really leaks how the guys had no idea how to model relational queries in so many places; it's convenient in some cases, and outright mind-boggling in others :D – Luaan Apr 14 '15 at 08:10
  • @downvoter thanx for downvoting, but what is wrong on the idea to delete step by step??? – Pavel Gatnar Apr 14 '15 at 08:20
-1
  1. check if there is any index in session_event? if there is any, then disable it
  2. Use NOT EXISTS instead of NOT IN, as EXISTS has better performance than others( as @Zohar Peled write it's query)
  3. If not resolved, then run your select query separately and take a look at execution plan to see what is going to be done when you are executing Select.
Vahid Farahmandian
  • 6,081
  • 7
  • 42
  • 62
-2

try this code

delete e
from session_events e 
left join sessions s (nolock)
    on e.session_id = s.session_id
where s.session_id is null
kutsoff
  • 325
  • 2
  • 7
  • 1
    Downvote reason: You really should read the post and existing answer before writing one yourself. in this case, the OP specifically wrote he is aware of the left join option but not interested in it. – Zohar Peled Apr 14 '15 at 08:19
  • okay, thanks @ZoharPeled. author's query "Delete FROM session_events where session_id Not In (SELECT distinct session_id FROM sessions)" looks bad for DB engine optimization – kutsoff Apr 14 '15 at 08:22
  • You are correct, and as you can see, my suggestion was not much different then yours. however, Luaan's answer is better then both of our answers – Zohar Peled Apr 14 '15 at 08:27
  • Author, does indexes exists on session_id fields? – kutsoff Apr 14 '15 at 08:27
  • @Kutsoff no, the OP has edited the question. The `session_id` columns have no indexes at all. – ypercubeᵀᴹ Apr 14 '15 at 08:32