
I have a table of events generated by devices, with this structure:

class Events(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    timestamp_event = db.Column(db.DateTime, nullable=False, index=True)
    device_id = db.Column(db.Integer, db.ForeignKey('devices.id'), nullable=True)

which I have to query joined to:

class Devices(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    dev_name = db.Column(db.String(50)) 

so I can retrieve Device data for every Event.

I'm building a ranking of the top 20 event counts generated by a single device in a single hour. It already works, but as my Events table grows (over 1M rows now) the query gets slower and slower. This is my code. Any ideas on how to optimize the query? Maybe a composite index on device_id + timestamp_event? Would that work even though the query only uses part of the datetime column?

pkd = (
    db.session.query(
        db.func.count(Events.id),
        db.func.date_format(Events.timestamp_event, '%d/%m %H'),
        Devices.dev_name,
    )
    .select_from(Events).join(Devices)
    .filter(Events.timestamp_event >= datetime.now() - timedelta(days=peak_days))
    .group_by(db.func.date_format(Events.timestamp_event, '%Y%M%D%H'))
    .group_by(Events.device_id)
    .order_by(db.func.count(Events.id).desc())
    .limit(20).all()
)
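
For reference, the composite index I have in mind would be declared roughly like this (untested; the index name is just a placeholder):

class Events(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    timestamp_event = db.Column(db.DateTime, nullable=False, index=True)
    device_id = db.Column(db.Integer, db.ForeignKey('devices.id'), nullable=True)

    # composite index over the two columns the query touches
    __table_args__ = (
        db.Index('ix_events_device_ts', 'device_id', 'timestamp_event'),
    )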

Here's sample output of the first 3 rows of the query: number of events, when (DD/MM HH), and which device:

[(2710, '15/01 16', 'Device 002'), 
(2612, '11/01 17', 'Device 033'),
(2133, '13/01 15', 'Device 002'),...]

and here's the SQL generated by SQLAlchemy:

SELECT count(events.id) AS count_1,
       date_format(events.timestamp_event, %(date_format_2)s) AS date_format_1,
       devices.id AS devices_id,
       devices.dev_name AS devices_dev_name
FROM events
INNER JOIN devices ON devices.id = events.device_id
WHERE events.timestamp_event >= %(timestamp_event_1)s
GROUP BY date_format(events.timestamp_event, %(date_format_3)s), events.device_id
ORDER BY count(events.id) DESC
LIMIT %(param_1)s
  • The filter is used to look only in the last "peak_days" days, so old maxes will be discarded. – PiBer2 Jan 28 '22 at 17:54
  • I'd suggest tagging this with the RDBMS that you are using, and replacing the SQLAlchemy stuff with the query that SQLAlchemy generates. This is essentially a database question. You would need to provide table DDL and query plan too. – snakecharmerb Jan 28 '22 at 17:57
  • A question with the [sql] tag is supposed to contain some SQL. – jarlh Jan 28 '22 at 18:10
  • @jarlh I have added SQL. Thanks. – PiBer2 Jan 28 '22 at 19:27
  • @PiBer2 It's hard to tell if the join is having an effect without the query plan, but I think this would be easier if you denormalized the datetime to year, month, day, hour. I don't see how any indexing could be used when you have to compute the GROUP BY on the fly. – Ian Wilson Jan 29 '22 at 23:16
  • Also, how does the query even work with a selected date format different from the grouped-by format? – Ian Wilson Jan 29 '22 at 23:16
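
A minimal sketch of the denormalization Ian Wilson suggests above (the hour_bucket column, its name, and the insert-time truncation are hypothetical, not part of the question's schema):

from datetime import datetime

class Events(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    timestamp_event = db.Column(db.DateTime, nullable=False, index=True)
    # copy of timestamp_event truncated to the hour at insert time,
    # so the GROUP BY can use this indexed column instead of
    # computing date_format() on the fly
    hour_bucket = db.Column(db.DateTime, index=True)
    device_id = db.Column(db.Integer, db.ForeignKey('devices.id'), nullable=True)

# at insert time:
now = datetime.now()
db.session.add(Events(timestamp_event=now,
                      hour_bucket=now.replace(minute=0, second=0, microsecond=0)))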

1 Answer


# This example is for PostgreSQL.
# I'm not sure what db you are using, but the date-formatting
# function differs between databases.
from sqlalchemy import func, select
from sqlalchemy.orm import Session

# Event, Device, engine and get_start_of_peak() are assumed to be
# defined elsewhere.
with Session(engine) as session:
    # Use a subquery to select the top 20 (hour, device_id) pairs
    # by event count since the beginning of the peak.
    hour_fmt = "dd/Mon HH24"
    hour_col = func.to_char(Event.created_on, hour_fmt).label('event_hour')
    event_count_col = func.count(Event.id).label('event_count')
    sub_q = select(
        event_count_col,
        hour_col,
        Event.device_id
    ).filter(
        Event.created_on > get_start_of_peak()
    ).group_by(
        hour_col, Event.device_id
    ).order_by(
        event_count_col.desc()
    ).limit(
        20
    ).subquery()

    # Now join in the devices to the top ids to get the names.
    results = session.execute(
        select(
            sub_q.c.event_count,
            sub_q.c.event_hour,
            Device.name
        ).join_from(
            sub_q,
            Device,
            sub_q.c.device_id == Device.id
        ).order_by(
            sub_q.c.event_count.desc(),
            Device.name
        )
    ).all()
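
The question's date_format() suggests MySQL; if so, the PostgreSQL-specific to_char() calls above would swap out roughly like this (a sketch, untested):

# MySQL equivalent of the hour column above, reusing the question's
# format string
hour_fmt = '%d/%m %H'
hour_col = func.date_format(Event.created_on, hour_fmt).label('event_hour')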
  • Ok! So your approach to optimization is: "first get the peak hour counts, then join them to the devices table"... I guess that way, keeping the devices out of the main query should speed it up. Thanks, I'll give it a try and let you know. – PiBer2 Jan 29 '22 at 23:32
  • You might want to fix the select/group by mismatch because who knows what that is doing. – Ian Wilson Jan 29 '22 at 23:49
  • This approach reduced the query time from 25 secs to 6 secs! The first part returns a subquery (reference: https://stackoverflow.com/questions/38878897/how-to-make-a-subquery-in-sqlalchemy) and then the subquery is joined with the devices table. Thanks a lot! – PiBer2 Jan 30 '22 at 03:05