Database table with millions of rows

Question

For example, I have some GPS devices that send info to my database every second.

So 1 device creates 1 row per second in a MySQL database with these 8 columns:

id=12341 date=22.02.2018 time=22:40 latitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd

So in 1 minute it creates 60 rows, in 24 hours 86400 rows, and in 1 month (31 days) 2678400 rows.

So 1 device creates about 2.6 million rows per month in my table (records are deleted every month). With more devices it will be 2.6 million * number of devices.

so my questions are like this:

Question 1: If I run a search like this from PHP (just for the current day and for 1 device):

SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'

The maximum possible result is 86400 rows. Will that overload my server too much?

Question 2: If I limit it to 5 hours (18000 rows), will that be a problem for the database? Will it load the server as much as the first example, or less?

  SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000

Question 3: If I show just 1 result from the database, will it overload the server?

 SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1

Does it mean that, with millions of rows in the table, fetching 1000 rows loads the server the same as fetching just 1 result?

Hi, and welcome to Stack Overflow. It would help answer your question if we knew what you're doing with these queries. With your first query, I doubt you want to see all 86,400 seconds of the day. I presume you'll do some processing in PHP. It's possible you can do that processing in MySQL instead, which is generally much more efficient. — Schwern, Jul 13 '18 at 01:33
3 million rows is not that much, and neither is 30 million. It depends on what you are using them for. Are you querying 100k rows once a day, once an hour, once a minute? Please explain your use case. — The Impaler, Jul 13 '18 at 01:42

Schwern · Accepted Answer · 2018-07-13T22:11:16.540

Millions of rows is not a problem; this is what SQL databases are designed to handle, provided you have a well designed schema and good indexes.

Use proper types

Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See indexing below for more about which one to use. This is more compact, allows indexing and faster sorting, and makes date and time functions available without conversions.

Similarly, be sure to use the appropriate numeric type for latitude, and longitude. You'll probably want to use numeric to ensure precision.

Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
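
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes one row per device per second (the rate from the question) and a signed 32-bit auto-increment key; the helper name is made up for illustration. Deleting old rows with a plain delete does not make the counter start over, so the ids keep growing even though the table is pruned monthly.

```python
# Largest value a signed 32-bit integer column can hold.
INT_MAX = 2**31 - 1
# Assumption taken from the question: one row per device per second.
ROWS_PER_DEVICE_PER_DAY = 24 * 60 * 60


def days_until_int_exhausted(num_devices: int) -> int:
    """Whole days until a signed 32-bit id runs out (hypothetical helper)."""
    return INT_MAX // (ROWS_PER_DEVICE_PER_DAY * num_devices)


for devices in (1, 100, 1000):
    print(devices, "device(s):", days_until_int_exhausted(devices), "days")
```

Under these assumptions, 100 devices would use up a regular int in less than a year, which is why bigint is the safer choice here.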

Move repeated data into another table.

Instead of storing information about the device in every row, store that in a separate table. Then only store the device's ID in your log. This will cut down on your storage size and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key; this provides referential integrity and, in MySQL, an index on that column.

Add indexes

Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you use frequently, such as your timestamp.

A lack of indexes on date and deviceID is likely why your queries are slow. Without an index, MySQL has to look at every row in the table, which is known as a full table scan.

You can discover whether your queries are using indexes with explain.
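
As a rough illustration of the idea, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. That is an assumption made so the example runs anywhere; MySQL's explain output looks different. The table layout and the index name device_idx are hypothetical. It asks the database for its plan before and after an index is added.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table gps_logs ("
    " id integer primary key,"
    " gps_device_id integer not null,"
    " created_at text not null,"
    " latitude real not null,"
    " longitude real not null)"
)

QUERY = "select latitude, longitude from gps_logs where gps_device_id = 24"


def query_plan(sql: str) -> str:
    """Return SQLite's plan for a statement as one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    # The last column of each plan row is the human-readable description.
    return "; ".join(row[-1] for row in rows)


plan_without_index = query_plan(QUERY)

# Hypothetical index name, chosen for this example only.
conn.execute("create index device_idx on gps_logs(gps_device_id)")
plan_with_index = query_plan(QUERY)

print("before:", plan_without_index)
print("after: ", plan_with_index)
```

In MySQL the equivalent check is to prefix the query with explain and look at the key column of the result, which names the index the query will use, if any.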

`datetime` or `time` + `date`?

Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.

select *
from gps_logs
where date(created_at) = '2018-07-14'

There's a problem. The problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row. This means a performance killing full table scan.

You can work around this by working with just the datetime column. This will use an index and be efficient.

select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
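
The difference shows up in a query planner. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, so the example is runnable without a server; SQLite's date() plays the role of MySQL's date() here). The index name created_at_idx and the sample rows are made up. Both queries return the same rows, but only the bare-column range is able to use the index.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table gps_logs ("
    " id integer primary key,"
    " gps_device_id integer not null,"
    " created_at text not null,"  # ISO-8601 strings sort chronologically
    " latitude real not null,"
    " longitude real not null)"
)
conn.execute("create index created_at_idx on gps_logs(created_at)")
conn.executemany(
    "insert into gps_logs (gps_device_id, created_at, latitude, longitude)"
    " values (24, ?, 22.2, 78.9)",
    [("2018-07-13 23:59:59",), ("2018-07-14 00:00:00",),
     ("2018-07-14 23:59:59",), ("2018-07-15 00:00:00",)],
)

# Function wrapped around the indexed column.
WRAPPED = "select * from gps_logs where date(created_at) = '2018-07-14'"
# Half-open range on the bare column.
BARE = (
    "select * from gps_logs"
    " where created_at >= '2018-07-14 00:00:00'"
    " and created_at < '2018-07-15 00:00:00'"
)


def query_plan(sql: str) -> str:
    """Return SQLite's plan for a statement as one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)


wrapped_plan = query_plan(WRAPPED)
bare_plan = query_plan(BARE)
wrapped_rows = conn.execute(WRAPPED).fetchall()
bare_rows = conn.execute(BARE).fetchall()

print("function on column:", wrapped_plan)
print("bare column range: ", bare_plan)
print("same rows returned:", sorted(wrapped_rows) == sorted(bare_rows))
```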

Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. For example, you might want a day as seen from a different time zone. It's easy with a single column.

select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'

But it's more involved with a separate date and time.

select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
  or  (created_date = '2018-07-13' and created_time < '10:00:00');

Or you can switch to a database with expression indexes, like PostgreSQL. An expression index lets you index the result of a function or expression rather than the raw column value. And PostgreSQL does a lot of things better than MySQL. This is what I recommend.

Do as much work in SQL as possible.

For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.

select gps_device_id, count(id) as num_entries, created_at::date as day 
from gps_logs
group by gps_device_id, day;

 gps_device_id | num_entries |    day     
---------------+-------------+------------
             1 |       29310 | 2018-07-12
             2 |       23923 | 2018-07-11
             2 |       23988 | 2018-07-12

With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
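
A small runnable sketch of the same idea, using Python's built-in sqlite3 module as a stand-in database (an assumption; the sample rows are invented). SQLite has no `::date` cast, so date(created_at) is used instead. The point is that only the per-device, per-day totals leave the database, not the raw rows.

```python
import sqlite3

# In-memory SQLite database used as a stand-in (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table gps_logs ("
    " id integer primary key,"
    " gps_device_id integer not null,"
    " created_at text not null)"
)

# Invented sample data: device 1 logs twice, device 2 logs across midnight.
sample_rows = [
    (1, "2018-07-12 10:00:00"),
    (1, "2018-07-12 10:00:01"),
    (2, "2018-07-11 23:59:59"),
    (2, "2018-07-12 00:00:00"),
    (2, "2018-07-12 00:00:01"),
]
conn.executemany(
    "insert into gps_logs (gps_device_id, created_at) values (?, ?)",
    sample_rows,
)

# Let the database count entries per device per day.
summary = conn.execute(
    "select gps_device_id, date(created_at) as day, count(id) as num_entries"
    " from gps_logs"
    " group by gps_device_id, date(created_at)"
    " order by gps_device_id, day"
).fetchall()

for device_id, day, num_entries in summary:
    print(device_id, day, num_entries)
```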

Avoid `select *`

If you must retrieve 86400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by only fetching the columns you need. This means using select only, the, specific, columns, you, need rather than select *.

Putting it all together.

In PostgreSQL

Your schema in PostgreSQL should look something like this.

create table gps_devices (
    id serial primary key,
    name text not null

    -- any other columns about the devices
);

create table gps_logs (
    id bigserial primary key,
    gps_device_id int references gps_devices(id),
    created_at timestamp not null default current_timestamp,
    latitude numeric(12,9) not null,
    longitude numeric(12,9) not null
);

create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);

A query can generally only use one index per table. Since you'll often be searching on the timestamp and device ID together, timestamp_and_device covers both columns in a single index.

date_and_device is the same thing, but it's an expression index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.

In MySQL

create table gps_devices (
    id int primary key auto_increment,
    name text not null

    -- any other columns about the devices
);

create table gps_logs (
    id bigint primary key auto_increment,
    -- MySQL parses but ignores an inline "references" clause on a column,
    -- so the foreign key is declared explicitly below.
    gps_device_id int,
    created_at timestamp not null default current_timestamp,
    latitude numeric(12,9) not null,
    longitude numeric(12,9) not null,
    foreign key (gps_device_id) references gps_devices(id)
);

create index timestamp_and_device on gps_logs(created_at, gps_device_id);

Very similar, but no expression index. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types. (MySQL 8.0.13 and later add functional key parts, which may cover this case.)

If he/she only queries a single device at a time, maybe the index columns should be `gps_device_id`, then `created_at`. — The Impaler, Jul 13 '18 at 02:47
@TheImpaler The way I've done it covers all bases. If they only query `where gps_device_id = ?` it will use the foreign key index. If they query just `where created_at = ?` it will use `timestamp_and_device` because `created_at` is first. If they query both like `where created_at = ? and gps_device_id = ?` it will use `timestamp_and_device`. — Schwern, Jul 13 '18 at 02:53
You are absolutely right. I totally forgot MySQL creates indexes for FKs without even asking. — The Impaler, Jul 13 '18 at 02:57

score 1 · Answer 2 · answered Jul 13 '18 at 01:35

I just read your question. My answer is:

Create a separate table for latitude and longitude, make your ID a foreign key in it, and save the coordinates there.

Nafees Sardar


The Impaler · Answer 3 · 2018-07-13T03:04:21.007

Without knowing the exact queries you want to run, I can only guess at the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.

For example, you could use the structure below:

create table device (
  id int primary key not null,
  name varchar(20),
  someinfo varchar(100)
);

create table location (
  device_id int not null,
  recorded_at timestamp not null,
  latitude double not null, -- instead of varchar; maybe float?
  longitude double not null, -- instead of varchar; maybe float?
  foreign key (device_id) references device (id)
);

create index ix_loc_dev on location (device_id, recorded_at);

If you include the exact queries (naming the columns) we can create better indexes for them.

Since your query selectivity is probably bad, your queries may run full table scans. For that case I took it a step further and used the smallest possible data types for the columns, so the scans will be faster:

create table location (
  device_id tinyint not null, -- device.id must also be tinyint, or MySQL rejects the foreign key
  recorded_at timestamp not null,
  latitude float not null,
  longitude float not null,
  foreign key (device_id) references device (id)
);

Can't really think of anything smaller than this.
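
One trade-off worth checking before settling on float: a 4-byte float carries only about 7 significant decimal digits. The sketch below uses Python's struct module to round-trip the longitude from the question through 32-bit storage. It assumes MySQL's float behaves like an IEEE 754 single-precision value, and the helper name is made up for illustration.

```python
import struct


def as_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


longitude = 78.9654582  # sample value from the question
stored = as_float32(longitude)
error_degrees = abs(stored - longitude)
# One degree is at most roughly 111 km on the ground, so this is an upper bound.
error_metres = error_degrees * 111_000

print(f"stored value: {stored!r}")
print(f"error: {error_degrees:.2e} degrees, at most about {error_metres:.2f} m")
```

For coordinates up to 180 degrees the rounding error stays under roughly a metre, which may be acceptable for tracking; if it is not, double or numeric is the safer choice.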

Note that indexing `recorded_at` does not help with a query like `date(recorded_at) = '2017-01-02'`. — Schwern, Jul 13 '18 at 02:32
Yes, expressions should be at the right side of the operator. Maybe just use `between`. — The Impaler, Jul 13 '18 at 02:42
Now I realize the queries won't (most likely) use any index at all. — The Impaler, Jul 13 '18 at 13:40

score 0 · Answer 4 · answered Jul 13 '18 at 03:44

The best I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally and put a little more effort into developing its access methods, or use a specialized database for telematics data like this.