
I'm creating an app like Tinder, where a user can swipe right (like) or swipe left (dislike) other users. The problem is about storing users' operations. A table is needed for the operations, like below:

| Person 1 | Person 2 | Op      |
| -------- | -------- | ------- |
| 000001   | 000007   | Dislike |
| 000001   | 000011   | Like    |
| 000001   | 000053   | Dislike |
| 000001   | 000173   | Dislike |

It stores the operations and is also used to avoid showing the same user to someone more than once. It's OK so far.

But the problem is: if just 1,000 users each swipe another 1,000 users, that table will have 1M rows. And if 100,000 users do that... it goes to 100M rows! Which is very large.

Do you guys have any idea for a structure design that won't grow so large?

Thank you.

Arnaud Peralta
Robert junior
    Also, modern database systems are designed to handle millions or even billions of records just fine when indexes can be used. If you really must, you can also deploy partitioning on top of that. – Raymond Nijland Mar 11 '19 at 14:27
  • 1
    Storing 100M rows in a (thin) table like this is not a problem. Retrieving summarized data from it might be a problem depending on what and how you want to select. In that case you might need a trigger-managed staging table, which will hold the current state of the relations (one row per relation). – Paul Spiegel Mar 11 '19 at 14:30
  • Yes, but it's not just about the database. Even if MySQL can handle millions of rows, it needs large amounts of RAM and CPU. You know I'm not a Silicon Valley company, right? ;) I was thinking about a solution... We show a queue of users to a user. What if we set a pointer that records that the user swiped until user number 37 (for example) in the queue, and we just store likes? And then, if a liked user dislikes the first user, we remove that row. What do you think? – Robert junior Mar 11 '19 at 14:37
  • *"Yes, but it's not just about the database. Even if MySQL can handle millions of rows, it needs large amounts of RAM and CPU."* What's the table structure? `SHOW CREATE TABLE table` Maybe you should normalize the column(s) to hold `INT` or `BIGINT` instead of `VARCHAR` to save RAM/disk space. Also, databases are most likely more disk-I/O-bound than CPU-bound. – Raymond Nijland Mar 11 '19 at 14:40
  • Surely they are user IDs, which are INT. I'm saying that this structure can grow very fast with just 100,000 users (which is low); imagine 500,000 users or so... Yes, we can do things to reduce the volume as much as we can, but none of these solutions solve the main problem. I am searching for other structure ideas. – Robert junior Mar 11 '19 at 14:47
  • 1
    Do what @ArnaudPeralta suggested: let the user choose their sexual orientation, and add more filters like age, country, and state when making an account. Then use those filters to only show the user the other accounts which match them. – Raymond Nijland Mar 11 '19 at 14:47
  • *"Surely they are user IDs, which are INT. I'm saying that this structure can grow very fast with just 100,000 users (which is low); imagine 500,000 users or so..."* 500,000 INTs is about 2 MB of RAM and disk usage, so that's like nothing nowadays... Also, if you use country as a filter and a required setting for a user account, you can deploy list/range partitioning to split a big table into smaller tables by country range. So it's a win/win. – Raymond Nijland Mar 11 '19 at 14:53
  • I'm not quite getting the point. If you need that data, you have no choice but to save it. The schema you show can hardly be simplified. If you now try to save storage space, then you are developing a compression/deduplication algorithm, and I don't think you want to do that. At least it's too broad for SO. 100M rows in that format is something like 3-10 GB (depending on indexes and storage engine), and that doesn't need to be completely in RAM. I can get a dedicated server with 32 GB RAM for about $30/month from my provider; on a cloud service you might pay even less. IMHO not that much. – Paul Spiegel Mar 11 '19 at 15:28

3 Answers


There are a couple of things to consider.

Firstly, the size of a table is not hugely interesting unless you know the types of query you need to run. As others have said, tables with hundreds of millions of rows are nothing to be afraid of, and if you're querying on indexable fields, you can probably scale to billions of rows without exotic solutions just by buying bigger and better hardware. So, consider a solution where 90% of your queries look like:

select * from users where user_id not in (select interacted_user_id from interactions where interacting_user_id = $current_user) limit 10

My guess is this will scale to hundreds of millions of rows on your laptop, and billions of rows on a decent server. My strong recommendation is to use a simple, relational solution without partitioning or other exotic solutions until you've scaled to the point where that no longer works, and you have tuned your queries and upgraded your hardware as far as possible. This is much cheaper/easier than any other solution.
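As a quick illustration of this plain relational approach (a sketch only; SQLite stands in for MySQL here, and the table and column names are invented to match the query above), a single composite primary key on `(interacting_user_id, interacted_user_id)` serves the exclusion query directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: integer user ids keep the table thin.
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY)")
cur.execute("""CREATE TABLE interactions (
    interacting_user_id INTEGER,
    interacted_user_id  INTEGER,
    op TEXT,
    PRIMARY KEY (interacting_user_id, interacted_user_id))""")

cur.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 8)])
cur.executemany("INSERT INTO interactions VALUES (?, ?, ?)",
                [(1, 2, "Like"), (1, 3, "Dislike"), (1, 5, "Like")])

# Fetch candidates that user 1 has not interacted with yet.
cur.execute("""SELECT user_id FROM users
               WHERE user_id != 1
                 AND user_id NOT IN (SELECT interacted_user_id
                                     FROM interactions
                                     WHERE interacting_user_id = 1)
               LIMIT 10""")
candidates = [row[0] for row in cur.fetchall()]
print(candidates)  # → [4, 6, 7]
```

The subquery is an index-only lookup on the composite key, which is why this stays fast even as the interactions table grows.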

The bigger challenge will be the geospatial aspect - presumably, you want to order your results based on distance from the current user.

One way in which you might partition your data is to gather "interactions" by region. This requires some thought - you probably don't want "hard" boundaries, but rather have overlapping geographies. Each spot on the map might have a few overlapping "regions", each with its own table. The more users you have in a region, the smaller you make the overlapping circles - Manhattan might have 3 regions, Greenland might have just 1. Your query then looks at the tables for each overlapping region, and unions the users who haven't previously interacted with the current user.
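A rough sketch of that per-region idea (again using SQLite as a stand-in, with made-up region names and ids): each region gets its own interactions and users tables, and the candidate query unions over every region that overlaps the current user's location:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table pair per (hypothetical) overlapping region.
for region in ("region_a", "region_b"):
    cur.execute(f"CREATE TABLE users_{region} (user_id INTEGER PRIMARY KEY)")
    cur.execute(f"""CREATE TABLE interactions_{region} (
        interacting_user_id INTEGER,
        interacted_user_id  INTEGER,
        op TEXT)""")

cur.executemany("INSERT INTO users_region_a VALUES (?)", [(2,), (3,)])
cur.executemany("INSERT INTO users_region_b VALUES (?)", [(4,), (5,)])
cur.execute("INSERT INTO interactions_region_a VALUES (1, 2, 'Dislike')")

# User 1's location overlaps both regions: union the unseen users.
cur.execute("""
    SELECT user_id FROM users_region_a
    WHERE user_id NOT IN (SELECT interacted_user_id
                          FROM interactions_region_a
                          WHERE interacting_user_id = 1)
    UNION
    SELECT user_id FROM users_region_b
    WHERE user_id NOT IN (SELECT interacted_user_id
                          FROM interactions_region_b
                          WHERE interacting_user_id = 1)
""")
candidates = sorted(row[0] for row in cur.fetchall())
print(candidates)  # → [3, 4, 5]
```

Each per-region table stays small, at the cost of fanning the query out over however many regions overlap the user's position.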

Neville Kuyt

If person 1 disliked person 2, there is no need to show person 1 to person 2. Even if you show them, they will never match. Therefore your calculation of 1K x 1K = 1M is a bit overestimated.

However, if you still want to have sets of likes/dislikes for both users, you might consider this terrible idea of "compressing" rows.

Imagine you have a sequence like this:

| Person 1 | Person 2 |  Op       |
| -------- | -------- | --------- |
| 0001     | 1010     |  Dislike  |
| 0001     | 1011     |  Dislike  |
| 0001     | 1012     |  Dislike  |
| 0001     | 1013     |  Dislike  |
| 0001     | 1015     |  Like     |
| 0001     | 1017     |  Dislike  |
| 0001     | 1018     |  Dislike  |
| 0001     | 1019     |  Dislike  |
| 0001     | 1021     |  Like     |

If you have ids following each other, you might store them as:

| Person 1 | Person 2 |  Op       | N    |
| -------- | -------- | --------- | ---- |
| 0001     | 1010     |  Dislike  | 3    |
| 0001     | 1015     |  Like     | 0    |
| 0001     | 1017     |  Dislike  | 2    |
| 0001     | 1021     |  Like     | 0    |

where N is the offset to the max id in the sequence (e.g. 1010 + 3 = 1013). If you define N as an unsigned TINYINT, then N can be at most 255; that means, in theory, up to 256 sequential dislikes/likes can be saved as one record.

And the query would be something like this (imagine you are looking for id 1013):

SELECT a.* 
FROM (
    SELECT *
    FROM `table`
    WHERE person_1 = 0001
      AND person_2 >= (1013 - 255) -- 255 is the max value of N in a sequence
      AND person_2 <= 1013
) a
WHERE a.person_2 <= 1013 AND a.person_2 + N >= 1013

The sub-select limits the range of possible records, then the outer select matches the record if it exists. In this case, it will be:

| Person 1 | Person 2 |  Op       | N    |
| -------- | -------- | --------- | ---- |
| 0001     | 1010     |  Dislike  | 3    |
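
As a small sketch of this run-length idea in Python (the function names `compress` and `lookup` are invented for illustration), consecutive ids with the same op collapse into `(start, op, N)` rows, and a lookup uses the same check as the query above:

```python
def compress(rows):
    """Collapse consecutive (person_2, op) rows into (start, op, n) runs."""
    runs = []
    for person_2, op in rows:
        if runs and runs[-1][1] == op and runs[-1][0] + runs[-1][2] + 1 == person_2:
            start, _, n = runs[-1]
            runs[-1] = (start, op, n + 1)  # extend the current run
        else:
            runs.append((person_2, op, 0))  # start a new run
    return runs

def lookup(runs, person_2):
    """Return the op recorded for person_2, or None if never swiped."""
    for start, op, n in runs:
        if start <= person_2 <= start + n:  # same as: person_2 + N >= id
            return op
    return None

rows = [(1010, "Dislike"), (1011, "Dislike"), (1012, "Dislike"),
        (1013, "Dislike"), (1015, "Like"), (1017, "Dislike"),
        (1018, "Dislike"), (1019, "Dislike"), (1021, "Like")]
runs = compress(rows)
print(runs)                # 4 runs instead of 9 rows
print(lookup(runs, 1013))  # → Dislike (inside the 1010..1013 run)
print(lookup(runs, 1014))  # → None (never swiped)
```

Note the trade-off: every new swipe may split or extend a run, so writes become read-modify-write operations instead of plain inserts.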

But, personally, I would prefer your current solution to this one because of its simplicity.

Or, as another variant, you might compress the table this way:

| Person 1 | Person 2 | Max Person 2 |  Op       |
| -------- | -------- | ------------ | --------- |
| 0001     | 1010     | 1013         |  Dislike  |
| 0001     | 1015     | 1015         |  Like     |
| 0001     | 1017     | 1019         |  Dislike  |
| 0001     | 1021     | 1021         |  Like     |

You will never have 1M rows, because if you are doing a Tinder-like app, you can re-match people. So I suggest you add a date column to know when you can delete a row, and a stored procedure which you can execute to clean up the expired relations.

With this column, rows will not stack up and you will never have millions of rows.

Also, you don't need to keep the row once two people have liked each other.

EDIT: and why not use CHECKSUM() on both columns to store a hash for each relation? It would be lighter.

EDIT 2: and don't forget this is a dating app. People don't match with everyone, because they have a sexual orientation.

Arnaud Peralta
  • An operation never expires, because if user A likes or dislikes user B, user B must not be shown to user A again. That table is also used for this purpose. – Robert junior Mar 11 '19 at 14:12
  • So for me the best thing you can do is to be careful when you UPDATE your table, so as not to insert `A | B | dislike` when there is already a `B | A | dislike`. And to delete a row if the second person puts a like when the first did too. – Arnaud Peralta Mar 11 '19 at 14:14
  • *"so as not to insert A | B | dislike when there is already a B | A | dislike"* The topic starter's table is a source/destination table. If person A dislikes person B, why should it not be possible that person B also dislikes person A? Or likes, for that matter. – Raymond Nijland Mar 11 '19 at 14:18
  • If someone dislikes a person, why should that person still get this person as a suggestion? It's useless. So it saves one row anyway. – Arnaud Peralta Mar 11 '19 at 14:18
  • Because, just as in real life, one person can dislike a person, but the disliked person can still like the person that disliked him/her. That seems to be the topic starter's point too. – Raymond Nijland Mar 11 '19 at 14:20
  • We need confirmation on this point; he didn't really say that explicitly. But anyway, he will never have 1M rows, because of sexual orientation. – Arnaud Peralta Mar 11 '19 at 14:21
  • *"We need confirmation on this point; he didn't really say that explicitly."* True, and you are also right that adding a sexual orientation filter or selection would reduce the number greatly. *"he will never have 1M rows because of sexual orientation."* Well, there are billions of people; never say never. – Raymond Nijland Mar 11 '19 at 14:22
  • Guys, we are not talking about specific conditions. I understand what you say, but in a database design you should consider the worst case, which here is most users liking most other users. – Robert junior Mar 11 '19 at 14:25
  • So, I don't think there is a better database structure for your app. But why not try to hash a relation and store the hash in one column? With CHECKSUM(), for example. – Arnaud Peralta Mar 11 '19 at 14:27
  • @Robertjunior see my comment under your question if you still have doubts. – Raymond Nijland Mar 11 '19 at 14:29
  • `CHECKSUM()` will not work on MySQL; `CHECKSUM()` is T-SQL only, which is SQL Server (MSSQL). – Raymond Nijland Mar 11 '19 at 14:29
  • Hashing makes the database non-normalized, and you can't query on it. – Robert junior Mar 11 '19 at 14:29
  • So, keep your table like this. – Arnaud Peralta Mar 11 '19 at 14:31