2

I am working with the Yelp data-set available online. I've been trying to optimize my query for days. For the schema I'll list below, I need to construct a query to provide the following:

  • Given a user's UID, display the most recent review information for each of the user's friends.

Here's the schema:

CREATE TABLE business(
    bid varchar(40) PRIMARY KEY,
    name varchar(100),
    city varchar(40),
    state char(2),
    zip varchar(10),
    latitude real,
    longitude real,
    address varchar(100),
    numreviews INTEGER DEFAULT 0,
    numcheckins INTEGER DEFAULT 0,
    avgreview float DEFAULT 0,
    isopen bool,
    stars float
);

CREATE TABLE users(
    uid varchar(40) PRIMARY KEY,
    name varchar(40),
    avgstars float,
    fans INTEGER,
    coolvotes INTEGER,
    reviewcount INTEGER,
    funnyvotes INTEGER,
    signup varchar(20),
    usefulvotes INTEGER,
    latitude real,
    longitude real
);

CREATE TABLE reviews(
    rid varchar(40) PRIMARY KEY,
    bid varchar(40),
    uid varchar(40),
    stars float,
    date varchar(20),
    funny INTEGER,
    useful INTEGER, 
    cool INTEGER,
    text varchar(1024),
    FOREIGN KEY (uid) REFERENCES users(uid),
    FOREIGN KEY (bid) REFERENCES business(bid)
);

CREATE TABLE friends(
    uid varchar(40) REFERENCES users(uid),
    fid varchar(40) REFERENCES users(uid)
);

Here's an example of the desired output: enter image description here

For each of the user's friends, I display the following:

  • The friend's name
  • The name of the business from their most recent review
  • The city of the business from their most recent review
  • The text from their most recent review

Currently this is the only "solution" I've had success with.

Step 1: Get a list of all of the IDs for each of the user's friends.

SELECT fid from friends where uid = '{userId}'

This returns a list of all of the user ID's for each of the user's friends. So I basically have a friend ID list.

Step 2: With this information, I run a foreach loop in my program over that list. For each iteration of the friend ID list, I execute the below query and provide the temporary friend ID for the current iteration of the loop:

SELECT U.name, B.name, B.city, R.text, R.date FROM reviews as R, users as U, business as B
WHERE U.uid = '{currentFriendId}'
AND R.uid = '{currentFriendId}'
AND B.bid = R.bid
AND date = (SELECT MAX(date) FROM reviews WHERE uid = '{currentFriendId}')

For EACH time I run this for loop, I get a single line of output for what I desire, such as this:

enter image description here

This is great...except I have to run this query for every single one of the user's friends. This is extremely costly.

Goal: I'm trying to combine these 2 queries, or revamp them completely, to generate all of the rows at once in a single query.

Question: Given the information provided, how can I fix my queries to generate all of this information from a single query?

Birdman
  • 1,404
  • 5
  • 22
  • 49

5 Answers5

1

It looks like a top-n-per-group problem.

One way to do it is to use a lateral join.

Make sure you have an index on reviews table on (uid, date). A composite index. One index on two columns in this order.

Something like this:

CREATE INDEX IX_uid_date ON reviews (uid, date);

Query

SELECT
     t.UserName
    ,t.BusinessName
    ,t.city
    ,t.text
    ,t.date
FROM
    friends
    INNER JOIN LATERAL
    (
        SELECT
            users.name AS UserName
            ,business.name AS BusinessName
            ,business.city
            ,reviews.text
            ,reviews.date
        FROM
            reviews
            INNER JOIN users ON users.uid = reviews.uid
            INNER JOIN business ON business.bid = reviews.bid
        WHERE
            reviews.uid = friends.fid
        ORDER BY reviews.date DESC
        LIMIT 1
    ) AS t ON true
WHERE
    friends.uid = '{userId}'
;
Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90
  • Absolutely tragically, after trying to figure this out for countless hours, I figured it out....literally about 5 minutes before your post. I checked your results and they are valid though, so I will award you the bounty. – Birdman May 16 '19 at 08:29
  • 1
    @Birdman, congratulations! There is nothing tragic when you learn something on your own. It always takes some time when you encounter some problem for the first time, but you tend to understand and remember the solution better when you solve it yourself. I'd recommend you to leave the bounty open for the maximum duration, because you may get other, better answers. – Vladimir Baranov May 16 '19 at 08:38
0

This should work fine.

SELECT name FROM employees as E
WHERE E.uid IN (SELECT uid FROM employees WHERE name = 'John')

You do not need to do an equal comparison as in single truth value match queries.

Manu mathew
  • 859
  • 8
  • 25
  • Thank you, but I feel my example was too simple. Can you please look at my edited question? I added my actual code and problem. – Birdman May 14 '19 at 05:38
0

Following up on Manos' answer, not sure I understand why you need to limit each fid at all

SELECT U.name, B.name, B.city, R.text, R.date 
    FROM business AS B 
    INNER JOIN reviews AS R ON B.bid = R.bid 
    INNER JOIN users AS U ON R.uid = U.uid 
    WHERE (R.date = (SELECT MAX(X.date) FROM reviews AS X WHERE X.uid = R.uid)) 
    AND (R.uid IN (SELECT fid FROM friends));

If your issue is that your query only results in one row, you should remove that where uid = to get results for all uid's.

Eli
  • 94
  • 11
0

I have also arrived at an answer the same time roughly as Vladimir Baranov, but I will post my version as well. I don't promise it to be pretty:

SELECT R.name as user_name, B.name as business_name, B.City, R.text
FROM (SELECT bid, name, text 
     FROM (SELECT R.rid, R.bid, R.uid, R.text, max_date
        FROM reviews as R INNER JOIN 
           (SELECT uid, MAX(date) as max_date FROM reviews WHERE uid IN (SELECT fid from friends where uid = 'BfcNxKpnF9z5wJLXY7elRg') GROUP BY uid) sub
            ON R.uid = sub.uid AND R.date = sub.max_date) as review_info
     INNER JOIN users
     on review_info.uid = users.uid) as R
INNER JOIN business as B
ON R.bid = B.bid
Birdman
  • 1,404
  • 5
  • 22
  • 49
  • There is often several ways to write an SQL query that produces the same result. I would recommend you to try all proposed queries, check which of them produce correct result and compare their performance on your real data and hardware. Different queries can have a drastically different performance, even when they produce the same result. – Vladimir Baranov May 16 '19 at 10:36
-1

After examining the schema you posted, I used MySQL to create the database and populate the tables with the following sample data:

INSERT INTO users (uid, name) VALUES
('user1', 'user1 name'),
('user2', 'user2 name'),
('user3', 'user3 name'),
('user4', 'user4 name'),
('user5', 'user5 name');

INSERT INTO friends (uid, fid) VALUES
('user1', 'user2'), ('user1', 'user3'),
('user2', 'user4'), ('user2', 'user5');

INSERT INTO business (bid, name, city) VALUES
('b1', 'business 1', 'city 1'),
('b2', 'business 2', 'city 2'),
('b3', 'business 3', 'city 3'),
('b4', 'business 4', 'city 4');

INSERT INTO reviews (rid, bid, uid, stars, date, text) VALUES
('r1', 'b1', 'user1', 5, '2019-05-01', 'blah'),
('r2', 'b2', 'user1', 5, '2019-05-02', 'blah'),
('r3', 'b3', 'user1', 5, '2019-05-03', 'blah'),
('r4', 'b1', 'user2', 4, '2019-05-11', 'blah'),
('r5', 'b2', 'user3', 3, '2019-05-12', 'blah'),
('r6', 'b1', 'user4', 5, '2019-05-13', 'blah');

This allowed me to verify that the original solution I proposed was correct by executing the query in MySQL Workbench. I assume that the 'failure to finish' you mention has nothing to do with the query per se, but is rather a temporary failure of the DB connection api you use. Note that the code is updated to incorporate Mihail Shishkov's proposal for using parameters.

-- Display review information originating from friends of user1
-- DECLARE @UID varchar(40);    -- Uncomment for MS-SQL (variables need to be declared)
SET @UID = 'user1';

SELECT U.name, B.name, B.city, R.text, R.date 
FROM business AS B
INNER JOIN reviews AS R ON B.bid = R.bid
INNER JOIN users AS U ON R.uid = U.uid
WHERE (R.date = (SELECT MAX(X.date) FROM reviews AS X WHERE (X.uid = R.uid)))
  AND (R.uid IN (SELECT F.fid FROM friends AS F WHERE (F.uid = @UID)));

Based on the sample data and using 'user1' as the value for the @UID parameter, the results of the query are:

name        name        city    text  date
------------------------------------------------
user2 name  business 1  city 1  blah  2019-05-11
user3 name  business 2  city 2  blah  2019-05-12

Moreover, I assume that friendship is a two-way relationship in the context of your schema (as in the real world), meaning that friendship between 'user1' and 'user2' only needs to be defined by a single record in table 'friends' with the values ('user1', 'user2') and the reverse ('user2', 'user1') is unnecessary. So, for the sake of completeness, you can use the following query:

-- Display review information originating from friends of user2
SET @UID = 'user2';

SELECT U.name, B.name, B.city, R.text, R.date 
FROM business AS B
INNER JOIN reviews AS R ON B.bid = R.bid
INNER JOIN users AS U ON R.uid = U.uid
WHERE (R.date = (SELECT MAX(X.date) FROM reviews AS X WHERE (X.uid = R.uid)))
  AND (R.uid IN (SELECT F.fid FROM friends AS F WHERE (F.uid = @UID) UNION
                 SELECT F.uid FROM friends AS F WHERE (F.fid = @UID)));

Now, using 'user2' as the value for the @UID parameter and the extended version of the query, we obtain the following results:

name        name        city    text  date
------------------------------------------------
user1 name  business 3  city 3  blah  2019-05-03
user4 name  business 1  city 1  blah  2019-05-13

I would appreciate it if you acknowledge the answer as acceptable.

  • Hi Manos. Your query never finished (not sure why). I thoroughly edited my answer. I included the schema, and replaced the code with information which I think is much more relevant. Could you take another look? – Birdman May 14 '19 at 06:42
  • Do not ever concatenate strings to generate SQL queries. Use parameters instead. https://stackoverflow.com/questions/3216233/what-is-passing-parameters-to-sql-and-why-do-i-need-it – Mihail Shishkov May 14 '19 at 06:46
  • @MihailShishkov Thanks, I will take a look but that doesn't solve me issue. – Birdman May 14 '19 at 06:51
  • I'm sorry the edited solution still does not work. My database has a few million records among all the tables. I'm not sure if this is a super costly query or what, but it does not finish. (Even my query that returns a single row, when I run that 30 times, it'll finish in 5 seconds or so...) – Birdman May 16 '19 at 03:06
  • Furthermore, I am using PostgresSQL, as shown in the tag. I could not use the SET feature as show, but I tried running your query with a hardcoded UID swapped in for your variable. – Birdman May 16 '19 at 03:10
  • For starters I failed to take into account the fact that you are using PostgreSQL. In any case, the variable assignment (SET) was only used in the context of MySQL Workbench in order to simulate the behaviour of your application. – Manos Anastasiadis May 16 '19 at 04:47
  • Furthermore, the size of your database clearly affects performance. You definitely need indexes to the primary and foreign keys if you haven't got already. You also need to consider using INT instead of VARCHAR for your keys data type. – Manos Anastasiadis May 16 '19 at 04:53