0

I have two tables, packages (with id, name as attributes) and releases (with url, upload_time, downloaded_bytes as attributes). Every package can have arbitrary many releases. I want a list of all packages with their latest release.

Currently, I have the following working code:

sql = ("SELECT `packages`.`id`, `name` FROM `packages`")
cursor.execute(sql)
packages = cursor.fetchall()
for pkg in packages:
    sql = ("SELECT `url` FROM `releases` "
           "WHERE `package_id` = %s "
           "AND `downloaded_bytes` = 0 "
           "ORDER BY `upload_time` DESC LIMIT 1")
    cursor.execute(sql, (pkg['id'], ))
    url = cursor.fetchone()
    if url is not None:
        package_url = url['url']
        package_analysis.main(pkg['name'], package_url)
        logging.info("Package '%s' done.", pkg['name'])

However, I think this is an ugly solution as I execute a lot of queries where I should only execute one query.

Can I do this in one query? How would the query look like?

Please note: I only want one result for each package. That means, the package numpy should only give the result for url="https://pypi.python.org/packages/cp35/n/numpy/numpy-1.10.1-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl" (version 1.10.1) and not 99 results.

What I've tried

SELECT
    `packages`.`id`,
    `packages`.`name`,
    `releases`.`url`,
    `releases`.`upload_time`,
    `releases`.`release_number`
FROM
    `packages`
JOIN
    `releases` ON `releases`.`package_id` = `packages`.`id`
GROUP BY
    `packages`.`name`
ORDER BY
    `releases`.`upload_time` DESC

But that gives a seemingly random value for upload_time (and also url).

Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
  • 1
    You might want to look at this question: http://stackoverflow.com/questions/12102200/get-records-with-max-value-for-each-group-of-grouped-sql-results That's the general form of the problem. The (in my opinion) best answer is using left joins. – Emiswelt Dec 09 '15 at 09:27

2 Answers2

0

You can try this query:

select p.id, p.name, r.url, r.upload_time, r.release_number from
(select p.id, max(r.release_number) release_number from packages p
join releases r on p.id = r.package_id
group by p.id) a
join packages p on p.id = a.id
join releases r on r.release_number = a.release_number

It assumes that release_number is sortable, if not possible you can use max upload time instead.

Espen Burud
  • 1,871
  • 10
  • 9
0

Based on this answer (thank you Emiswelt) for mentioning it:

SELECT
    `packages`.`id`,
    `packages`.`name`,
    `o`.`url`,
    `o`.`upload_time`,
    `o`.`release_number`
FROM
    `releases` o
LEFT JOIN
    `releases` b  ON `o`.`package_id` = `b`.`package_id`
    AND `o`.`upload_time` < `b`.`upload_time`
JOIN
    `packages` ON `packages`.`id` = o.package_id
WHERE
    `b`.`upload_time` is NULL
    AND `o`.`downloaded_bytes` = 0
ORDER BY
    `packages`.`name`
LIMIT 10

The query finishes execution within a fraction of a second.

Community
  • 1
  • 1
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958