I have two tables, packages
(with id, name as attributes) and releases
(with url, upload_time, downloaded_bytes as attributes). Every package can have arbitrary many releases. I want a list of all packages with their latest release.
Currently, I have the following working code:
sql = ("SELECT `packages`.`id`, `name` FROM `packages`")
cursor.execute(sql)
packages = cursor.fetchall()
for pkg in packages:
sql = ("SELECT `url` FROM `releases` "
"WHERE `package_id` = %s "
"AND `downloaded_bytes` = 0 "
"ORDER BY `upload_time` DESC LIMIT 1")
cursor.execute(sql, (pkg['id'], ))
url = cursor.fetchone()
if url is not None:
package_url = url['url']
package_analysis.main(pkg['name'], package_url)
logging.info("Package '%s' done.", pkg['name'])
However, I think this is an ugly solution as I execute a lot of queries where I should only execute one query.
Can I do this in one query? How would the query look like?
Please note: I only want one result for each package. That means, the package numpy
should only give the result for url="https://pypi.python.org/packages/cp35/n/numpy/numpy-1.10.1-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl"
(version 1.10.1) and not 99 results.
What I've tried
SELECT
`packages`.`id`,
`packages`.`name`,
`releases`.`url`,
`releases`.`upload_time`,
`releases`.`release_number`
FROM
`packages`
JOIN
`releases` ON `releases`.`package_id` = `packages`.`id`
GROUP BY
`packages`.`name`
ORDER BY
`releases`.`upload_time` DESC
But that gives a seemingly random value for upload_time
(and also url
).