I'm pretty new to SQL and am trying to join some tables in SQL.
I'm using SQLite3
and Pandas
and have the following table structure:
User
|
Measurement - Environment - meas_device - Device
| |
Data Unit_of_Measurement
Why do I get the result of the following SQL-query multiple times (4x)?
query = """
SELECT User.name, Measurement.id, Data.set_id, Data.subset_id, Data.data
FROM Measurement
JOIN Data ON Measurement.id = Data.measurement_id
JOIN User ON Measurement.user_id = user.id
JOIN Environment ON Measurement.Environment_id = Environment.id
JOIN meas_device ON Environment.meas_dev_ids = meas_device.id
JOIN Device ON meas_device.device_id = Device.id
JOIN Unit_of_Measurement ON meas_device.Unit_id = Unit_of_Measurement.id
WHERE User.name = 'nicola'
"""
pd.read_sql_query(query, conn)
My guess is that I did something wrong with the joining, but I can not see what. I hoped to be able to save a JOIN statement somewhere that works for every possible query, that's why more tables are joined than necessary for this query.
Update I think the problem lies within the Environment table. Whenever I join this table the results get multiplied. As the Environment is a collection of meas_devices, there are multiple entries with the same Environment id. (I could save the Environment table with the corresponding meas_device_id's as lists, but then I see no possibility to link the Environment table with the meas_device table.)
id | meas_device_id
1 | 1
1 | 2
1 | 5
2 | 3
2 | 4
Up until now i created the tables with pandas DataFrame.to_sql()
therefore the id is not marked as primary key or something like that. Could this be the reason for my problem
Update 2 I found the problem. I don't think that actually helps somebody in the future. But for the sake of completeness, here the explanation. It was not really a question of how to link the tables but I neglected a crucial link. Because the Environment has multiple indices with the same value it created "open ends" that resulted in a multiplication of the results. I needed to add a cross-check between Environment.subset_id and Data.subset_id. The following query works fine:
query = f""" SELECT {SELECT}
FROM Data
JOIN Measurement ON Data.measurement_id = Measurement.id
JOIN User ON Measurement.user_id = User.id
JOIN Environment ON Measurement.Environment_id = Environment.id
JOIN meas_device ON Environment.meas_dev_ids = meas_device.id
JOIN Device ON meas_device.Device_id = Device.id
JOIN Unit_of_Measurement ON meas_device.Unit_id = Unit_of_Measurement.id
WHERE {WHERE} AND Environment.subset_id = Data.subset_id
"""