-1

I'm pretty new to SQL and am trying to join some tables in SQL. I'm using SQLite3 and Pandas and have the following table structure:

    User
      |
Measurement - Environment - meas_device - Device
      |                          | 
    Data                 Unit_of_Measurement

Why do I get the result of the following SQL-query multiple times (4x)?

query = """
SELECT User.name, Measurement.id, Data.set_id, Data.subset_id, Data.data 
FROM Measurement
JOIN Data ON Measurement.id = Data.measurement_id
JOIN User ON Measurement.user_id = user.id
JOIN Environment ON Measurement.Environment_id = Environment.id
JOIN meas_device ON Environment.meas_dev_ids = meas_device.id
JOIN Device ON meas_device.device_id = Device.id
JOIN Unit_of_Measurement ON meas_device.Unit_id = Unit_of_Measurement.id
WHERE User.name = 'nicola'
"""

pd.read_sql_query(query, conn)

My guess is that I did something wrong with the joining, but I can not see what. I hoped to be able to save a JOIN statement somewhere that works for every possible query, that's why more tables are joined than necessary for this query.

Update I think the problem lies within the Environment table. Whenever I join this table the results get multiplied. As the Environment is a collection of meas_devices, there are multiple entries with the same Environment id. (I could save the Environment table with the corresponding meas_device_id's as lists, but then I see no possibility to link the Environment table with the meas_device table.)

id | meas_device_id
1  |      1
1  |      2
1  |      5
2  |      3
2  |      4

Up until now i created the tables with pandas DataFrame.to_sql() therefore the id is not marked as primary key or something like that. Could this be the reason for my problem

Update 2 I found the problem. I don't think that actually helps somebody in the future. But for the sake of completeness, here the explanation. It was not really a question of how to link the tables but I neglected a crucial link. Because the Environment has multiple indices with the same value it created "open ends" that resulted in a multiplication of the results. I needed to add a cross-check between Environment.subset_id and Data.subset_id. The following query works fine:

query = f""" SELECT {SELECT}
          FROM Data
          JOIN Measurement ON Data.measurement_id = Measurement.id
          JOIN User ON Measurement.user_id = User.id
          JOIN Environment ON Measurement.Environment_id = Environment.id
          JOIN meas_device ON Environment.meas_dev_ids = meas_device.id
          JOIN Device ON meas_device.Device_id = Device.id
          JOIN Unit_of_Measurement ON meas_device.Unit_id = Unit_of_Measurement.id
          WHERE {WHERE} AND Environment.subset_id = Data.subset_id
          """
General Grievance
  • 4,555
  • 31
  • 31
  • 45
Yehla
  • 199
  • 11
  • Considering `pd.merge` is equivalent to SQL's `JOIN`, this is pretty much the same situation as in [this question](https://stackoverflow.com/questions/51665680/pandas-dataframe-merge-ends-up-with-more-rows) – Quang Hoang Feb 23 '21 at 17:22
  • You are joining tables whose columns are not required. Probably one of them returns more than one record per expected result record. Wouldn't joining on `Measurement`, `Data` and `User` alone do what you need? – Olivier Jacot-Descombes Feb 23 '21 at 17:28
  • Yes in this case it would. I hoped to save a join statement somewhere that that works for every possible query. – Yehla Feb 23 '21 at 17:31

1 Answers1

1

If you need to filter on tables that produce additional rows in the result the when they are joined, don't join them and instead include them in a sub-query in the WHERE clause.

E.g.

SELECT User.name, Measurement.id, Data.set_id, Data.subset_id, Data.data
FROM
    Measurement
    JOIN Data ON Measurement.id = Data.measurement_id
    JOIN User ON Measurement.user_id = user.id
WHERE
    Measurement.Environment_id IN (
        SELECT Environment.id
        FROM
            Environment
            JOIN meas_device ON Environment.meas_dev_ids = meas_device.id
            JOIN Device ON meas_device.device_id = Device.id
            JOIN Unit_of_Measurement ON meas_device.Unit_id = Unit_of_Measurement.id
        WHERE Device.name = 'xy'
    )

In this subquery you can join many tables without generating additional records.

If this is not an option because you want to select entries from other tables as well, you can simply add a DISTINCT to you original query.

SELECT DISTINCT
    User.name, Measurement.id, Data.set_id, Data.subset_id, Data.data 
FROM
   Measurement
   JOIN Data ON Measurement.id = Data.measurement_id
   JOIN User ON Measurement.user_id = user.id
   JOIN Environment ON Measurement.Environment_id = Environment.id
   JOIN meas_device ON Environment.meas_dev_ids = meas_device.id
   JOIN Device ON meas_device.device_id = Device.id
   JOIN Unit_of_Measurement ON meas_device.Unit_id = Unit_of_Measurement.id
WHERE
   User.name = 'nicola'
Olivier Jacot-Descombes
  • 104,806
  • 13
  • 138
  • 188
  • With a very small modification that actually works, thanks. I found the original problem lies with: `JOIN Environment ON Measurement.Environment_id = Environment.id` do you have any idea why this multiplies my results? – Yehla Feb 24 '21 at 06:00
  • I just realized this doesn't solve my problem. I'd like to select entries from meas_device, device and unit_of_measurement with the same query as well. And this doesn't work like this – Yehla Feb 24 '21 at 06:33
  • If you want to select entries from other tables as well, then the problem is different from what you posted. – Olivier Jacot-Descombes Feb 24 '21 at 13:16
  • Thanks for your help Olivier Jacot-Descombes. You were right, I found the error and it lay within a link between two tables I didn't realize had to make. – Yehla Feb 24 '21 at 15:03