1

I am using MySQL through R. I am working with two tables within the same database and I noticed something strange that I can't explain. To be more specific, when I try to make a connection between the tables using a foreign key the result is not what it should be.

One table is called Genotype_microsatellites, the second table is called Records_morpho. They are connected through the foreign key sample_id.

If I only select records with certain characteristics from the Genotype_microsatellites table using the following command...

Gen_msat <- dbGetQuery(mydb, 'SELECT * 
                   FROM Genotype_microsatellites
                   WHERE CIDK113a >= 0')

...the query returns 546 observations for 52 variables, exactly what I would expect. Now, I want to do a query that adds a little more info to my results, specifically by including data from the Records_morpho table. I, therefore, use the following code:

Gen_msat <- dbGetQuery(mydb, 'SELECT  Genotype_microsatellites.*,
                   Records_morpho.net_mass_g,
                   Records_morpho.svl_mm 
                   FROM Genotype_microsatellites
                   INNER JOIN Records_morpho ON Genotype_microsatellites.sample_id = Records_morpho.sample_id 
                   WHERE CIDK113a >= 0')

The problem is that now the output has 890 observation and 54 variables!! Some sample_id values (i.e., the rows or individuals in the data frame ) are showing up multiple times, which shouldn't be the case. I have tried to fix this using SLECT DISTINCT, but the problem wouldn't go away.

Any help would be much appreciated.

Cinghio
  • 302
  • 1
  • 4
  • 14

1 Answers1

5

Sounds like it is working as intended, that is how joins work. With A JOIN B ON A.x = B.y you get every row from A combined with every row from B that has a y matching the A row's x. If there are 3 rows in B that match one row in A, you will get three result rows for those. The A row's data will be repeated for each B row match.

To go a little further, if x is not unique and y is not unique. And you have two x with the same value, and three y with that value, they will produce six result rows.

As you mentioned DISTINCT does not make this problem go away because DISTINCT operates across the result row. It will only merge result rows if the values in all selected fields are the same on those result rows. Similarly, if you have a query on a single table that has duplicate rows, DISTINCT will merge those rows despite them being separate rows, as they do not have distinct sets of values.

Uueerdo
  • 15,723
  • 1
  • 16
  • 21
  • Totally makes sense. I guess I need to figure out a way to parse out the extra records from the results. Thank you. – Cinghio Jun 07 '18 at 21:49
  • If you have multiple records for the same sample, which one(s) do you want in the results? If you need all the record data (B), you typically do a join and just ignore the redundant sample data (A) when handling the results; but if the A data is large enough that it's redundancy is problematic, it can be better to query for the B data separately and "join" programmatically in the client. – Uueerdo Jun 07 '18 at 21:50
  • The Records_morpho table contains morphological features. For those samples with multiple entries (that have been captured more than once) I could select the one entry with the largest value of one of the morphological features , like Records_morpho.net_mass_g – Cinghio Jun 07 '18 at 21:54
  • This might help for that https://stackoverflow.com/questions/3619030/mysql-join-the-most-recent-row-only (_instead of "most recent" you'd be going for "largest", so adjust as needed)_ – Uueerdo Jun 07 '18 at 22:31
  • You have the SELECT statement in SQL to parse out the columns you need – Daniel Viglione Oct 31 '22 at 16:16