2

I'm trying to do something that's very simple to do in other languages but in SQL it's proving rather puzzling.

I have a database with the patient ID as row, and 100 symptoms as columns. Symptoms are binary, 0 or 1 if the patient has it or not. Let's say Patient 1 has 0, 1, 1, ... for coughing, sneezing, headaches, etc. Patient 2 similarly has a binary list of symptoms 1, 0, 1, ....

What I want to do is simply generate a table that has counts for pairwise symptoms that number of patients that have that combination of symptoms.

So when I look up in the new table, I can see how many people have both sneezing and coughing. How many have both headache and sneezing and so forth. Only pairwise. If I look up sneezing and sneezing, it would give me back the number of people who have sneezing. Something like a matrix format of symptoms in both rows and columns and number of patients suffering from it in the content.

With a for loop this is incredibly easy in any other language. I'm new to SQL though, and struggling to figure out an efficient way to do this.

Alisa
  • 2,892
  • 3
  • 31
  • 44
mab
  • 73
  • 1
  • 7
  • 3
    What database is this? SQL Server? – Donal Jun 08 '15 at 16:58
  • So you want a matrix where `a[I][J]` is the number of patients that have symptoms `I` and `J`, right? – PM 77-1 Jun 08 '15 at 16:59
  • 1
    Yes, exactly a matrix with I and j as symptoms. – mab Jun 08 '15 at 17:01
  • 2
    If you have control over the structure, I suggest to create a table to hold symptoms – Fabien TheSolution Jun 08 '15 at 17:04
  • 1
    The optimal solution for this will vary by DBMS, and you appear to be ignoring Donal's question about which one it is. If you want an answer you need to tag your question with the Vendor, and preferably version of the database you are using – GarethD Jun 08 '15 at 17:08
  • 4
    The key is to unpivot your data so that you have one row per patient per symptom, then join this data to itself to get pairs of symptoms, then pivot the joined data back up to get your counts. Since I don't know which DBMS to use, I cannot answer the question, I have however created a [working demo in SQL Server](http://sqlfiddle.com/#!3/7ea17/1) that should demonstrate how to do it. – GarethD Jun 08 '15 at 17:17
  • Note that SQL, be it T-SQL, PLSQL or HQL, is vastly different from other programming languages. You have to start thinking in terms of set based operations and relations, not necessarily doing things to particular values in particular columns individually. With most RDBMSs, you don't have operations analogous to for loops because you're supposed to tell the engine _what_ data you want, not _how_ the engine should go about getting it. – Patrick Tucci Jun 08 '15 at 17:21
  • @GarethD - Just brilliant. I thought it would be impossible with a single (static) query. I also love your method for creating massive test data. – PM 77-1 Jun 08 '15 at 17:25
  • Which database are you using? DB2? – John Saunders Jun 08 '15 at 17:47
  • 1
    @GarethD I would argue that you have answered the problem. – Conrad Frix Jun 08 '15 at 17:57
  • @GarethD Thanks a lot, this is great. Unfortunately Pivot and Unpivot don't seem to be built-in commands for my version. I have to figure out how to do what you've done without using those. Following your advice, I decoupled my data to have one row per patient, per symptom. Now I have to do the second part ...I think I'm almost there! Thanks a lot. – mab Jun 09 '15 at 19:05

3 Answers3

0

Your question needs to write functions and using cursors in them.

However, there is an alternative approach:

Suppose you have a table with four columns:

a   b   c   d
-------------------------
1   0   1   1
1   1   0   0
0   0   1   0
0   0   0   1
1   1   1   1
0   1   0   1
1   0   1   0
0   1   1   1
0   0   1   1
1   0   1   0

This is a the answer:

Select sum(a) as a_a, 
    (select count(*) from patients where a=1 and b=1 as a_b) as a_b, 
    (select count(*) from patients where a=1 and c=1 as a_c) as a_c, 
    (select count(*) from patients where a=1 and d=1 as a_d) as a_d, 
    sum(b) as b_b, 
    (select count(*) from patients where b=1 and c=1 as b_c) as b_c, 
    (select count(*) from patients where b=1 and d=1 as b_d) as b_d, 
    sum(c) as c_c, 
    (select count(*) from patients where c=1 and d=1 as c_d) as c_d, 
    sum(d) as d_d 

Now, the result is like this:

a_a     a_b     a_c     a_d     b_b     b_c     b_d     c_c     c_d     d_d
-------------------------------------------------------------------------
5       2       3       2       4       2       2       7       4       6

It is not like a matrix; it has only one row, but it has everything you would like to have. You can expand it to your own table with many fields.

Alisa
  • 2,892
  • 3
  • 31
  • 44
  • Thanks...The problem is the list is 100 symptoms, and I want to try multiple lists so was trying to avoid having to type them out manually, one by one. – mab Jun 09 '15 at 19:03
0

You want to study the interplay between different symptoms, right?

In this case, it is better to get the correlation between different symptoms and SQL does not suite this problem; you need to convert the table to a csv file, then, using R (or even excel), you can get the correlation.

Suppose this is your CSV file (C:/dataFile.csv):

a, b, c, d, e, f
----------------
1, 1, 1, 0, 1, 0
1, 1, 0, 1, 1, 1
0, 0, 0, 1, 0, 0
0, 1, 1, 0, 1, 0
1, 0, 0, 0, 1, 0
0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0
1, 1, 0, 1, 1, 1

Now, in R Statistics, you can run the following commands one by one:

> data <- read.csv("C:/dataFile.csv")
> summary(data)
> cor(data)

And this is the result:

          a         b          c           d          e          f
a 1.0000000 0.5000000  0.0000000  0.25819889 0.77459667  0.5773503
b 0.5000000 1.0000000  0.5773503  0.25819889 0.77459667  0.5773503
c 0.0000000 0.5773503  1.0000000 -0.44721360 0.44721360 -0.3333333
d 0.2581989 0.2581989 -0.4472136  1.00000000 0.06666667  0.7453560
e 0.7745967 0.7745967  0.4472136  0.06666667 1.00000000  0.4472136
f 0.5773503 0.5773503 -0.3333333  0.74535599 0.44721360  1.0000000

Two symptoms with higher correlations means that those two change mostly together. For example, [a and e] or [b and e] are highly correlated.

I hope that give you a broader idea of how to deal with data analysis.

Alisa
  • 2,892
  • 3
  • 31
  • 44
  • Thanks, the problem is that the data is too large for R. – mab Jun 09 '15 at 19:03
  • R is designed for big data! How large is it? How many records are there? – Alisa Jun 09 '15 at 19:06
  • "The pbdNCDF4 package permits multiple processes to write to the same file (without manual synchronization) and supports terabyte-sized files." Quote from: http://cran.r-project.org/web/views/HighPerformanceComputing.html – Alisa Jun 09 '15 at 19:15
  • Theoretically, R suppots up to 2 billion elements (20,000,000 records of 100 columns each). However, if your machine becomes slow, then you can install some R packages: http://stackoverflow.com/questions/9984283/maximum-size-of-a-matrix-in-r/9984368#9984368 – Alisa Jun 09 '15 at 19:17
  • Interesting, thanks - I didn't know. Mine is about half a billion records though. – mab Jun 09 '15 at 19:50
0

This answer by @GarethD, helped solved the problem: Thanks!

The key is to unpivot your data so that you have one row per patient per symptom, then join this data to itself to get pairs of symptoms, then pivot the joined data back up to get your counts. Since I don't know which DBMS to use, I cannot answer the question, I have however created a working demo in SQL Server that should demonstrate how to do it. – GarethD yesterday

mab
  • 73
  • 1
  • 7