How to write a function which can count individuals once even with multiple matches across columns

Question

I have a data set which has patient diagnostic (ICD-9) codes, which can have a length between 3-5 digits, where the first three digits represent a classification of diagnosis, and the 4th and 5th represent a further refinement of the classification. For example:

zz<-"     dx1   dx2   dx3
1  64251 82381  8100
2   8052  8730 51881
3  64421   431 81601
4   3041 29690  9920
5  72888  8782 59080
6   7245 60886  8479
7    291  4659  4739
8  30410 30400 95901
9   2929 30500  8208
10  7840  6268  8052"
df<-read.table(text=zz, header=TRUE)

Each row of codes represents multiple diagnoses of the same individual. I have written a series of ifelse statements to create a new variable with the codes I’m interested in so they are mapped to numbers representing different diagnoses of interest:

df$x<-ifelse(grepl("^291", dx1),1, ifelse(grepl("^292", dx1),1 
       ifelse(grepl("^3040", dx1),2,ifelse(grepl("^3047", dx1),2,
       ifelse(grepl("^3051", dx1),3,ifelse(grepl("^98984", dx1),3,0))))))

Where I run into trouble is when I want to check for these select codes across each of the columns containing diagnostic codes. I attempted to write a function for this:

df$alldx<-apply(df[,c(1:3)],MARGIN = 2, function(dx){
  ifelse(grepl("^291", dx),1, ifelse(grepl("^292", dx),1 
  ifelse(grepl("^3040", dx),2,ifelse(grepl("^3047", dx),2,
  ifelse(grepl("^3051", dx),3,ifelse(grepl("^98984", dx),3,0)))))) 
})

The problem is I only want to count an individual once if they have one of the codes of interest; in the case of multiple code matches, then that person’s code should be whichever diagnosis was given first. I feel like there must be a way to do this, but it’s well beyond my coding abilities!

I don't understand exactly what is your desired output. Also it would be good if you shared data for reproducing. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — s_baldur, Jul 29 '16 at 19:13
If I understand the problem correctly you want to create a count flag based specific values from all the columns `dx1` etc. but there's the possibility that the same value can appear in multiple columns? — Warner, Jul 29 '16 at 19:19
"Each row of codes represents multiple diagnoses of the same individual." -- So every individual has exactly three diagnoses? — Frank, Jul 29 '16 at 19:22
Not just that the same value can appear in multiple columns, but also that I don't want to count an individual more than once. — Valerie, Jul 29 '16 at 19:24

Frank · Accepted Answer · 2016-07-29T19:41:18.337

Here's what I would do, using the data.table package for convenience:

library(data.table)
setDT(df)[, id := .I]

DF = melt(df, id="id")[, 
  `:=`(diag = substr(value, 1, 3), ref = substr(value, 4, 5))][order(id)]

So now the data looks like

    id variable value diag ref
 1:  1      dx1 64251  642  51
 2:  1      dx2 82381  823  81
 3:  1      dx3  8100  810   0
 4:  2      dx1  8052  805   2
 5:  2      dx2  8730  873   0
 6:  2      dx3 51881  518  81
 7:  3      dx1 64421  644  21
 8:  3      dx2   431  431    
 9:  3      dx3 81601  816  01
10:  4      dx1  3041  304   1
11:  4      dx2 29690  296  90
12:  4      dx3  9920  992   0
13:  5      dx1 72888  728  88
14:  5      dx2  8782  878   2
15:  5      dx3 59080  590  80
16:  6      dx1  7245  724   5
17:  6      dx2 60886  608  86
18:  6      dx3  8479  847   9
19:  7      dx1   291  291    
20:  7      dx2  4659  465   9
21:  7      dx3  4739  473   9
22:  8      dx1 30410  304  10
23:  8      dx2 30400  304  00
24:  8      dx3 95901  959  01
25:  9      dx1  2929  292   9
26:  9      dx2 30500  305  00
27:  9      dx3  8208  820   8
28: 10      dx1  7840  784   0
29: 10      dx2  6268  626   8
30: 10      dx3  8052  805   2
    id variable value diag ref

where id is the patient ID, diag is the diagnosis and ref is the refinement, if any. We can see that there is not a lot of duplication in diagnoses in the example data:

DF[, sum(duplicated(diag)), by=id]

    id V1
 1:  1  0
 2:  2  0
 3:  3  0
 4:  4  0
 5:  5  0
 6:  6  0
 7:  7  0
 8:  8  1
 9:  9  0
10: 10  0

Now, if we wanted to count how many patients have a diagnosis of 304, we can do:

DF[diag == "304", uniqueN(id)]

[1] 2

If you wish to map codes to new codes (which might be more confusing than anything else), I'd recommend doing so by making a separate table describing the mapping and merging on it to assign the new codes. I think data.table is also handy for that. The intro materials for the package are a good place to start with it.

A brief example for merging:

m = fread("
diag new_code
291 1
292 1 
304 2
305 3
989 3", colClasses=c(diag = "character"))

DF[m, on="diag", new_code := i.new_code ]

How to write a function which can count individuals once even with multiple matches across columns

1 Answers1