I'm trying very hard to break out of my C mold but, as you'll see, it's still present in my R code. I know there will be a smart R way of doing this!

I'm essentially trying to go through a long list of individuals held in a data frame (DF). Each individual can have multiple rows if they have taken more than one drug, or the same drug on multiple occasions. Each row has a drug name entry, similar to:

patientID drugname
1         A
2         A
2         B
3         C
3         C
4         A

I have a list containing the unique drug names from this DF (A, B, C). I would like to build a data frame with columns drugname and drugCount. In drugCount I want the number of distinct individuals each drug was prescribed to: a drug counts at most once per person, so it is more of a binary "was this drug given to person X?" than a count of rows.

A start of an attempt using a very C-style manner:

uniqueDrugList <- unique(therapyDF$prodcode)
numDrugs <- length(uniqueDrugList)
idList <- unique(therapyDF$patid)
prevalenceDF <- data.frame(drugName=uniqueDrugList,drugcount=integer(numDrugs),prevalence=numeric(numDrugs),stringsAsFactors=FALSE)
for(i in seq_along(idList)) {
    # all rows belonging to the i-th individual
    individualDF <- subset(therapyDF,patid==idList[[i]])

    for(j in seq_len(numDrugs)) {
        # count each drug at most once per individual
        if(uniqueDrugList[[j]] %in% individualDF$prodcode) {
            prevalenceDF$drugcount[j] <- prevalenceDF$drugcount[j] + 1L
        }
    }
}

First, I take a subset of my main DF for each individual, using a list of unique IDs. Then, for each unique drug (and this is where it is slow), I check whether that drug is present in that individual's records. If it is, I would like to add 1 to that drug's entry; otherwise the loop moves on, eventually to the next individual's subset.

Expected output

drugname   count
A          3
B          1
C          1
Anthony Nash
  • Can you show your expected output? Perhaps `library(dplyr);df %>% group_by(patientID) %>% summarise(n = n_distinct(drugname))` – akrun May 18 '18 at 07:03
  • Thanks. Edited. I need the R route: with around 100 million records, my C-style solution will take an age. – Anthony Nash May 18 '18 at 07:10

1 Answer

We can group by 'drugname' and count the distinct values of 'patientID' in each group with n_distinct:

library(dplyr)
df %>% 
  group_by(drugname) %>%
  summarise(count = n_distinct(patientID))
# A tibble: 3 x 2
#  drugname  count
#  <chr>    <int>
#1 A            3
#2 B            1
#3 C            1
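
To try this on the data frame from the question, here is a minimal sketch that rebuilds the sample rows under the column names used in the loop (`patid`, `prodcode`). The values are copied from the sample table, and the output column name `drugCount` is only an assumption about what is wanted. It drops duplicate patient/drug pairs with `distinct()` first and then counts rows per drug, which should give the same result as `n_distinct()` above.

```r
library(dplyr)

# Sample rows from the question, using the column names from the loop code
therapyDF <- data.frame(
  patid    = c(1, 2, 2, 3, 3, 4),
  prodcode = c("A", "A", "B", "C", "C", "A"),
  stringsAsFactors = FALSE
)

prevalenceDF <- therapyDF %>%
  distinct(patid, prodcode) %>%          # keep one row per patient/drug pair
  count(prodcode, name = "drugCount")    # rows per drug = patients per drug

prevalenceDF
# A: 3, B: 1, C: 1
```

Patient 3 appears twice with drug C, but only contributes once to its count.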

Or use table from base R after dropping the duplicate rows with unique:

table(unique(df)[2])
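
If a data frame with drugname and drugCount columns is needed rather than a table object, the base R route could be sketched as below. The sample `df` is rebuilt from the question. The `prevalence` column is an assumption: the question does not define it, so here it is taken to mean the share of distinct patients who were given the drug.

```r
# Sample data from the question
df <- data.frame(
  patientID = c(1, 2, 2, 3, 3, 4),
  drugname  = c("A", "A", "B", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# One row per patient/drug pair, then tabulate the drug names
counts <- table(unique(df[, c("patientID", "drugname")])$drugname)

prevalenceDF <- data.frame(
  drugname  = names(counts),
  drugCount = as.integer(counts),
  stringsAsFactors = FALSE
)

# Assumed definition: fraction of distinct patients given each drug
prevalenceDF$prevalence <- prevalenceDF$drugCount / length(unique(df$patientID))

prevalenceDF
# A: 3 (0.75), B: 1 (0.25), C: 1 (0.25)
```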
akrun