How to make a subset or categorical variable from a Variable having factors

Question

I am working on a large dataset. My data frame has a variable like the following, for example:

Part<-c(1,2,3,4,5,6,7)
Disease_codes <- c("A100","A145","B165","B187","B102","C132","D156")
df<-data.frame(Part,Disease_codes)

I want to categorize all the disease codes starting with "A" as "Blood cancer"; codes starting with the letter A (for example A100, A145) are blood cancer. I need to exclude the participants with blood cancer from my study. Of course, I cannot do this manually because I have a huge number of participants. So how can I make a subset of the people whose disease codes start with A and then exclude them from my data frame? For example, I want the following kind of output:

Blood_Cancer_Part<-c(1,2)
Part_without_Blood_cancer<-c(3,4,5,6,7)

Related possible duplicate https://stackoverflow.com/q/31467732/680068 — zx8754, Mar 18 '20 at 08:52

score 0 · Answer 1 · answered Mar 17 '20 at 14:15

Here is a way to do it with the stringr package: check the first letter of each disease code and, depending on the result, create new columns from the existing Part column.

library(stringr)
library(dplyr)

# Creating the dataframe
Part <- c(1,2,3,4,5,6,7)
Disease_codes <- c("A100","A145","B165","B187","B102","C132","D156")
df <- data.frame(Part, Disease_codes)

df <-
  df %>%
  # If Disease_codes starts with "A", copy the value of Part; otherwise NA.
  # NA_character_ makes ifelse() return a character column here.
  mutate(Blood_Cancer_Part = ifelse(str_sub(Disease_codes, 1, 1) == "A", Part, NA_character_),
         # If Disease_codes does not start with "A", copy the value of Part;
         # otherwise NA
         Part_without_Blood_cancer = ifelse(str_sub(Disease_codes, 1, 1) != "A", Part, 
                                            NA_character_))

# To view as vectors
df$Blood_Cancer_Part[!is.na(df$Blood_Cancer_Part)]
# [1] "1" "2"

df$Part_without_Blood_cancer[!is.na(df$Part_without_Blood_cancer)]
# [1] "3" "4" "5" "6" "7"
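
If separate data frames are more convenient than NA-padded columns, the same first-letter test can be used to filter rows instead. The following is a minimal sketch under that assumption and is not part of the original answer; the names `blood_cancer` and `no_blood_cancer` are illustrative.

```r
library(dplyr)
library(stringr)

# Same example data as above
df <- data.frame(
  Part = c(1, 2, 3, 4, 5, 6, 7),
  Disease_codes = c("A100", "A145", "B165", "B187", "B102", "C132", "D156"),
  stringsAsFactors = FALSE
)

# Rows whose code starts with "A" (illustrative name)
blood_cancer <- df %>% filter(str_detect(Disease_codes, "^A"))

# All remaining rows, i.e. the participants to keep in the study
no_blood_cancer <- df %>% filter(!str_detect(Disease_codes, "^A"))

blood_cancer$Part
# [1] 1 2

no_blood_cancer$Part
# [1] 3 4 5 6 7
```

Filtering keeps `Part` numeric and carries the `Disease_codes` column along, which may be easier to work with downstream.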

score 0 · Accepted Answer · answered Mar 17 '20 at 14:17


In base R, we can use subset:

BloodCancer <- subset(df, grepl('^A', Disease_codes), select = Part)
#OR
#BloodCancer <- subset(df, startsWith(Disease_codes, "A"), select = Part)
BloodCancer

#  Part
#1    1
#2    2


Part_without_Blood_cancer <- subset(df, !grepl('^A', Disease_codes))
#OR
#Part_without_Blood_cancer <- subset(df, !startsWith(Disease_codes, "A"))
Part_without_Blood_cancer

#  Part Disease_codes
#3    3          B165
#4    4          B187
#5    5          B102
#6    6          C132
#7    7          D156

data

Part<-c(1,2,3,4,5,6,7)
Disease_codes <- c("A100","A145","B165","B187","B102","C132","D156")
df<-data.frame(Part,Disease_codes, stringsAsFactors = FALSE)
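
The question also asks for a categorical variable, which the subsets above do not create. One possible way, sketched here in base R, is to add a factor column and filter on it. The column name `Category` and the label "Other" are assumptions made for illustration.

```r
df <- data.frame(
  Part = c(1, 2, 3, 4, 5, 6, 7),
  Disease_codes = c("A100", "A145", "B165", "B187", "B102", "C132", "D156"),
  stringsAsFactors = FALSE
)

# Label each participant; "Other" is an assumed label for non-"A" codes
df$Category <- factor(
  ifelse(startsWith(df$Disease_codes, "A"), "Blood cancer", "Other")
)

# Count participants per category
table(df$Category)

# Keep only participants without blood cancer
df_without <- df[df$Category != "Blood cancer", ]
df_without$Part
# [1] 3 4 5 6 7
```

Keeping the label as a column means the exclusion rule stays visible in the data and can be reused for grouping or counting later.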

answered Mar 17 '20 at 14:17

Ronak Shah


Thanks. But how can I subtract the Blood_Cancer participants from my df? How do I get the following output: df_BC with columns Part and Disease_Codes and rows 3 B165, 4 B187, 5 B102, 6 C132? – Aryh Mar 18 '20 at 10:43

@Aryh Isn't that `Part_without_Blood_cancer` ? Or you can use `subset(df, !Part %in% BloodCancer$Part)` – Ronak Shah Mar 18 '20 at 10:52
Can you help me make a subset of BloodCancer if all participants with Disease_codes A100-A180 should count as BloodCancer? The following does not work: `Part_without_Blood_cancer <- subset(df, !grepl('^A100:A180', Disease_codes))` – Aryh Apr 23 '20 at 16:32
You should probably ask a new question for it but you can try `subset(df, Disease_codes %in% paste0('A', 100:180))` – Ronak Shah Apr 23 '20 at 23:26
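
The `grepl('^A100:A180', ...)` attempt fails because the pattern is read as a regular expression, so it looks for the literal text "A100:A180" at the start of each code rather than a range. The `paste0()` suggestion from the last comment can be sketched as follows, assuming every code in the range is the letter "A" followed by a whole number from 100 to 180; the name `bc_codes` is illustrative.

```r
df <- data.frame(
  Part = c(1, 2, 3, 4, 5, 6, 7),
  Disease_codes = c("A100", "A145", "B165", "B187", "B102", "C132", "D156"),
  stringsAsFactors = FALSE
)

# Codes treated as blood cancer: "A100", "A101", ..., "A180"
bc_codes <- paste0("A", 100:180)

# Participants whose code falls in the range
BloodCancer <- subset(df, Disease_codes %in% bc_codes)

# Everyone else
Part_without_Blood_cancer <- subset(df, !Disease_codes %in% bc_codes)

BloodCancer$Part
# [1] 1 2

Part_without_Blood_cancer$Part
# [1] 3 4 5 6 7
```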
