I am a bit confused about how to perform this kind of data wrangling, as I am new to R. I want to match subject ID information to a large data set that has more rows than the subject ID data. This is because the large data set contains more than one session per subject in the cohort. For example, Subject A would have rows named SubjectA-01, SubjectA-02, etc.

I want to match the SubjectID name against the large data set so that I can add new columns (sex, age, BMI, etc.) alongside the corresponding data.

We can call this data frame SubjectID:

SubjectID Sex Age
SubjectA M 32
SubjectB F 16

I want to use this information to match the leading keyword of each SampleID in the following table. Let's call this data set BioResults.

SampleID Blood Result
SubjectA-01 2.34
SubjectA-02 2.55
SubjectB-12 3.56

My goal is to make a new data set that looks like this:

SampleID Blood Result Sex Age
SubjectA-01 2.34 M 32
SubjectA-02 2.55 M 32
SubjectB-12 3.56 F 16

What would be the best way to achieve this? I would appreciate any help as I am still new to this coding language. Thank you!

SpiderK
  • Try `BioResults %>% tidyr::separate(SampleID, c("SubjectID", "OtherId")) %>% right_join(SubjectID)` – MrFlick Jul 07 '21 at 07:38
  • Whatever that "-01" chunk is, if you want to merge data, you need values that match exactly. It's easiest to remove the suffix to make the join work. – MrFlick Jul 07 '21 at 07:45
  • What would be the best way to remove the suffix if this is for large data, rows exceeding 900 for subjects? Sorry for the extra followups – SpiderK Jul 07 '21 at 07:49

1 Answer

Does this work?

library(dplyr)
library(stringr)

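# Strip the "-NN" suffix to get a key that matches SubjectID,
# join on that key, then drop the helper column
BioResults %>% mutate(ID = str_remove(SampleID, '-..')) %>% 
       inner_join(subjectID, by = c('ID' = 'SubjectID')) %>% select(-ID)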
     SampleID Blood.Result Sex Age
1 SubjectA-01         2.34   M  32
2 SubjectA-02         2.55   M  32
3 SubjectB-12         3.56   F  16

Data used:

BioResults
     SampleID Blood.Result
1 SubjectA-01         2.34
2 SubjectA-02         2.55
3 SubjectB-12         3.56
subjectID
  SubjectID Sex Age
1  SubjectA   M  32
2  SubjectB   F  16
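
For anyone without dplyr/stringr installed, here is a minimal base R sketch of the same idea: build a key by stripping the suffix, then merge on it. It rebuilds the example data from the post so it runs on its own. The helper column name `ID` and the result name `merged` are just illustrative choices, and the pattern assumes the subject name itself never contains a hyphen.

```r
# Rebuild the example data from the question (values copied from the post).
subjectID <- data.frame(
  SubjectID = c("SubjectA", "SubjectB"),
  Sex = c("M", "F"),
  Age = c(32, 16),
  stringsAsFactors = FALSE
)
BioResults <- data.frame(
  SampleID = c("SubjectA-01", "SubjectA-02", "SubjectB-12"),
  Blood.Result = c(2.34, 2.55, 3.56),
  stringsAsFactors = FALSE
)

# Drop everything from the first "-" onward to recover the subject key.
# sub() is vectorised, so this is one call regardless of the number of rows.
# Assumption: subject names themselves contain no "-".
BioResults$ID <- sub("-.*$", "", BioResults$SampleID)

# Left join: keep every sample row and attach the matching subject columns.
merged <- merge(BioResults, subjectID,
                by.x = "ID", by.y = "SubjectID", all.x = TRUE)

# Remove the helper key and restore a predictable row order.
merged$ID <- NULL
merged <- merged[order(merged$SampleID), ]
rownames(merged) <- NULL
print(merged)
```

Using `all.x = TRUE` keeps samples whose subject is missing from the lookup table (their Sex and Age come back as NA), which can make mismatched keys easier to spot than with an inner join.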
Karthik S
  • Say some subjects had longer numeric suffixes, such as subjectX-0910, or even letters, such as subject-UY9. How would I adjust the code to handle these as well? – SpiderK Jul 07 '21 at 08:03
  • @Bbkazu, in that case use `str_remove(SampleID, '-.*')` – Karthik S Jul 07 '21 at 08:10
  • I did that, and it looks like it kept the data set, but the subjectID information such as sex, age, etc. isn't added as columns to the BioResults data frame. Am I doing something wrong? – SpiderK Jul 07 '21 at 08:26
  • @Bbkazu, that shouldn't happen, you may be missing something, I can't recreate the issue you are facing at my end to fix it. – Karthik S Jul 07 '21 at 08:33
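
Since the problem above could not be reproduced, the following is only a guess at two common causes, sketched in base R: the derived keys may not match the subject table exactly (stray whitespace, different capitalisation), or the joined result may never have been assigned back to a variable. The data here is made up purely for illustration, and `keys` and `unmatched` are hypothetical names.

```r
# Hypothetical data with keys that do not line up exactly (illustration only).
subjectID <- data.frame(
  SubjectID = c("SubjectA", "SubjectB"),
  Sex = c("M", "F"),
  stringsAsFactors = FALSE
)
BioResults <- data.frame(
  SampleID = c("SubjectA-01", " SubjectB-12", "subjectC-UY9"),
  stringsAsFactors = FALSE
)

# Derive the join key, trimming stray whitespace before stripping the suffix.
keys <- sub("-.*$", "", trimws(BioResults$SampleID))

# Keys present in the results but absent from the subject table.
# Anything listed here will not pick up Sex, Age, etc. in a join.
unmatched <- setdiff(keys, subjectID$SubjectID)
print(unmatched)

# A join returns a new data frame; it must be assigned for the columns to stick.
BioResults$ID <- keys
BioResults <- merge(BioResults, subjectID,
                    by.x = "ID", by.y = "SubjectID", all.x = TRUE)
print(names(BioResults))
```

If `unmatched` is empty and the new columns still do not show up, it is worth checking that the output of the pipeline in the answer was stored, for example with `BioResults <- BioResults %>% ...`, because the pipeline does not modify `BioResults` in place.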