I have a dataset df1 like so:

snp <- c("rs7513574_T", "rs1627238_A", "rs1171278_C")
p.value <- c(2.635489e-01, 9.836280e-01 , 6.315047e-01  )

df1 <- data.frame(snp, p.value)

I want to remove the underscore and the letters after it (which represent the allele) from the snp column of df1, and save the result as a new data frame, df2.

I tried this using the code

df2 <- df1[,c("snp", "allele"):=tstrsplit(`snp`, "_", fixed = TRUE)]

However, this changes the df1 data frame. Is there another way to do this?

Gregor Thomas
codemachino
  • Your example of `df1` doesn't work: the line `snp <- c(rs7513574_T, rs1627238_A, rs1171278_C)` can't be run unless `rs7513574_T` and the others are defined first (or are they supposed to be strings? Did you perhaps forget to quote them?). And the way you set up `df1`, the columns are of mixed types, which is bad. Are the `"rs7513574_T"` etc. values supposed to be column names, or in the first row? And later you use `:=`, which isn't base R; it could be from the `rlang` package or from the `data.table` package. Are you using `data.table`? – Gregor Thomas Apr 01 '21 at 14:02
  • Apologies for the messiness. The line `snp <- c(rs7513574_T, rs1627238_A, rs1171278_C)` is supposed to contain strings, and the "rs7513574_T" values are supposed to be in the 1st, 2nd, 3rd row of the column `snp` – codemachino Apr 01 '21 at 14:04
  • Okay, I've cleaned up your example by adding quotes to the strings, getting rid of `rbind` (which was making things rows when you wanted them to be columns), and getting rid of `matrix()` (which was converting everything to `character`). Could you run the sample data code and verify that it is accurate? – Gregor Thomas Apr 01 '21 at 14:08
  • Also, your question text just says you want to remove the `_` and the letter after it, but your code seems to be attempting to put the letter after it into a new column called `"allele"` - if you want to do that you should mention it in the text. – Gregor Thomas Apr 01 '21 at 14:10
  • That looks great and is an accurate representation of the dataset, thank you! – codemachino Apr 01 '21 at 14:12
  • Quite welcome. Next time, please do test your sample data code before posting it :) It saves time for everyone, especially someone like user438383 who made some assumptions and a couple answer attempts based on bad input. – Gregor Thomas Apr 01 '21 at 14:13

4 Answers

This is my best guess as to what you want:

library(tidyr)
separate(df1, snp, into = c("snp", "allele"), sep = "_")
#         snp allele   p.value
# 1 rs7513574      T 0.2635489
# 2 rs1627238      A 0.9836280
# 3 rs1171278      C 0.6315047
Gregor Thomas
  • As there is only one delimiter, you could also omit `sep`: `separate(df1, snp, into = c("snp", "allele"))` – akrun Apr 02 '21 at 01:01
library(dplyr)

# Written against an earlier version of the question, in which df1 had columns V1, V2, V3
df2 <- df1 %>%
    mutate(across(c(V1, V2, V3), ~stringr::str_remove_all(., "_[:alpha:]")))
> df2
               V1        V2        V3
snp     rs7513574 rs1627238 rs1171278
p.value 0.2635489  0.983628 0.6315047
user438383
  • is it possible to do this and create a new dataframe of it? for example a duplicate of `df1` but with the underscore and letter removed? – codemachino Apr 01 '21 at 13:48
  • Yes - if you edit your question to provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your df then I will edit my answer to show how it's done. You only need to make a small example dataset with a couple of lines. – user438383 Apr 01 '21 at 13:49
  • created a reproducible example for you, hopefully that helps! – codemachino Apr 01 '21 at 13:54
  • done - hopefully that is the thing you were looking for? – user438383 Apr 01 '21 at 14:07

Try:

library(dplyr)
df2 <- df1 %>% mutate(snp = gsub("_.*", "", snp))
Marcos Pérez
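For a version that needs no packages, base R's transform() and sub() can do the same job. This is only a sketch built on the sample data from the question; the pattern "_.*$" is an assumption that everything from the first underscore onward is the allele suffix.

```r
# Sample data from the question
snp <- c("rs7513574_T", "rs1627238_A", "rs1171278_C")
p.value <- c(2.635489e-01, 9.836280e-01, 6.315047e-01)
df1 <- data.frame(snp, p.value)

# transform() returns a modified copy, so df1 itself is left untouched.
# sub() strips the first underscore and everything after it.
df2 <- transform(df1, snp = sub("_.*$", "", snp))

df2$snp
# [1] "rs7513574" "rs1627238" "rs1171278"
```

Because the result is assigned to a new name, df1 still holds the original IDs with their allele suffixes.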

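The := operator in data.table modifies its target by reference, which is why the original data changes. Consider making a copy of the dataset and running tstrsplit on the copy, so that the original data stays unchanged.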

library(data.table)
df2 <- copy(df1)
setDT(df2)[,c("snp", "allele") := tstrsplit(snp, "_", fixed = TRUE)]
akrun
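To make the by-reference behaviour concrete, here is a small sketch using data.table. The names dt1, dt2, alias, and flag are hypothetical and chosen only for illustration; the data is the sample from the question. It shows that plain assignment does not protect the original table, while copy() does.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt1 <- data.table(
  snp = c("rs7513574_T", "rs1627238_A", "rs1171278_C"),
  p.value = c(2.635489e-01, 9.836280e-01, 6.315047e-01)
)

# Plain assignment does not copy a data.table: both names refer to the same object
alias <- dt1
alias[, flag := TRUE]     # hypothetical column, added by reference
"flag" %in% names(dt1)    # TRUE: dt1 changed as well
dt1[, flag := NULL]       # undo the change

# copy() creates an independent object, so := no longer touches dt1
dt2 <- copy(dt1)
dt2[, c("snp", "allele") := tstrsplit(snp, "_", fixed = TRUE)]
dt2[, allele := NULL]     # drop the allele column if only the stripped IDs are wanted

dt1$snp   # still carries the allele suffixes
dt2$snp   # stripped IDs
```

Whether to keep or drop the allele column depends on what is needed downstream; the last := line can simply be left out to keep it.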