I have some protein data in the format of variable / value. The 'value' is self-explanatory. The 'variable' is a string in the form 'PRTN_ASSAYCODE' where 'PRTN' is a particular protein, and 'ASSAYCODE' is a separate string for the sequence used to detect the protein. For a given protein, there are one, two or three different sequences.

What I'm trying to do is split the string into two separate variables and use them for a facet_grid in ggplot (proteins shown vertically, and the different assays for each protein shown horizontally). To do this, I need to create a new variable for the assay (1, 2 or 3, or some other factor).

For example:

input            output
ALBU_AAFZXAA --> ALBU, 1
ALBU_AAFZXAA --> ALBU, 1
ALBU_ABGHHSA --> ALBU, 2
FIBR_HFGIAAO --> FIBR, 1
FIBR_YOUSAAA --> FIBR, 2
FIBR_ERAATTA --> FIBR, 3

I can use strsplit to split the string, i.e. I have the protein code, but I don't have the assay code in a usable form.

My best guess so far is to use a for loop to run down the dataframe, looking for changes in the first part of string, then annotating any change in the second part of the string. But it's really cumbersome and error-prone.

Any helpful ideas? My dataframe has ~3000 rows so annotating manually is not an option.

  • Welcome to SO. Would you mind adding a reproducible data set using `dput`? – Maël Dec 09 '21 at 12:52
  • that said, have you tried tidyr::separate? – Maël Dec 09 '21 at 12:53
  • You could improve your chances of finding help here by adding a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Adding a MRE and an example of the desired output (in code form, not tables and pictures) makes it much easier for others to find and test an answer to your question. That way you can help others to help you! P.S. Here is [a good overview on how to ask a good question](https://stackoverflow.com/help/how-to-ask) – dario Dec 09 '21 at 12:57

2 Answers

Using the data.table functions tstrsplit() and rleid(): the former splits the string, the latter creates a sequential ID that increases each time the value changes. The by argument makes rleid() restart at 1 for each protein.

library(data.table)
data <- data.table(
  protein = c("ABC_DFG", "ABC_DFG", "ABC_HIJ", "XYZ_TUV")
)
# Solution:
data[, `:=`("ID1" = tstrsplit(protein, "_")[[1]], 
            "ID2" = rleid(tstrsplit(protein, "_")[[2]])),
     by=tstrsplit(protein, "_")[[1]]]

Results in

> data
   protein ID1 ID2
1: ABC_DFG ABC   1
2: ABC_DFG ABC   1
3: ABC_HIJ ABC   2
4: XYZ_TUV XYZ   1

A tidier bit of code, using data.table chaining (DT[][]):

data[, ID1 := tstrsplit(protein, "_")[[1]]][, 
       ID2 := rleid(tstrsplit(protein, "_")[[2]]), by=ID1]
rg255
  • just a suggestion to clean your code a bit: `data[, c("ID1", "ID2") := tstrsplit(protein, "_")][, ID2 := rleid(ID2), by=.(ID1)]` – koolmees Dec 09 '21 at 13:05
  • yes you're right, thanks - wrote in a rush before a meeting – rg255 Dec 09 '21 at 13:35
  • Thank you, seems like a great solution. I went with the tidy solution as the syntax is a bit friendlier to a noob like me :-) – Superficial Dec 10 '21 at 16:47
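
A note on the rleid() step, offered as a minimal sketch rather than as part of the answer above: rleid() numbers consecutive runs, so if the same assay code reappears later within a protein (for example, when the rows are not sorted), it gets a new number. Assuming that is not wanted, match(x, unique(x)) numbers the codes by first appearance instead. The toy table `dt` and the column names `by_run` and `by_code` are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical): the DFG assay reappears after HIJ for protein ABC
dt <- data.table(protein = c("ABC_DFG", "ABC_HIJ", "ABC_DFG", "XYZ_TUV"))

# Split into protein code (ID1) and assay code (ID2)
dt[, c("ID1", "ID2") := tstrsplit(protein, "_")]

# rleid() starts a new id whenever the value differs from the previous row
dt[, by_run := rleid(ID2), by = ID1]

# match() against the unique codes gives repeated codes the same id
dt[, by_code := match(ID2, unique(ID2)), by = ID1]

print(dt)
```

On data where identical assay codes are always adjacent within a protein, as in the question's example, both columns should agree.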

Use tidyr::separate. You can then use v1 and v2 as unique identifiers for your facet_grid.

library(tidyr)  # provides separate() and re-exports %>%

data %>% separate(protein, c("v1","v2"))
    v1      v2
1 ALBU AAFZXAA
2 ALBU AAFZXAA
3 ALBU ABGHHSA
4 FIBR HFGIAAO
5 FIBR YOUSAAA
6 FIBR ERAATTA

To get a numeric id, add data.table::rleid.

library(dplyr)  # group_by() and mutate(); separate() comes from tidyr

data %>% separate(protein, c("v1","v2")) %>%
  group_by(v1) %>%
  mutate(group = data.table::rleid(v2))

Data

data <- data.frame(protein = c("ALBU_AAFZXAA", "ALBU_AAFZXAA", "ALBU_ABGHHSA", 
                              "FIBR_HFGIAAO","FIBR_YOUSAAA","FIBR_ERAATTA"))
Maël
  • I think tidyr::separate is the way to go, but I really don't think the rleid step is necessary at all because we already have unique identifier in v2 - the OP did not specify that it *has* to be numbers, just something in order to create the facets. Also - how did you create the data? – tjebo Dec 09 '21 at 13:17
  • That's right, I misunderstood and thought another variable was necessary as shown in the output. Answer edited. – Maël Dec 09 '21 at 13:19
  • I think you need the 1/2/3 reading the question — it looks like each protein has up to three different assay codes, but looking at the data, there are more than three unique assay codes across the data so they need to be recoded within protein to 1, 2 or 3 – rg255 Dec 09 '21 at 14:03
  • This was perfect (and simple!) - thank you! – Superficial Dec 10 '21 at 16:46
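
For completeness, here is a hedged sketch of how the split columns might feed the facet_grid the question asks about. It assumes the variable / value layout described in the question; the data frame `df`, its `value` numbers, and the column names `protein`, `assay` and `assay_id` are invented for illustration and are not from the original data.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Toy data in the question's variable/value layout; the values are invented
df <- data.frame(
  variable = c("ALBU_AAFZXAA", "ALBU_AAFZXAA", "ALBU_ABGHHSA",
               "FIBR_HFGIAAO", "FIBR_YOUSAAA", "FIBR_ERAATTA"),
  value = c(1.2, 1.4, 0.9, 2.1, 2.5, 1.8)
)

plot_df <- df %>%
  separate(variable, c("protein", "assay"), sep = "_") %>%
  group_by(protein) %>%
  # number the assay codes within each protein by first appearance
  mutate(assay_id = match(assay, unique(assay))) %>%
  ungroup()

# Proteins as facet rows, within-protein assay number as facet columns
p <- ggplot(plot_df, aes(x = assay, y = value)) +
  geom_point() +
  facet_grid(protein ~ assay_id, scales = "free_x")
```

Printing `p` draws the plot. Faceting on `assay_id` rather than on the raw assay code keeps the grid at no more than three columns, which is the reason for recoding within protein that is raised in the comments above.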