I have some protein data in the format of variable / value. The 'value' is self-explanatory. The 'variable' is a string in the form 'PRTN_ASSAYCODE' where 'PRTN' is a particular protein, and 'ASSAYCODE' is a separate string for the sequence used to detect the protein. For a given protein, there are one, two or three different sequences.
What I'm trying to do is split the string into two separate variables and use them for a facet_grid in ggplot (proteins shown vertically, and the different methods for each protein shown horizontally). To do this, I need to create a new variable that numbers the assays within each protein (1, 2 or 3, or some other factor).
For example:
input --> output
ALBU_AAFZXAA --> ALBU, 1
ALBU_AAFZXAA --> ALBU, 1
ALBU_ABGHHSA --> ALBU, 2
FIBR_HFGIAAO --> FIBR, 1
FIBR_YOUSAAA --> FIBR, 2
FIBR_ERAATTA --> FIBR, 3
I can use strsplit to split the string, i.e. I have the protein code, but the assay code is not yet in a usable form (a 1/2/3 index within each protein).
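Roughly, the split step looks like the following. This is a minimal sketch: the data frame name `df`, the column names `variable` and `value`, and the numeric values are placeholders for illustration, not my real data.

```r
# Placeholder data; names and values are assumptions for illustration only
df <- data.frame(
  variable = c("ALBU_AAFZXAA", "ALBU_AAFZXAA", "ALBU_ABGHHSA",
               "FIBR_HFGIAAO", "FIBR_YOUSAAA", "FIBR_ERAATTA"),
  value = c(1.2, 1.4, 0.9, 2.1, 2.3, 1.8),
  stringsAsFactors = FALSE
)

# Split each string at the underscore; the result is a list of character vectors
parts <- strsplit(df$variable, "_", fixed = TRUE)

# First element is the protein code, second is the raw assay code
df$protein <- vapply(parts, `[`, character(1), 1)
df$assay   <- vapply(parts, `[`, character(1), 2)
```

This gives `protein` directly, but `assay` is still the raw sequence code rather than a per-protein index.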
My best guess so far is to use a for loop to run down the dataframe, looking for changes in the first part of the string, then annotating any change in the second part of the string. But it's really cumbersome and error-prone.
Any helpful ideas? My dataframe has ~3000 rows so annotating manually is not an option.
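One possible direction that avoids the loop, as a sketch only: number each assay code by its order of first appearance within its protein, using base R's `ave()` together with `match()` and `unique()`. The data frame `df` and the column names `protein`, `assay` and `assay_index` are assumptions for illustration, and `sub()` is used here in place of `strsplit()` purely for brevity.

```r
# Placeholder data reproducing the example above (column name is an assumption)
df <- data.frame(
  variable = c("ALBU_AAFZXAA", "ALBU_AAFZXAA", "ALBU_ABGHHSA",
               "FIBR_HFGIAAO", "FIBR_YOUSAAA", "FIBR_ERAATTA"),
  stringsAsFactors = FALSE
)

# Everything before the first underscore is the protein,
# everything after it is the assay code
df$protein <- sub("_.*$", "", df$variable)
df$assay   <- sub("^[^_]*_", "", df$variable)

# Within each protein, index assay codes by order of first appearance.
# ave() is applied to row positions so the result stays integer.
df$assay_index <- ave(
  seq_along(df$assay), df$protein,
  FUN = function(i) match(df$assay[i], unique(df$assay[i]))
)

df$assay_index
# 1 1 2 1 2 3
```

With columns like these, something along the lines of `facet_grid(protein ~ assay_index)` should lay the proteins out vertically and the assays horizontally. Note that the numbering depends on row order; sorting the assay codes first would make it independent of the order of the rows.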