Using the ‹dplyr› and ‹tidyr› packages, this is trivial:
- Put the data into a data_frame, then separate it into columns:
data_frame(sa) %>%
separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE)
Source: local data frame [3 x 3]
Seq Type Length
(chr) (chr) (int)
1 HLA:HLA00001 A*01:01:01:01 1098
2 HLA:HLA01244 A*01:01:02 546
3 HLA:HLA01971 A*01:01:03 895
This (intentionally) drops the unit from the last column, since it is redundant (it is always the same), and converts the remaining value to an integer. If you want to keep the unit, use extra = 'merge' instead.
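As a rough illustration of the extra = 'merge' variant, here is a minimal sketch. The contents of sa are an assumption (the original input is not shown here): three strings reconstructed from the output above, with a hypothetical trailing "bp" unit. It uses base data.frame() only to keep the sketch independent of the dplyr version.

```r
library(tidyr)

# Assumed input: the real `sa` vector is not shown, so these strings
# (including the "bp" unit) are hypothetical reconstructions.
sa <- c('HLA:HLA00001 A*01:01:01:01 1098 bp',
        'HLA:HLA01244 A*01:01:02 546 bp',
        'HLA:HLA01971 A*01:01:03 895 bp')

# extra = 'merge' splits at most three times, so the unit stays
# attached to the last column instead of being dropped.
merged <- data.frame(sa = sa, stringsAsFactors = FALSE) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'merge', convert = TRUE)

print(merged)
```

Because the unit is kept, Length can no longer be converted to an integer and remains a character column.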
You can further separate the Type column by applying another ‹tidyr› function, extract. It is quite similar to separate, but instead of a separator you provide a regular expression (a must-learn tool if you don’t know it already!) that specifies which parts of the text to match. These parts are in parentheses here:
'(A\\*\\d{2}:\\d{2}):(.*)'
This means: extract two groups. The first group contains the string “A*” followed by two digits, “:” and another two digits. The second group contains all the rest of the text, after a separating “:” (I hope I’ve captured the specification of HLA alleles correctly; I’ve never worked with this type of data).
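To see what the two groups capture before involving tidyr, the pattern can be tried on its own with base R. This is only a sketch: the types vector is copied from the Type column in the output above, and the variable names are made up for illustration.

```r
pattern <- '(A\\*\\d{2}:\\d{2}):(.*)'

# Values taken from the Type column shown earlier.
types <- c('A*01:01:01:01', 'A*01:01:02', 'A*01:01:03')

# regexec() reports the full match plus each parenthesised group;
# regmatches() turns those positions into the matched strings.
matches <- regmatches(types, regexec(pattern, types))

# One row per input: full match, first group, second group.
groups <- do.call(rbind, matches)
print(groups[, 2])  # first group:  "A*01:01" for every row
print(groups[, 3])  # second group: "01:01" "02" "03"

# A value without the separating ':' after the second pair of digits
# does not match at all, so no groups are returned for it.
no_match <- regmatches('A*01:01', regexec(pattern, 'A*01:01'))[[1]]
print(length(no_match))  # 0
```

The last case matters for real data: as far as I can tell, extract fills both new columns with NA for rows where the pattern does not match.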
Put together with the code from above:
data_frame(sa) %>%
separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE) %>%
extract(Type, c('Group', 'Allele'), regex = '(A\\*\\d{2}:\\d{2}):(.*)')
Source: local data frame [3 x 4]
Seq Group Allele Length
(chr) (chr) (chr) (int)
1 HLA:HLA00001 A*01:01 01:01 1098
2 HLA:HLA01244 A*01:01 02 546
3 HLA:HLA01971 A*01:01 03 895