2

Suppose I have a character vector like this:

sa<-c("HLA:HLA00001 A*01:01:01:01 1098 bp",
      "HLA:HLA01244 A*01:01:02 546 bp",
      "HLA:HLA01971 A*01:01:03 895 bp")

What is the best way to convert it into a data frame like this:

  Seq          Type             Length
1 HLA:HLA00001 A*01:01:01:01    1098 bp
2 HLA:HLA01244 A*01:01:02       546 bp
3 HLA:HLA01971 A*01:01:03       895 bp
rafa.pereira
David Z
  • See also [here](http://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns) if you really were just splitting by a space. There are plenty more dupes all over SO – David Arenburg Jun 23 '16 at 19:28
  • I saw the link you posted here, so why isn't this post marked as a duplicate? – user5249203 Jun 23 '16 at 20:25

4 Answers

5

Using the ‹dplyr› and ‹tidyr› packages, this is trivial:

  1. Put data into a data_frame,
  2. separate columns:
library(dplyr)
library(tidyr)
data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE)
Source: local data frame [3 x 3]

           Seq          Type Length
         (chr)         (chr)  (int)
1 HLA:HLA00001 A*01:01:01:01   1098
2 HLA:HLA01244    A*01:01:02    546
3 HLA:HLA01971    A*01:01:03    895

This (intentionally) drops the unit from the last column, which is now redundant (as it will always be the same), and converts it to an integer. If you want to keep it, use extra = 'merge' instead.
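For comparison, here is a minimal sketch of the extra = 'merge' variant. It assumes only that tidyr is installed; the plain data.frame() wrapper and the name `out` are choices made for this illustration, not part of the answer above.

```r
library(tidyr)

sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# With extra = "merge", separate() produces at most three pieces, so everything
# after the second space ("1098 bp") stays together in the Length column.
out <- separate(data.frame(sa = sa, stringsAsFactors = FALSE),
                sa, c("Seq", "Type", "Length"), sep = " ", extra = "merge")
print(out)
```

Because the unit is kept, Length remains a character column in this variant.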

You can further separate the Type column with another ‹tidyr› function, extract, which is similar to separate but lets you specify which parts to match. This function takes a regular expression (a must-learn tool if you don’t know it already!) that describes which parts of the text to capture. These parts are in parentheses here:

'(A\\*\\d{2}:\\d{2}):(.*)'

This means: extract two groups. The first group contains the string “A*” followed by two digits, “:” and another two digits. The second group contains the rest of the text after a separating “:”. (I hope I’ve captured the specification of HLA alleles correctly; I’ve never worked with this type of data.)
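To see what the two capture groups pick up without involving tidyr, the same pattern can be tried in base R. This is only an illustrative sketch: the `types` vector (copied by hand from the question) and the `parts` matrix are names introduced here.

```r
# The Type values from the question, copied by hand for illustration.
types <- c("A*01:01:01:01", "A*01:01:02", "A*01:01:03")

# regexec() locates the full match and each capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(types, regexec("(A\\*\\d{2}:\\d{2}):(.*)", types))

# One row per input string: full match, first group, second group.
parts <- do.call(rbind, m)
colnames(parts) <- c("Match", "Group", "Allele")
print(parts)
```

Note that a value with only two fields, such as "A*01:01", has no trailing “:” and would not match this pattern at all, so it is worth checking the data for such cases first.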

Put together with the code from above:

data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE) %>%
    extract(Type, c('Group', 'Allele'), regex = '(A\\*\\d{2}:\\d{2}):(.*)')
Source: local data frame [3 x 4]

           Seq   Group Allele Length
         (chr)   (chr)  (chr)  (int)
1 HLA:HLA00001 A*01:01  01:01   1098
2 HLA:HLA01244 A*01:01     02    546
3 HLA:HLA01971 A*01:01     03    895
Konrad Rudolph
  • Thanks for your help again! If I wanted to add an extra column containing the part of 'Type' before the second `:` (e.g. A*01:01 from A*01:01:01:01 in the 1st row), would you have any suggestions? – David Z Jun 23 '16 at 18:38
  • Thanks, Konrad! It's very helpful. – David Z Jun 23 '16 at 20:08
4

Use read.table, which will require some extra effort since you have the delimiter within the column that you want to keep together:

df <- read.table(text = sa, col.names = c("Seq", "Type", "Length", "Unit"))
df$Length <- paste(df$Length, df$Unit)
df[,-4]
#            Seq          Type  Length
# 1 HLA:HLA00001 A*01:01:01:01 1098 bp
# 2 HLA:HLA01244    A*01:01:02  546 bp
# 3 HLA:HLA01971    A*01:01:03  895 bp
Psidom
3

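Split each string on whitespace, bind the pieces into a matrix, and convert that to a data frame. Note that this yields four columns (V1 to V4), because the unit "bp" is split off from the length:

as.data.frame(do.call(rbind, strsplit(sa, "\\s")))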
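To get the three named columns asked for in the question, the split pieces can be recombined by hand. The following is a base R sketch; the names `parts` and `df` are introduced here for illustration, and it assumes every string has exactly four whitespace-separated fields.

```r
sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# Split on whitespace and stack the pieces into a 3 x 4 character matrix.
parts <- do.call(rbind, strsplit(sa, "\\s"))

# Name the columns and paste the number and its unit back together.
df <- data.frame(Seq    = parts[, 1],
                 Type   = parts[, 2],
                 Length = paste(parts[, 3], parts[, 4]),
                 stringsAsFactors = FALSE)
print(df)
```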
user2100721
0

Yet another simple solution using stringr:

library(stringr)
df <- as.data.frame(str_split_fixed(sa, " ", 3))
colnames(df) <- c("Seq", "Type", "Length")

#           Seq          Type  Length
#1 HLA:HLA00001 A*01:01:01:01 1098 bp
#2 HLA:HLA01244    A*01:01:02  546 bp
#3 HLA:HLA01971    A*01:01:03  895 bp
989