Convert string arrays to data frame in R

Question

Suppose I have a string array such as:

sa<-c("HLA:HLA00001 A*01:01:01:01 1098 bp",
      "HLA:HLA01244 A*01:01:02 546 bp",
      "HLA:HLA01971 A*01:01:03 895 bp")

What is the best way to convert it to a data frame like this:

  Seq          Type             Length
1 HLA:HLA00001 A*01:01:01:01    1098 bp
2 HLA:HLA01244 A*01:01:02       546 bp
3 HLA:HLA01971 A*01:01:03       895 bp

See also [here](http://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns) if you really were just splitting by a space. There are plenty more dupes all over SO — David Arenburg, Jun 23 '16 at 19:28
I saw the link you posted here; why isn't this post marked as a duplicate, then? – user5249203, Jun 23 '16 at 20:25

Konrad Rudolph · Accepted Answer · 2016-06-23T19:22:07.600

Using the `dplyr` and `tidyr` packages, this is trivial:

Put data into a data_frame,
separate columns:

library(dplyr)
library(tidyr)

data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE)

Source: local data frame [3 x 3]

           Seq          Type Length
         (chr)         (chr)  (int)
1 HLA:HLA00001 A*01:01:01:01   1098
2 HLA:HLA01244    A*01:01:02    546
3 HLA:HLA01971    A*01:01:03    895

This (intentionally) drops the unit from the last column, which is now redundant (as it will always be the same), and converts it to an integer. If you want to keep it, use extra = 'merge' instead.
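As a minimal sketch of the `extra = 'merge'` variant mentioned above: it splits at most as many times as there are target columns, so the unit stays attached to the length. The name `kept` is only illustrative, and `tibble()` is used here on the assumption that a current `dplyr` is installed, where it is the replacement for the older `data_frame()`.

```r
library(dplyr)
library(tidyr)

sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# extra = 'merge' keeps everything after the second space in the last column,
# so Length stays a character column such as "1098 bp"
kept <- tibble(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'merge')

kept$Length
```

This matches the three-column layout asked for in the question, at the cost of `Length` no longer being numeric.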

You can further separate the Type column by applying another `tidyr` function, quite similar to separate, but specifying which parts to match: extract. This function allows you to provide a regular expression (a must-learn tool if you don’t know it already!) that specifies which parts of a text to match. These parts are in parentheses here:

'(A\\*\\d{2}:\\d{2}):(.*)'

This means: extract two groups. The first group contains the string “A*” followed by two digits, “:”, and another two digits. The second group contains the rest of the text after a separating “:”. (I hope I’ve captured the specification of HLA alleles correctly; I’ve never worked with this type of data.)
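To see what the two groups capture without any packages, the same pattern can be tried in base R. This is only an illustrative sketch; the variable names `type`, `pattern` and `groups` are made up for the example.

```r
type <- c("A*01:01:01:01", "A*01:01:02", "A*01:01:03")
pattern <- "(A\\*\\d{2}:\\d{2}):(.*)"

# regexec() finds the full match plus each parenthesised group;
# regmatches() turns those positions into the matched text
matches <- regmatches(type, regexec(pattern, type))

# One row per input string: full match, first group, second group
groups <- do.call(rbind, matches)

groups[, 2]  # first group:  "A*01:01" for every row
groups[, 3]  # second group: "01:01" "02" "03"
```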

Put together with the code from above:

data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE) %>%
    extract(Type, c('Group', 'Allele'), regex = '(A\\*\\d{2}:\\d{2}):(.*)')

Source: local data frame [3 x 4]

           Seq   Group Allele Length
         (chr)   (chr)  (chr)  (int)
1 HLA:HLA00001 A*01:01  01:01   1098
2 HLA:HLA01244 A*01:01     02    546
3 HLA:HLA01971 A*01:01     03    895

Thanks for your help again! If I wanted to add an extra column containing the part of the 'Type' column before the second `:` (e.g. A*01:01 from A*01:01:01:01 in the 1st row), would you have any suggestions? – David Z, Jun 23 '16 at 18:38
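If the goal in the comment above is to add the group as an extra column while keeping the full Type, one possible sketch uses the `remove = FALSE` argument of `extract`, so the input column is not dropped. The name `with_group` is illustrative, and `tibble()` assumes a current `dplyr`.

```r
library(dplyr)
library(tidyr)

sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

with_group <- tibble(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE) %>%
    # A single capture group: everything up to (not including) the second ':'.
    # remove = FALSE keeps the original Type column next to the new Group column.
    extract(Type, 'Group', regex = '^(A\\*\\d{2}:\\d{2})', remove = FALSE)

with_group
```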

Psidom · Answer 2 · 2016-06-23T20:12:20.813

Use read.table, which will require some extra effort since you have the delimiter within the column that you want to keep together:

df <- read.table(text = sa, col.names = c("Seq", "Type", "Length", "Unit"))
df$Length <- paste(df$Length, df$Unit)
df[,-4]
#            Seq          Type  Length
# 1 HLA:HLA00001 A*01:01:01:01 1098 bp
# 2 HLA:HLA01244    A*01:01:02  546 bp
# 3 HLA:HLA01971    A*01:01:03  895 bp
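If the unit is not needed, a variation on the same idea is to drop the `Unit` column instead of pasting it back, which leaves `Length` as the integer column that `read.table` inferred. This is a sketch rather than part of the original answer; `stringsAsFactors = FALSE` is added so the text columns are plain character vectors on older R versions too.

```r
sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# read.table splits on whitespace, so the unit lands in its own column
df <- read.table(text = sa,
                 col.names = c("Seq", "Type", "Length", "Unit"),
                 stringsAsFactors = FALSE)

# Drop the constant unit column; Length stays numeric
df$Unit <- NULL

str(df)
```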

user2100721 · Answer 3 · score 3 · answered Jun 23 '16 at 18:25

Use this:

as.data.frame.matrix(do.call(rbind,strsplit(sa,"\\s")))

This gives four columns. OP wants three. – Rich Scriven Jun 23 '16 at 18:40

But apparently nobody cares, haha – Rich Scriven Jun 23 '16 at 19:03
@Richard Scriven: I did that intentionally because the accepted answer has 3 columns but the `bp` part is missing. That's why I am confused about what exactly is needed. By simple manipulation of my code anyone can reach their goal. – user2100721 Jun 23 '16 at 19:03
Thank you guys~, I like your answer too but just prefer to use dplyr package. – David Z Jun 23 '16 at 20:11
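One way the "simple manipulation" mentioned in the comments might look, as a sketch: split on whitespace as in the answer, then paste the last two pieces back together to get the three columns from the question. The names `parts` and `df3` are illustrative.

```r
sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# Character matrix with one row per string and four columns
parts <- do.call(rbind, strsplit(sa, "\\s"))

# Rejoin the number and its unit so the result has three columns
df3 <- data.frame(Seq    = parts[, 1],
                  Type   = parts[, 2],
                  Length = paste(parts[, 3], parts[, 4]),
                  stringsAsFactors = FALSE)

df3
```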

score 0 · Answer 4 · answered Jun 23 '16 at 20:27

Yet another simple solution using stringr:

library(stringr)
df <- as.data.frame(str_split_fixed(sa, " ", 3))
colnames(df) <- c("Seq", "Type", "Length")

#           Seq          Type  Length
#1 HLA:HLA00001 A*01:01:01:01 1098 bp
#2 HLA:HLA01244    A*01:01:02  546 bp
#3 HLA:HLA01971    A*01:01:03  895 bp
