separating into columns in r - delimiter changes

Question

I want to split values contained in a single column into new columns.

I have some data that looks like this in a file: > df V1 1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050 2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397 3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474

I'd like to separate each value into a new column : V1, V2, V3, V4, V5 and V6

I tried :
df2 <- data.frame(do.call('rbind', strsplit(as.character(df$V1), ' ', fixed = FALSE)))

I end up with output like this:

X1 X2 X3 X4 X5 X6 1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397 3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474 X7 X8 1 14050 00006303657102064942660780914135165036 2 00006319625527159782351492300309533775 12867 3 00006327933867965144524703512179615086 12867

Some of the v1 values end up in other columns. It may be happening because there is no space at the end of the row. How can i execute this correctly?

thanks

Would it work if you used `read.table()` with `sep = ' '`? `read.csv()` expects a comma as a delimeter. Also, the first row would have 7 columns. Are all of the values in the first row intentional? — blakeoft, Jan 22 '15 at 19:53
did try: getting this error: `Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 30 elements` — vagabond, Jan 22 '15 at 19:57
R expects each row to have the same number of columns. Either way, if you chop off the last value of your first row (14050), then you're `df2 <- ...` line works. I'm not sure what your whole data set looks like so it's a little difficult to help more. — blakeoft, Jan 22 '15 at 20:01
the whole data set is much larger with each row having a varying number of columns - basically the "00000xxxxxxxxxxxxxxx . .. " is the ID and the rest "12867" , "14050" etc. are groups to which the ID belongs. Each ID can belong to a different number of groups. The row count is in several millions. The exact column count is unknown. — vagabond, Jan 22 '15 at 20:10
I'm going to try and extract all numbers with the pattern of starting with 4 0s from the data frame. — vagabond, Jan 22 '15 at 20:12
Maybe a list would be a better fit for your data set. You could potentially have a row with many NAs in a data frame, which is undesirable. — blakeoft, Jan 22 '15 at 20:41

score 1 · Answer 1 · answered Jan 22 '15 at 20:06

library(tidyr)
library(dplyr)

df <- read.table(
  header = FALSE, 
  text = "
00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050
00006319625527159782351492300309533775 12867 15473 13678 13497 15397
00006327933867965144524703512179615086 12867 14245 15397 15473 15474
",
  sep = "\n"
  )

df %>%
  separate(
    V1, 
    into = paste0("V", 1:7),
    # 'extra' allows the number of columns to differ by row
    extra = "drop"
    )

                                      V1    V2    V3    V4    V5    V6    V7
1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050
2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397  <NA>
3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474  <NA>

lukeA · Accepted Answer · 2015-01-22T20:37:44.160

1

Good old plyr works, too:

txt <- readLines(n = 3)
1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050 
2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397 
3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474

library(plyr)
rbind.fill(
  lapply(
    strsplit(txt, " "), 
    function(y) {
      as.data.frame(t(y),stringsAsFactors=FALSE) # via @Arun http://stackoverflow.com/questions/17308551/do-callrbind-list-for-uneven-number-of-column
    }
  )
)
#   V1                                     V2    V3    V4    V5    V6    V7    V8
# 1  1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050
# 2  2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397  <NA>
# 3  3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474  <NA>

edited Jan 22 '15 at 20:37

answered Jan 22 '15 at 20:26

lukeA

53,097
5
97
100

one problem - The file is very big - I don't know if it is feasible to paste millions of lines into `readLines` and I don't know the exact value of n. – vagabond Jan 22 '15 at 20:33
1

@vagabond - If you don't specify `n` in your call, then `readLines()` will read until the end of the connection. No need to paste. – davechilders Jan 22 '15 at 20:40
the txt file is unexpectedly large - I hit my memory limit of 8gb. apart from getting more memory - is there any way of implementing this function in parts? – vagabond Jan 22 '15 at 21:38

separating into columns in r - delimiter changes

2 Answers2