0

I want to split values contained in a single column into new columns.

I have some data that looks like this in a file: > df V1 1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050 2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397 3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474

I'd like to separate each value into a new column : V1, V2, V3, V4, V5 and V6

I tried :
df2 <- data.frame(do.call('rbind', strsplit(as.character(df$V1), ' ', fixed = FALSE)))

I end up with output like this:

X1 X2 X3 X4 X5 X6 1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397 3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474 X7 X8 1 14050 00006303657102064942660780914135165036 2 00006319625527159782351492300309533775 12867 3 00006327933867965144524703512179615086 12867

Some of the v1 values end up in other columns. It may be happening because there is no space at the end of the row. How can i execute this correctly?

thanks

vagabond
  • 3,526
  • 5
  • 43
  • 76
  • Would it work if you used `read.table()` with `sep = ' '`? `read.csv()` expects a comma as a delimeter. Also, the first row would have 7 columns. Are all of the values in the first row intentional? – blakeoft Jan 22 '15 at 19:53
  • did try: getting this error: `Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 30 elements` – vagabond Jan 22 '15 at 19:57
  • R expects each row to have the same number of columns. Either way, if you chop off the last value of your first row (14050), then you're `df2 <- ...` line works. I'm not sure what your whole data set looks like so it's a little difficult to help more. – blakeoft Jan 22 '15 at 20:01
  • the whole data set is much larger with each row having a varying number of columns - basically the "00000xxxxxxxxxxxxxxx . .. " is the ID and the rest "12867" , "14050" etc. are groups to which the ID belongs. Each ID can belong to a different number of groups. The row count is in several millions. The exact column count is unknown. – vagabond Jan 22 '15 at 20:10
  • I'm going to try and extract all numbers with the pattern of starting with 4 0s from the data frame. – vagabond Jan 22 '15 at 20:12
  • 1
    Maybe a list would be a better fit for your data set. You could potentially have a row with many NAs in a data frame, which is undesirable. – blakeoft Jan 22 '15 at 20:41

2 Answers2

1
library(tidyr)
library(dplyr)

df <- read.table(
  header = FALSE, 
  text = "
00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050
00006319625527159782351492300309533775 12867 15473 13678 13497 15397
00006327933867965144524703512179615086 12867 14245 15397 15473 15474
",
  sep = "\n"
  )

df %>%
  separate(
    V1, 
    into = paste0("V", 1:7),
    # 'extra' allows the number of columns to differ by row
    extra = "drop"
    )

                                      V1    V2    V3    V4    V5    V6    V7
1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050
2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397  <NA>
3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474  <NA>
davechilders
  • 8,693
  • 2
  • 18
  • 18
1

Good old plyr works, too:

txt <- readLines(n = 3)
1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050 
2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397 
3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474

library(plyr)
rbind.fill(
  lapply(
    strsplit(txt, " "), 
    function(y) {
      as.data.frame(t(y),stringsAsFactors=FALSE) # via @Arun http://stackoverflow.com/questions/17308551/do-callrbind-list-for-uneven-number-of-column
    }
  )
)
#   V1                                     V2    V3    V4    V5    V6    V7    V8
# 1  1 00006303657102064942660780914135165036 12867 15476 15473 15474 15397 14050
# 2  2 00006319625527159782351492300309533775 12867 15473 13678 13497 15397  <NA>
# 3  3 00006327933867965144524703512179615086 12867 14245 15397 15473 15474  <NA>
lukeA
  • 53,097
  • 5
  • 97
  • 100
  • one problem - The file is very big - I don't know if it is feasible to paste millions of lines into `readLines` and I don't know the exact value of n. – vagabond Jan 22 '15 at 20:33
  • 1
    @vagabond - If you don't specify `n` in your call, then `readLines()` will read until the end of the connection. No need to paste. – davechilders Jan 22 '15 at 20:40
  • the txt file is unexpectedly large - I hit my memory limit of 8gb. apart from getting more memory - is there any way of implementing this function in parts? – vagabond Jan 22 '15 at 21:38