
I have about 200 data frames that are each 100000 rows by 45 columns. The columns are all the same. I would like to stack these into one data frame.

(I got the 200 data frames by splitting a LARGE text file into 200 smaller ones and using read.csv().)

Some of the columns contain strings and some contain numbers. I have read this answer and know I shouldn't use rbind() to accomplish this, but I am running into trouble. The V1 columns in my data set contain strings. But when I run my code to insert just the first 100,000 rows:

# load in miniset1
load("filepath.Rda")

filetest <- data.frame(matrix(nrow = 2000000, ncol = 45))
colnames(filetest) <- gsub("X", "V", colnames(filetest))
filetest[1:100000, ] <- miniset1
head(filetest)

...it looks like it is trying to make V1 a number instead of a string. For example, head() prints the number "5777" instead of the name that is actually written there. Is there a way I can specify the column types when I am making the initial matrix? I would rather use the characteristics from one of the data sets than have to go in and manually code whether each column is string or numeric.

garson
  • `matrix(nrow=2000000, ncol = 45)` is (1) a matrix, which means every column has the same type, and (2) all logical-class `NA`s, since that is the default value when no values are passed to `matrix()`. Construct a data.frame with the right column types; don't use such a matrix. – Frank Jul 17 '15 at 18:32
  • Do the column names and types match across all 200 data frames? – ulfelder Jul 17 '15 at 19:46
  • Yes - they do. I am trying Nick Carruthers' answer below, but running into an issue with "invalid factor level, NA generated" – garson Jul 17 '15 at 19:56
  • Did you specify `stringsAsFactors=FALSE` when reading the .csvs into R? It sounds like some of your later frames have values of your string variable that don't appear in the first one. R is reading that string variable as a factor, and then it balks when you try to add a data frame with a string (now a factor level) that wasn't seen in the first one. That's my guess, anyway, but if it's right, then using `stringsAsFactors=FALSE` when you pull in all the .csvs should fix it (see the sketch after these comments). – ulfelder Jul 17 '15 at 20:25
  • Yeah, I think that's it @ulfelder. I didn't use that when I read them in. Thanks – garson Jul 17 '15 at 20:26
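
A minimal sketch of the fix suggested in the comments, assuming the split files are named part001.csv through part200.csv (hypothetical names; adjust to your own):

# stringsAsFactors = FALSE keeps string columns as character instead of factor
files <- sprintf("part%03d.csv", 1:200)
minisets <- lapply(files, read.csv, stringsAsFactors = FALSE)
str(minisets[[1]]$V1)   # should now be chr, not Factor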

3 Answers


You can expand the first dataframe and keep the column types:

# stringsAsFactors = FALSE keeps column A as character, not a factor
first <- data.frame(A = "a", N = 5, stringsAsFactors = FALSE)
# indexing past row 1 pads with NA rows but keeps the column types
filetest <- first[1:2000000, ]

then fill in.
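
A rough sketch of the fill-in step, assuming the chunks were saved as miniset1.Rda through miniset200.Rda (hypothetical names), each holding a data frame named miniset1 ... miniset200, and that filetest has been preallocated with 200 * 100000 rows:

chunk_rows <- 100000
for (i in 1:200) {
  load(sprintf("miniset%d.Rda", i))           # brings miniset<i> into the workspace
  chunk <- get(sprintf("miniset%d", i))
  rows  <- ((i - 1) * chunk_rows + 1):(i * chunk_rows)
  filetest[rows, ] <- chunk                   # assignment is by column position
  rm(list = sprintf("miniset%d", i))          # free the chunk before loading the next one
}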

Nick Carruthers
  • This looks very close to what I want, and I was able to successfully add in the first set of 100000. However, when I go to add in the next one it complains about factors that were not in the first 100000 and gives me the error " In `[<-.factor`(`*tmp*`, iseq, value = invalid factor level." I tried adding factorsAsStrings = FALSE to the data.frame function but no luck. – garson Jul 17 '15 at 20:12

You could just use Reduce() with merge() on a list of the data frames:

big.df <- Reduce(function(...) merge(..., all = TRUE), list(df1, df2))  # ..., through df200

Note, though, that this will work if and only if there are no duplicate rows across all of the data frames. If there are duplicates, this procedure will only return a single row for each set of n duplicates.
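
One way to build that list without typing 200 names, assuming the frames already sit in the workspace as miniset1 through miniset200 (an assumption; adjust to your object names):

df.list <- mget(paste0("miniset", 1:200))
big.df  <- Reduce(function(...) merge(..., all = TRUE), df.list)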

ulfelder

The link you gave warns against growing a data.frame one row at a time; you're not doing that. Personally, I'd read the data in with something that can handle really big .csv files, like sqldf. Here's a link: Quickly reading very large tables as dataframes in R
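
For example, with sqldf the original large file could be pulled in directly instead of stacking 200 pieces (a sketch; "big.csv" is a hypothetical file name):

library(sqldf)
# read the whole file through SQLite in one pass
big.df <- read.csv.sql("big.csv", sql = "select * from file")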

pdb