
I have a number of large data files (.csv) on my local drive that I need to read in R, filter rows/columns, and then combine. Each file has about 33,000 rows and 575 columns.

I read this post: Quickly reading very large tables as dataframes and decided to use "sqldf".

This is the short version of my code:

Housing <- file("file location on my disk")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing', file.format = list(eol = "/n"))  # I am using Windows

I see that a `Housing_filtered` data frame is created with `Var1`, but it has zero observations. This is my very first experience with sqldf, and I am not sure why zero observations are returned.

I also tried `read.csv.sql` and still get zero observations:

Housing_filtered <- read.csv.sql(file = "file location on my disk",
                    sql = "select Var01 from file", 
                    eol = "/n",
                    header = TRUE, sep = ",")
Hessam
  • Welcome to StackOverflow! We actually ask for debugging questions to have *reproducible* examples. Without your data we can't know the problem. However I would bet that it's a problem with the format that your file is expecting. Is the `class(Housing)` a data.frame? if so, what is the `dim(Housing)`? Can you import your file with `read.csv()` so that we know what we're getting? – Hack-R Jun 16 '18 at 20:15
  • Note that newline is `\n` and not `/n`. – G. Grothendieck Jun 16 '18 at 20:36
  • @G. Grothendieck: When I change `/n` to `\n`, I get the following error: `Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: file_location_and_name.csv line 2 expected 575 columns of data but found 747` – Hessam Jun 16 '18 at 21:41
  • @Hack-R: Thanks. I am not sure how to make my question reproducible since I am using a large local file. To answer your questions: after I run my first example, `dim(Housing)` returns `NULL` and `class(Housing)` returns `[1] "file" "connection"`. I can open the file with `read.csv()`, which gives me a data.frame of dim 33121 by 575. – Hessam Jun 16 '18 at 22:20
  • @Hessam Right, ok, so the file was never imported. – Hack-R Jun 16 '18 at 22:35

1 Answer


You never actually imported the file as a data.frame the way you think you did.

You've only opened a connection to a file. You mentioned that it is a CSV. If it is a normal CSV file, your code should look something like this:

Housing          <- read.csv("my_file.csv")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing')

If there's something non-standard about this CSV file please mention what it is and how it was created.

Also, to another point made in the comments: if you do for some reason need to specify the line ending manually, use `\n` where you were using `/n`. The new error is not caused by that change; rather, fixing the line ending gets you past one problem and on to the next, probably improperly handled missing values, spaces, or commas inside quoted text fields.
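For reference, here is what a corrected `read.csv.sql` call would look like, with a backslash in the line ending. (The file path is a placeholder, and the column name `Var01` is taken from your question; on a Windows-created file you may need `eol = "\r\n"` instead.)

```r
library(sqldf)

# Note the backslash: "\n", not "/n". read.csv.sql loads the file into a
# temporary SQLite database and runs the query there, so only the selected
# column ever reaches R.
Housing_filtered <- read.csv.sql(
  file   = "my_file.csv",            # replace with your actual path
  sql    = "SELECT Var01 FROM file", # the table is always named `file`
  eol    = "\n",
  header = TRUE,
  sep    = ","
)
```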

If there are still data errors after that, can you please use R code to create a small file that reflects the relevant characteristics of your data and reproduces the same error on import? That would help.
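As a sketch of what such a reproducible example could look like (column names and values here are made up; the point is to mimic your file's structure at a small scale and show the failing import):

```r
library(sqldf)

# Build a tiny stand-in for the real data and write it to a temp CSV
set.seed(1)
small <- data.frame(Var01 = rnorm(5), Var02 = letters[1:5])
tmp   <- tempfile(fileext = ".csv")
write.csv(small, tmp, row.names = FALSE)

# Then run the same import that fails on the real file
out <- read.csv.sql(file = tmp, sql = "SELECT Var01 FROM file")
```

If this small version imports cleanly but the real file does not, the difference between the two (embedded commas, quoting, line endings) is where the bug lives.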

Hack-R
    It was a normal CSV and your code worked. `"\n"` also worked with no error. Thanks! I was trying to avoid `read.csv` because it takes a long time to load my files (I have to read about 10 CSV files of dim 33121 by 575). That's why I also tried `read.csv.sql`, which loads each file into a temporary SQLite database and then reads it into R, but I got this error message: `Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: file_location_and_name.csv line 2 expected 575 columns of data but found 747`. Let me create a reproducible example and post it again. – Hessam Jun 17 '18 at 00:38