
I often use the read.csv function to read large CSV files. The files have no header, so I use the col.names parameter to properly define the names of the variables in the data frame created after import.
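
For example, something like this, where the separator and column names are just illustrative placeholders:

df <- read.csv(
    "my_big_data_file.csv",
    header = FALSE,
    sep = "|",
    col.names = c("col1", "col2", "user_account", "col4", "col5")
)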

Today, for the first time, I had to use read.csv.sql, which is available in the sqldf package. The file to import is very big and I only need certain rows from it, based on a condition. According to the online documentation, the filter has to be defined in the WHERE clause of the SELECT statement. Let's say that my file has a column (among others) called user_account and I want to import only the rows where the condition user_account = 'Foo' is satisfied. Therefore, I have to write something like

df <- read.csv.sql(
    "my_big_data_file.csv",
    sql = "select * from file where user_account = 'Foo'",
    header = FALSE,
    colClasses = c(... Here I define column types ...),
    sep = "|",
    eol = "\n"
)

Now the problem is that, unlike read.csv, read.csv.sql apparently has no col.names parameter. And given that my file has no header, I don't know how to refer to the column names. Because I have written user_account in the WHERE clause of the sql parameter in the code above, I get an error message: R complains that there is no such variable.

So, how can I define column names when using read.csv.sql on a CSV file without a header, so that I can refer to those column names in my filter? Is this even possible?

Thanks in advance

user17911
  • after the import, add a line `colnames(df) <- c("col1", "col2", .....)` – Ric Oct 26 '22 at 18:14
  • Does this answer your question? [Changing column names of a data frame](https://stackoverflow.com/questions/6081439/changing-column-names-of-a-data-frame) – Ric Oct 26 '22 at 18:17
  • So if I understand correctly, you suggest that I change the column name after the import. But then how should I specify the filter during the import? With what name? – user17911 Oct 26 '22 at 18:33
  • Then you can filter your data frame with `subset()` function like `df |> subset(col1 == "Ala" & col2 >= 7)`. The other way is to use `dplyr::filter()` function. BTW, for big datasets/data gathered within several sessions, I have found `duckdb` very helpful. – Grzegorz Sapijaszko Oct 26 '22 at 18:59
  • What you're suggesting can be done "after" the data import when the entire huge file has been imported into the dataframe. My intention is to filter "during" the import and pick only rows satisfying the condition in the above-mentioned code. – user17911 Oct 26 '22 at 19:10
  • read.csv.sql no longer works since SQLite's corresponding functionality was changed in a non-backwards compatible way. – G. Grothendieck Oct 27 '22 at 17:30
  • I use R version 4.2.1 with sqldf_0.4-11 and tidyverse_1.3.2. I just imported a rather big data file and it seems to me that it works. Do you have any link to the official documentation saying that it doesn't work anymore? I just posted my solution below as the answer. – user17911 Oct 27 '22 at 17:55
  • @user17911, I just tried it and you are right. It is working again! This must have been addressed in a recent version of RSQLite. sqldf issue #45 on github documented this but now that it seems fixed I have closed that issue. – G. Grothendieck Oct 27 '22 at 22:21
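
For reference, a minimal sketch of the post-import approach suggested in the comments above (file name and column names are placeholders); note that it reads the whole file into memory before filtering, which is exactly what the question tries to avoid:

df <- read.csv("my_big_data_file.csv", header = FALSE, sep = "|")
colnames(df) <- c("col1", "col2", "user_account", "col4", "col5")
df_foo <- subset(df, user_account == "Foo")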

1 Answer


Finally, I found the answer in the documentation of read.csv.sql. Instead of colClasses, one has to use field.types and specify the data types directly as they are defined in SQLite, not in R:

> field.types: A list whose names are the column names and whose contents are the SQLite types (not the R class names) of the columns. Specifying these types improves how fast it takes. Unless speed is very important this argument is not normally used.

SQLite data types are available here

Therefore, I modified my program accordingly:

library(sqldf)  # provides read.csv.sql

df_tmp <- read.csv.sql(
    file = input_file_path,
    sql = "
       select
           *
       from
           file
       where trim(lower(user_account)) = 'foo'",
    header = FALSE,
    sep = "|",
    eol = "\n",
    # column names are defined here, together with their SQLite types
    field.types = list(
        col1 = "TEXT",
        col2 = "TEXT",
        user_account = "TEXT",
        col4 = "REAL",
        col5 = "REAL"
    ),
    dbname = tempfile(),  # store the intermediate SQLite database in a temporary file on disk
    drv = "SQLite"
)

However, at the end I had to explicitly convert one variable via as.numeric, because it had been imported as character. The program indicated this with a clear warning message. So in the end, this solution did the job for me.
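
As an illustration, that final conversion looked roughly like this, with col5 standing in for whichever column had come back as character:

df_tmp$col5 <- as.numeric(df_tmp$col5)
str(df_tmp)  # quick check that the column types are now as expected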

I hope this might help those who have encountered the same problem.

user17911
  • Note that the where condition in the question can never be satisfied since it lower-cases user_account and then, after trimming, compares it to a string that has an upper-case character. Also, just a guess, but perhaps the last line of the file does not end in a newline. That could confuse it and it might think that the last column is not a real. – G. Grothendieck Oct 27 '22 at 23:11
  • Many thanks for the remark. My bad, I typed quickly and didn't notice that in the original post I had written 'Foo' and not 'foo'; I changed it all to lowercase as it was just an example of what I was trying to achieve. As for your other remark about the newline: in this case I can be sure that the program generating these files always inserts a newline at the end, so no problem in that regard. – user17911 Oct 28 '22 at 07:54
  • You might want to check https://github.com/ggrothendieck/sqldf/issues/52 regarding the column type issue. – G. Grothendieck Oct 28 '22 at 12:02
  • The problem with the type I encountered was indeed strange. The last column was supposed to be numeric. After the import, R printed a message saying that in this last column the first encountered value was integer, then real values were also detected, and in the end the whole column was converted to character! I had to explicitly use the as.numeric() function after the import to convert the column back to numeric. So you mean the absence of one or several newlines at the end of one or several rows in the data file could have caused this problem (= line_N concatenated with line_N+1)? – user17911 Oct 28 '22 at 17:39
  • I think it only checks the first row and if that row differs in type from the rest then you can get problems. Also for completely different approaches have a look at https://stackoverflow.com/questions/73972162/trying-to-read-20gb-of-data-read-csv-sql-produces-errors/73978487#73978487 – G. Grothendieck Oct 28 '22 at 18:12