0

I am struggling to read a large (22MB) datafile in R with read.csv. I know there are a bunch of similar questions but none of the ones I found seem to help me solve my problem (one, two, three, My file has 6 columns, all but one are integers, and 330,001 rows. I can open the file in excel. There are no empty cells anymore and every column except the first one (id) has NA values.

When I use the following code:

mt <- read.csv("C:/path/master.csv", header=T, sep=",", encoding="utf-8")

I get 79,024 rows Changing this to the following (see here)

mt <- read.csv("C:/Users/slebex/Desktop/PhB Data/master.csv", header=T, sep=",", quote="", encoding="utf-8")

Increases the rowcount to 104,510 but reads all my integers as factors (or characters when I add stringsAsFactors = F).

Additionally, using the following code gives the below warning message and loads a large character of 26,234 elements only

mt <- readLines(file("C:/Users/slebex/Desktop/PhB Data/master.csv", encoding="utf-8"))

Warning messages:
1: invalid input found on input connection 'C:/path/master.csv' 
2: In readLines(file("C:/path/master.csv",  :
  incomplete final line found on 'C:/path/master.csv'

The datatable looks like this head(mt)

  id App_date Grant_date Prior_date Num_inventors Assignee
1  1       NA   18370630         NA             1     <NA>
2  2       NA   18371028         NA             1     <NA>
3  3       NA   18371028         NA             1     <NA>
4  4       NA   18380109         NA             1     <NA>
5  5       NA   18380203         NA             1     <NA>
6  6       NA   18380210         NA             1     <NA>

tail(mt)

         id App_date Grant_date Prior_date Num_inventors
79019 79019 19990826   20010206   19990826             2
79020 79020 19990127   20010206   19920501             2
79021 79021 19980213   20010206   19951002             4
79022 79022 19941108   20010206   19931108             4
79023 79023 19941208   20010206   19901025             1
79024 79024 19980918   20010206   19931214             1
                                                                                                                                 Assignee
79019                                                                                                          Novo Nordisk Biotech, Inc.
79020                                                                                          Trustees Of The University Of Pennsylvania
79021 Mohammad W. Katoot, Katoot, Administrator Karen Robbyn Goodan, Katoot, Administrator Ali Maroof, Katoot, Administrator Ahmed Maroof
79022                                                                                                                   Mcgill University
79023                                                                         The Trustees Of Columbia University In The City Of New York
79024                                                                                                         Centr Embrionalnikh Tkaney 

As you can see, the Assignee variable contains various forms of punctuation. Perhaps this causes a problem I'm not sure (see comments here). I removed all double spaces, changed all commas into semicolons, and removed all quotation marks, but that has not helped.

Following this question i did the following

library(readr)
mt <- read_csv("C:/Users/slebex/Desktop/PhB Data/master.csv")

This gives me the following error

Warning message:
660101 problems parsing 'C:/Users/slebex/Desktop/PhB Data/master.csv'. See problems(...) for more details.

Despite the error the dataset gets fully loaded but two of my columns suddenly consist entirely of NA values.

If relevant find below the sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252   
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Singapore.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.2.0

Any suggestions would be welcomed

Community
  • 1
  • 1
SJDS
  • 1,239
  • 1
  • 16
  • 31
  • Did you try using read.csv without the encoding argument? And try including the strip.white=T argument. – vagabond May 05 '15 at 11:23
  • I did it without encoding and with various variants of encoding (`= Unicode`, and `=utf-8`) that did not work. I never included `strip.white=T` because i had no empty cells but it could work of course. – SJDS May 05 '15 at 11:30
  • Also try read.table from data.tabe – vagabond May 05 '15 at 11:31
  • try `fread()` in the `data.table` package – nsheff May 05 '15 at 11:39
  • Ooh! Interesting problem, it would probably be helpful to raise this case as an issue on readr as I doubt it's come up before. https://github.com/hadley/readr/issues – Shorpy May 05 '15 at 13:30
  • Thanks for all the suggestions! Will raise this issue later on the github @Shorpy . If anyone could provide help with correctly reading the file when there are commas in one column (between " ") which cannot be deleted? When including `quote=""` all rows are read but incorrectly. When that is omitted a subset of rows is not read... – SJDS May 05 '15 at 16:25
  • Thanks guys, I'm trying to work with read.csv as it seems easiest but it remains tricky to deal with commas between quotation marks. Adding `quote = ""` ensures the entire dataset is read but incorrectly. When I omit that R does not read the entire dataset. So I need a specification that recognises commas as separator but not when they are between quotation marks... Any ideas? – SJDS May 05 '15 at 16:27

1 Answers1

0

I found a solution that works but I guess better options to overcome the problem are available. Basically I opened the data in wordpad and realised that the problem with read.csv was that some Assignee names have an arrow symbol in them. I think these get read as breaks or so, which interrupts the dataframe structure. Once all these arrow symbols were deleted I could read in the complete dataset.

SJDS
  • 1,239
  • 1
  • 16
  • 31