I am struggling to read a large (22MB) datafile in R with read.csv
. I know there are a bunch of similar questions but none of the ones I found seem to help me solve my problem (one, two, three,
My file has 6 columns, all but one are integers, and 330,001 rows.
I can open the file in excel. There are no empty cells anymore and every column except the first one (id) has NA
values.
When I use the following code:
mt <- read.csv("C:/path/master.csv", header=T, sep=",", encoding="utf-8")
I get 79,024 rows Changing this to the following (see here)
mt <- read.csv("C:/Users/slebex/Desktop/PhB Data/master.csv", header=T, sep=",", quote="", encoding="utf-8")
Increases the rowcount to 104,510 but reads all my integers as factors (or characters when I add stringsAsFactors = F
).
Additionally, using the following code gives the below warning message and loads a large character of 26,234 elements only
mt <- readLines(file("C:/Users/slebex/Desktop/PhB Data/master.csv", encoding="utf-8"))
Warning messages:
1: invalid input found on input connection 'C:/path/master.csv'
2: In readLines(file("C:/path/master.csv", :
incomplete final line found on 'C:/path/master.csv'
The datatable looks like this
head(mt)
id App_date Grant_date Prior_date Num_inventors Assignee
1 1 NA 18370630 NA 1 <NA>
2 2 NA 18371028 NA 1 <NA>
3 3 NA 18371028 NA 1 <NA>
4 4 NA 18380109 NA 1 <NA>
5 5 NA 18380203 NA 1 <NA>
6 6 NA 18380210 NA 1 <NA>
tail(mt)
id App_date Grant_date Prior_date Num_inventors
79019 79019 19990826 20010206 19990826 2
79020 79020 19990127 20010206 19920501 2
79021 79021 19980213 20010206 19951002 4
79022 79022 19941108 20010206 19931108 4
79023 79023 19941208 20010206 19901025 1
79024 79024 19980918 20010206 19931214 1
Assignee
79019 Novo Nordisk Biotech, Inc.
79020 Trustees Of The University Of Pennsylvania
79021 Mohammad W. Katoot, Katoot, Administrator Karen Robbyn Goodan, Katoot, Administrator Ali Maroof, Katoot, Administrator Ahmed Maroof
79022 Mcgill University
79023 The Trustees Of Columbia University In The City Of New York
79024 Centr Embrionalnikh Tkaney
As you can see, the Assignee
variable contains various forms of punctuation. Perhaps this causes a problem I'm not sure (see comments here). I removed all double spaces, changed all commas into semicolons, and removed all quotation marks, but that has not helped.
Following this question i did the following
library(readr)
mt <- read_csv("C:/Users/slebex/Desktop/PhB Data/master.csv")
This gives me the following error
Warning message:
660101 problems parsing 'C:/Users/slebex/Desktop/PhB Data/master.csv'. See problems(...) for more details.
Despite the error the dataset gets fully loaded but two of my columns suddenly consist entirely of NA
values.
If relevant find below the sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C
[5] LC_TIME=English_Singapore.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.0
Any suggestions would be welcomed