
I have a .csv file with 1009725 rows and 85 columns. I am trying to read it with Microsoft R Open, using the following command:

data <- read.csv("C:/Users/username/Desktop/data.csv")

But only 617200 rows (~65%) get read. I am not sure why the file is not read completely. The columns are mostly integers, such as IDs and values, and some contain text. Can anybody help me diagnose the problem?

Also, even for the 617200 rows that were read, I am having serious performance issues. Even basic commands such as

nrow(data) or length(unique(data$column1))

put RStudio into a "Not Responding" state. My system has 16 GB of RAM and an i7 quad-core processor, which I feel should be sufficient to crunch this data. Why am I not able to run even basic commands on the partial data that was read? Can anybody help me diagnose both problems?

Thanks

haimen
  • Not enough info: Is the data numeric or text? What other programs are running in the background, and on which operating system? (And please do not use comments to clarify questions.) Have you done any searching on SO, for example "[r] performance issues"? – IRTFM Apr 12 '16 at 20:01
  • Does this persist in a new/empty R-session? – Heroka Apr 12 '16 at 20:01
  • @42- I have added the data types. The only background applications are Outlook and Chrome. I searched a bit and tried tweaking a few parameters without much success. – haimen Apr 12 '16 at 20:13
  • @Heroka Yes, it does the same. – haimen Apr 12 '16 at 20:13
  • The error is probably bad data somewhere; the problem is finding it. Start with a small sample: cut the file down to a file of only 100 lines and all the columns. See if you can read that, all of it, and whether the column sums agree with a spreadsheet read. If that doesn't work, you should be able to spot the error by eye. If it works, double the size and check again; any problems should again be visible in the spreadsheet. Corrections can be made with tools like awk or sed. Rinse and repeat; you are likely to find the problems very soon (see the sketch after these comments). – Mike Wise Apr 12 '16 at 20:19
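
A minimal R sketch of the divide-and-check idea from the last comment, using read.csv's nrows argument instead of physically splitting the file (the chunk sizes are arbitrary assumptions; the path is from the question):

    # Read progressively larger chunks and report how many rows actually come back.
    # The chunk where the returned count falls short of the requested count
    # brackets the bad data.
    path <- "C:/Users/username/Desktop/data.csv"
    for (n in c(100, 1000, 10000, 100000, 1000000)) {
        chunk <- read.csv(path, nrows = n)
        cat("requested", n, "rows, got", nrow(chunk), "\n")
    }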

1 Answer


Adding a colClasses parameter to a read.csv call is likely to improve speed. Several other useful answers appear in Quickly reading very large tables as dataframes in R.
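
A minimal sketch of that approach, reading a small sample first and reusing its column classes for the full read (the sample size of 100 rows is an assumption; the path is from the question):

    # Infer column classes from a small sample, then pass them to the full read.
    # stringsAsFactors = FALSE keeps the text columns as plain character vectors.
    sample_rows <- read.csv("C:/Users/username/Desktop/data.csv",
                            nrows = 100, stringsAsFactors = FALSE)
    classes <- sapply(sample_rows, class)
    data <- read.csv("C:/Users/username/Desktop/data.csv",
                     colClasses = classes, stringsAsFactors = FALSE)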

To check the line structure, I find it useful to run count.fields, varying the comment.char argument (default "#") and the quote argument (default "\""):

 table( count.fields("C:/Users/username/Desktop/data.csv", 
                            comment.char="", quote="", sep="," ))

The table wrapper keeps the output from being one long vector of integers and summarizes how consistent your line structure is; with 85 columns, every line should have 85 fields. After identifying a problematic field count (such as a short line with only 30 fields) you can locate those lines with:

which( count.fields("C:/Users/username/Desktop/data.csv", 
                            comment.char="", quote="", sep="," ) == 30)
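
To look at the raw text of those lines, something like the following can help (a sketch; bad_lines is just an assumed name for the result of the which() call above):

    # Store the problem line numbers, then print the raw lines for inspection.
    bad_lines <- which( count.fields("C:/Users/username/Desktop/data.csv", 
                                comment.char="", quote="", sep="," ) == 30)
    readLines("C:/Users/username/Desktop/data.csv")[bad_lines]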
IRTFM