
I have some time series to work on. In particular, I have one univariate time series, saved in a .csv file, consisting of just a single column and containing more than 1M rows. In fact, when I try to open that csv with Excel, I get the "cannot display all records" popup: I can only view 1048576 records. I use R and RStudio for analytics, so I tried to import this dataset into the RStudio environment. Fun fact: I can only view exactly the same number of rows as I did with programs like Excel.

One simple workaround I found was to split the original csv file using the split bash command. So:

split -l 500000 bigdata.csv

produced 4 smaller csv files (the first 3 containing 500k records each), which I easily managed to import as 4 different time series in RStudio (and finally merged, obtaining the wanted result).
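
The merge step looked roughly like this (a sketch, assuming split's default output file names xaa–xad and no header row in the original file):

# read the four pieces produced by `split` and stack them back together
parts <- lapply(c("xaa", "xab", "xac", "xad"), read.csv, header = FALSE)
full <- do.call(rbind, parts)
ts_full <- ts(full[[1]])  # rebuild the univariate time series
nrow(full)                # reports the full row count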

My question is: is there something I can do to avoid all this process and directly load such a dataset with no loss of the final rows? I already tried the data.table library, with the fread() function, to load the dataset, but there was no benefit: the same number of rows were loaded.
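
For reference, the fread() attempt was essentially this (a sketch; the file name and the header = FALSE flag are assumptions):

library(data.table)
dt <- fread("bigdata.csv", header = FALSE)
nrow(dt)  # same truncated row count as with read.csv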

I am using RStudio on a Windows 10 machine, with 6 GB of RAM.

EDIT: I tried the memory.limit() command to check the amount of memory available for RStudio to use. The result is "6072", corresponding to my 6 GB of RAM.

  • When you load the .csv into R not all the rows are imported? – Chabo May 30 '18 at 16:46
  • No, the resulting object in RStudio is truncated, in the same way Excel or Notepad does it. I lose the final 500k observations. –  May 30 '18 at 16:48
  • Are you using a read.csv() line? or importing via file explorer? – Chabo May 30 '18 at 16:59
  • I tried both. No difference. I edited the question to point out that it's not a memory limit matter. –  May 30 '18 at 17:01
  • What is the size of the data (mb?) – Chabo May 30 '18 at 17:01
  • I'm guessing you have checked the length of the df with nrow(), just checking – Chabo May 30 '18 at 17:06
  • No, I have the time series length from the Environment section in RStudio. nrow() result is the same, by the way. –  May 30 '18 at 17:12
  • https://stackoverflow.com/questions/16945348/excel-csv-file-with-more-than-1-048-576-rows-of-data?rq=1 Maybe see this? – Chabo May 30 '18 at 17:16
  • I already saw that discussion. I would rather not move the data to a db just to retrieve them. I was looking for some R library or function that directly allows managing such files. –  May 30 '18 at 17:22
  • I see, so the problem is not really loading the data into R, but creating a csv file with more than the allowed number of rows (Excel, Notepad)? R should be able to load the csv with all the rows, as @Priyanka showed, so it is either a problem with the .csv itself or something in the code. Have we narrowed it down? But then again, splitting it seemed to work, so that has me confused. – Chabo May 30 '18 at 17:27
  • I do not think there can be an issue with the code: I just typed the commands in the RStudio console. On the other hand, the data seem fine too: importing parts of the .csv with the split technique goes smoothly. I am confused too. –  May 30 '18 at 17:30
  • library(ff), your_data <- read.csv.ffdf(file = 'your_file.csv', header = T). Found this, maybe it will work, who knows (see the sketch after these comments). Credit: https://www.biostars.org/p/221009/ – Chabo May 30 '18 at 17:35
  • I am giving it a try, thanks a lot for your help and time. –  May 30 '18 at 17:40
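
A sketch of the ff approach suggested in the last comment (the file name and header = TRUE are assumptions; read.csv.ffdf reads the file in chunks and keeps the data on disk rather than fully in RAM):

library(ff)
big <- read.csv.ffdf(file = "bigdata.csv", header = TRUE)
nrow(big)  # check whether all rows arrive this way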

1 Answer


I just did this; it worked in RStudio and in Visual Studio with R:

df <-  read.csv("P:\\ALL.txt", header = TRUE)

My text file has 1072095 rows and none of them are truncated in 'df'.

– krpa
  • I tried this command too; the result is always the same. I cannot figure out the reason. EDIT: I had to add the parameter "row.names = NULL", because I got the "duplicate 'row.names' are not allowed" error (see the sketch after these comments). –  May 30 '18 at 17:05
  • Can you please post exactly what error you are getting? The problem seems to be specific to the system you are working on. – krpa May 30 '18 at 17:08
  • The fact is I get absolutely no error. I just get a wrong result (loss of the last 500k observations). The csv file contains about 1.5M rows. –  May 30 '18 at 17:11
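
A sketch of the call described in the comments above (the file name is an assumption; row.names = NULL works around the "duplicate 'row.names' are not allowed" error, which usually indicates rows containing an extra delimiter):

df <- read.csv("bigdata.csv", header = TRUE, row.names = NULL)
nrow(df)  # compare against the expected ~1.5M rows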