
I have a dataset with more than 200k rows that is updated constantly. I only need the last 3000 rows. Reading all 200k rows and then filtering the dataframe down to 3000 is time consuming. Instead, I want to read the last 3000 rows directly. Is there a way to achieve this?

Thanks in advance.

Turan
  • https://stackoverflow.com/questions/62529816/skip-rows-while-use-read-excel-or-read-excel-in-r – Julian Aug 03 '22 at 07:00
  • Does this answer your question? [Skip rows while use read\_excel or read.excel in R](https://stackoverflow.com/questions/62529816/skip-rows-while-use-read-excel-or-read-excel-in-r) – deschen Aug 03 '22 at 07:01
  • Thanks for your replies. I need a generic solution. Since my dataset is updated every second, I don't know the number of rows, so I cannot use `skip` because I don't know how many rows to skip. For example, if my dataset has 5k rows I have to skip 2k, but if it has 10k rows I have to skip 7k. – Turan Aug 03 '22 at 07:05
  • What OS are you using? Sounds like basically you want the unix `tail` command. There's a question about that [here](https://stackoverflow.com/questions/35891166/r-use-shell-command-to-tail) though with no accepted answer. On Linux/Mac it should be fairly straightforward and also I suspect doable on Windows 10+. – SamR Aug 03 '22 at 07:13
  • Is it a .xls, .xlsx or .csv file ? – Basti Aug 03 '22 at 07:15
  • I use Windows 10 and the file is .xlsx – Turan Aug 03 '22 at 07:16
  • Ah you can only use `tail` with plain text files so it won't work with a `.xlsx` file. You could open the whole file, check how many rows it has, then open it again only using the last rows. But if it's being updated every second and it takes more than a second to open (which is likely for a file large enough to need to do this) then you're going to have problems... Also by the time you've opened the whole file there's not much point loading it again. – SamR Aug 03 '22 at 07:19

1 Answer


This would be my approach. First read in only the first column of each dataset and store the number of rows (`nrow`) in a list. Then, using `map2`, read in the data while using `no_rows` to compute how many rows to skip.

library(readxl)
library(purrr)

# list.files() takes a regex, not a glob, so escape the dot
files <- list.files(pattern = "\\.xlsx$")

# count the data rows cheaply by reading only the first column of each file
no_rows <- map(files, ~nrow(readxl::read_excel(.x, range = cellranger::cell_cols(1))))

# read in the last three thousand rows of each file;
# skip = .y - 2999 also skips the header row, so col_names = FALSE
# (otherwise the first skipped-to data row would be used as column names)

map2(files, no_rows, ~readxl::read_excel(.x, skip = .y - 2999, col_names = FALSE))
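If you also want the original column names back, one option (a sketch, assuming the header sits in the first row of each sheet; `read_last_n` is a hypothetical helper name) is to read the header separately with `n_max = 0` and re-apply it after skipping:

```r
library(readxl)
library(purrr)

read_last_n <- function(path, n = 3000) {
  # total number of data rows, counted cheaply from the first column only
  total <- nrow(read_excel(path, range = cellranger::cell_cols(1)))
  # n_max = 0 reads no data rows, so only the header row is parsed
  header <- names(read_excel(path, n_max = 0))
  # skip the header plus all but the last n data rows
  out <- read_excel(path, skip = total - n + 1, col_names = FALSE)
  names(out) <- header
  out
}

last_rows <- map(list.files(pattern = "\\.xlsx$"), read_last_n)
```

This still has to parse the whole sheet twice (xlsx is a zipped XML format, so there is no true `tail`-style random access), but it avoids materialising more than one extra column in memory.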
Julian