
I am working with a data set of 300 million rows combined, split over 5 CSV files. The data contains weight measurements of users over 5 years (one file per year). As calculations take ages on this massive data set, I would like to develop my code on a subset of users. I've used the `nrows` argument to import only the first 50,000 lines of each file. However, one user may have 400 weight measurements in the file for year 2014 but only 240 in year 2015, so I don't get the same set of users from each file when I import with `nrows`. Is there a way to import the data of the first 1000 users in each file? The data looks like this in all files:

user_ID                                         date_local    weight_kg
0002a3e897bd47a575a720b84aad6e01632d2069        2016-01-07    99.2         
0002a3e897bd47a575a720b84aad6e01632d2069        2016-02-08    99.6
0002a3e897bd47a575a720b84aad6e01632d2069        2016-02-10    99.5  
000115ff92b4f18452df4a1e5806d4dd771de64c        2016-03-13    99.1     
000115ff92b4f18452df4a1e5806d4dd771de64c        2016-04-20    78.2    
000115ff92b4f18452df4a1e5806d4dd771de64c        2016-05-02    78.3       
000115ff92b4f18452df4a1e5806d4dd771de64c        2016-05-07    78.9       
0002b526e65ecdd01f3a373988e63a44d034c5d4        2016-08-15    82.1       
0002b526e65ecdd01f3a373988e63a44d034c5d4        2016-08-22    82.6     

Thanks a lot in advance!

Ke_Fr
  • @neilfws I believe OP wants to read as many rows as necessary such that he has complete data for 1000 users (where data for a single user spans multiple rows). He states that "*[he's] used the nrows function to import only the first 50000 lines of each file. However, this means that I don't always have the complete data set of the included users.*" So I'm not sure this is an exact duplicate question. – Maurits Evers Aug 29 '18 at 23:02
  • 1
    @MauritsEvers Oops, you are correct :) re-opened – neilfws Aug 29 '18 at 23:03
  • @MauritsEvers thanks for reopening. I know the nrows function, but it doesn't solve my problem as users don't always have the same amount of rows in each file, i.e. one user may have 400 weight measurements in the file for year 2014 but only 240 in year 2015. I therefore don't get the same set of users from each file when I import with the nrows function. – Ke_Fr Aug 29 '18 at 23:05
  • @neilfws Thanks; my first reaction was also "easy and has been asked before";-) perhaps `readLines` can be useful here... – Maurits Evers Aug 29 '18 at 23:07

1 Answer


If you have grep on your system you can combine it with `pipe` and `read.csv` to read only rows that match a pattern. Using your example data, you could read only users 001 and 002 like this. You'll need to add the headers back later, as the header line won't match the pattern.

mydata <- read.csv(pipe('grep "^00[12]" "mydata.csv"'),
                   colClasses = c("character", "Date", "numeric"),
                   header = FALSE)

I'm not sure what the pattern is for your user_ID: you give 001 as an example but state that you want the first 1000. If that is 0001 - 1000, a pattern might be something like `^[01][0-9]{3}` with `grep -E` (plain grep treats `{3}` literally unless written `\{3\}`); note it also matches some IDs outside that range, such as 0000 or 1001-1999.
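If the goal is literally "the first 1000 users in file order" rather than an ID pattern, a streaming filter can stop as soon as enough distinct IDs have been seen. A hedged sketch with awk (assuming whitespace-separated data as in the sample; add `-F','` for true comma-separated CSV), shown with 3 users for brevity:

```shell
# Keep rows belonging to the first 3 distinct user_IDs encountered;
# stop reading the file as soon as a 4th distinct ID appears.
awk 'NR == 1 { next }                                  # skip the header
     !($1 in seen) { if (++n > 3) exit; seen[$1] = 1 } # track new IDs
     { print }' mydata.csv
```

Because awk exits early it never scans the full file, and its output can be handed to R through `pipe()` in the same way as the grep command above.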

neilfws
  • Thanks for coming up with a solution. Unfortunately the user_IDs are a bit more complicated. I wanted to make the sample data simple, sorry about that. In reality the IDs are a random combination of 40 letters and numbers, e.g. 000115ff92b4f18452df4a1e5806d4dd771de64c or 0002a3e897bd47a575a720b84aad6e01632d2069 – Ke_Fr Aug 29 '18 at 23:44
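Since the real IDs are opaque 40-character hashes, no regex describes "the first 1000 users", but a two-step variant of this answer still works: collect the first 1000 distinct IDs from one file, then filter all five files against that list with fixed-string matching. The filenames below (`weights_2014.csv` etc.) are hypothetical, and whitespace-separated input is assumed (add `-F','` to awk for true CSV):

```shell
# Step 1: the first 1000 distinct user_IDs, in order of appearance.
awk 'NR > 1 && !($1 in seen) { seen[$1] = 1; print $1; if (++n == 1000) exit }' \
    weights_2014.csv > ids.txt

# Step 2: keep only rows for those users in every yearly file.
# -F treats each ID as a fixed string; a 40-character random hash is
# effectively impossible to match by accident elsewhere in a row.
for f in weights_201[4-8].csv; do
  grep -F -f ids.txt "$f" > "subset_$f"
done
```

Each `subset_*.csv` then contains the complete measurement history of the same set of users across all years and is small enough to load with `read.csv()` (re-adding the header, since the filtered files no longer have one).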