1

I have a text file (76 rows) in CSV format containing following columns. Date contains the range of years (2003 - 2012 in this case) and ID contains the year of every row.

Date                    ID                values
2003-06-07 00:00:00     1697144#6_2003    240
2004-01-01 00:00:00     1697144#6_2004    240
2005-05-27 00:00:00     1697149#6_2005    240
2006-01-01 00:00:00     1697149#6_2006    240
2007-01-01 00:00:00     1697149#6_2007    240
2008-01-01 00:00:00     1697149#6_2008    240
2009-01-01 00:00:00     1697149#6_2009    240
2010-01-01 00:00:00     1697149#6_2010    240
2011-01-01 00:00:00     1697149#6_2011    240
2012-01-01 00:00:00     1697149#6_2012    240
2003-06-07 00:00:00     1697158#6_2003    240
2004-01-01 00:00:00     1697158#6_2004    240
2003-06-07 00:00:00     1697163#6_2003    240

Using R, I would like to break this text file into multiple files of each year so that every file (for example: my_file_2003.csv) contains unique records (date, id and values) of that specific year only. This question (Stack Overflow) demonstrates how to split R dataframe into multiple files. Also this link demonstrates splitting large CSV files but in my case I need to merge all records of a specific year into a single file. Being a newbie to R, can someone help me how to break this single text file into multiple files of each year? I am using R version 3.2.3 on Windows 7 (x64).

Community
  • 1
  • 1
khajlk
  • 791
  • 1
  • 12
  • 32

2 Answers2

1

All you have to do is read the csv and then subset your dataframe so that only rows with matching years are written. This should work, but I'm a bit rusty at R so I may have made a dumb syntax error somewhere.

df <- read.csv("path.csv")
uniqueYears <- unique(as.numeric(format(strptime(df$date, format="%Y-%m-%d %H:%M:%S"), format="%Y")))
for(i in uniqueYears){
    yeardf <- df[as.numeric(format(strptime(df$date, format="%Y-%m-%d %H:%M:%S"), format="%Y")) == i,]
    write.csv(yeardf, paste("path", i, ".csv", sep=""))
}

Edited based on your comment. Probably not the most efficient solution, but it should work fine.

  • No you did not make a syntax error 'cause it worked for me. However, you restricted the loop from 2003 to 2012 which may not be the case always. How do I make this loop flexible so that it searches for min and max value of years (in input file column) and then generates the output files automatically? – khajlk Aug 12 '16 at 17:39
  • Edited to unrestrict the loop. – Henry Prickett-Morgan Aug 12 '16 at 17:44
  • 1
    BTW you can do this much more efficiently if the data is ordered by year and has every year in between, by just grabbing the year from the first and last values and just looping over `firstYear:lastYear`, but this is more robust if it is unordered or missing years. – Henry Prickett-Morgan Aug 12 '16 at 17:51
  • +1 for your suggestion. Now a syntax error. Unexpected "{" error. Im trying to figure it out. I hope its not data formatting error. – khajlk Aug 12 '16 at 18:00
  • Thank you for your help. Resolved the error and it's working for me. – khajlk Aug 13 '16 at 10:26
1

With dplyr and lubridate:

Toy data:

dat <- data.frame(date = seq.Date(from = as.Date("2010-01-01"), 
                                  to = as.Date("2013-01-01"), length.out = 10), 
                  data = letters[1:10])
dat

         date data
1  2010-01-01    a
2  2010-05-02    b
3  2010-09-01    c
4  2011-01-01    d
5  2011-05-03    e
6  2011-09-01    f
7  2012-01-01    g
8  2012-05-02    h
9  2012-09-01    i
10 2013-01-01    j

To write the .csv files (each is named YEAR.csv and saved to the working directory):

library(dplyr)
library(lubridate)
dat %>% 
group_by(year = year(date)) %>% 
do(df = data.frame(.)) %>% 
do(csvs = write.csv(x = .$df, file = paste0(.$year, ".csv")))
Michael Veale
  • 929
  • 4
  • 11