
I am working on a project where I need to graph 10 days worth of data from remote sites.

I am downloading new data every 30 minutes from a remote computer via FTP (data is written every half hour as well). The remote file path changes every month, so I build the FTP URL dynamically from the current date.

eg.

/data/sitename/2020/July/data.csv
/data/sitename/2020/August/data.csv

My problem is that at each new month the CSV I am downloading will be in a new folder, and the new CSV file will only contain data from the new month, not the previous months.

I need to graph the last 10 days of data. So what I'm hoping to do is download the new data every half hour and only append the newest records to the master data set. Or is there a better way all together?

What I (think I) need to do is download the CSV into R, append only the new data to a master file, and remove the oldest records so the CSV only ever contains 10 days' worth of data. I have searched everywhere but cannot seem to crack it.

This seems like it should be so easy, maybe I am using the wrong search terms.

I would like the following pretty please (showed 10 lines of data, I'll need 480 for 10 days).

INITIAL DATA

                        DateTime   Data1 Data2    Data3  Data4   Data5
641 2020-08-26T02:31:59.999+10:00 10.00      53.4 3.101   42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11      52.0 2.778   44 20.70
643 2020-08-26T03:31:59.999+10:00  2.63     105.5 2.899   45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11      60.5 2.920   45 20.70
645 2020-08-26T04:31:59.999+10:00  3.03     101.3 2.899   48 20.70
646 2020-08-26T05:01:59.999+10:00  2.86     125.2 2.899   49 20.65
647 2020-08-26T05:31:59.999+10:00  2.86     132.2 2.899   56 20.65
648 2020-08-26T06:01:59.999+10:00  3.23     113.9 2.963   61 20.65
649 2020-08-26T06:31:59.999+10:00  3.45     113.9 3.008   64 20.65
650 2020-08-26T07:01:59.999+10:00  3.57     108.3 3.053   66 20.65

NEW DATA

                         DateTime   Data1 Data2    Data3  Data4   Data5
641 2020-08-26T02:31:59.999+10:00 10.00      53.4 3.101   42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11      52.0 2.778   44 20.70
643 2020-08-26T03:31:59.999+10:00  2.63     105.5 2.899   45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11      60.5 2.920   45 20.70
645 2020-08-26T04:31:59.999+10:00  3.03     101.3 2.899   48 20.70
646 2020-08-26T05:01:59.999+10:00  2.86     125.2 2.899   49 20.65
647 2020-08-26T05:31:59.999+10:00  2.86     132.2 2.899   56 20.65
648 2020-08-26T06:01:59.999+10:00  3.23     113.9 2.963   61 20.65
649 2020-08-26T06:31:59.999+10:00  3.45     113.9 3.008   64 20.65
650 2020-08-26T07:01:59.999+10:00  3.57     108.3 3.053   66 20.65
651 2020-08-26T07:31:59.999+10:00  3.85     109.7 3.125   70 20.65

REQUIRED DATA

                         DateTime   Data1 Data2    Data3  Data4   Data5
642 2020-08-26T03:01:59.999+10:00 11.11      52.0 2.778   44 20.70
643 2020-08-26T03:31:59.999+10:00  2.63     105.5 2.899   45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11      60.5 2.920   45 20.70
645 2020-08-26T04:31:59.999+10:00  3.03     101.3 2.899   48 20.70
646 2020-08-26T05:01:59.999+10:00  2.86     125.2 2.899   49 20.65
647 2020-08-26T05:31:59.999+10:00  2.86     132.2 2.899   56 20.65
648 2020-08-26T06:01:59.999+10:00  3.23     113.9 2.963   61 20.65
649 2020-08-26T06:31:59.999+10:00  3.45     113.9 3.008   64 20.65
650 2020-08-26T07:01:59.999+10:00  3.57     108.3 3.053   66 20.65
651 2020-08-26T07:31:59.999+10:00  3.85     109.7 3.125   70 20.65

This is where I am at...

library(RCurl) 
library(readr)
library(ggplot2)
library(data.table) 

# Get the date parts we need
Year <-format(Sys.Date(), format="%Y")
Month <- format(Sys.Date(), format="%B")
MM <- format(Sys.Date(), format="%m")

# Create the file string and read
site <- glue::glue("ftp://user:passwd@99.99.99.99/path/{Year}/{Month}/site{Year}-{MM}.csv")
site <- read.csv(site, header = FALSE)

# Write table and create csv
write.table(site, "EP.csv", col.names = FALSE, row.names = FALSE)
EP <- fread("EP.csv", header = FALSE, select = c(1, 2, 3, 5, 6, 18))
write.table(EP, file = "output.csv", col.names = c("A", "B", etc), sep = ",", row.names = FALSE)
#working up to here

# Append to master csv file
master <- read.csv("C:\\path\\master.csv")
asked by prezzo, edited by MrFlick
2 Answers


You can convert the DateTime column to POSIXct class, combine the new and initial data, and keep only the records from the last 10 days.

library(dplyr)
library(lubridate)

initial_data <- initial_data %>% mutate(DateTime = ymd_hms(DateTime))
new_data <- new_data %>% mutate(DateTime = ymd_hms(DateTime))
combined_data <- bind_rows(new_data, initial_data)

ten_days_data <- combined_data %>% 
                   filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
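A runnable sketch of this approach with made-up toy rows (column names follow the question's data; the 10-day window here is taken relative to the newest record so the 2020 toy data survives the filter — with live data you would use Sys.Date() as above):

```r
library(dplyr)
library(lubridate)

# Toy stand-ins for the two downloaded CSVs, sharing one overlapping timestamp
initial_data <- data.frame(
  DateTime = c("2020-08-26T02:31:59.999+10:00", "2020-08-26T03:01:59.999+10:00"),
  Data1    = c(10.00, 11.11)
)
new_data <- data.frame(
  DateTime = c("2020-08-26T03:01:59.999+10:00", "2020-08-26T03:31:59.999+10:00"),
  Data1    = c(11.11, 2.63)
)

# Parse the ISO-8601 strings; ymd_hms() honours the +10:00 offset
initial_data <- initial_data %>% mutate(DateTime = ymd_hms(DateTime))
new_data     <- new_data %>% mutate(DateTime = ymd_hms(DateTime))

# Stack, drop rows with a duplicated timestamp, keep a trailing 10-day window
combined_data <- bind_rows(initial_data, new_data) %>%
  distinct(DateTime, .keep_all = TRUE)
ten_days_data <- combined_data %>%
  filter(DateTime >= max(DateTime) - days(10))
```

The `distinct(DateTime, .keep_all = TRUE)` step is what prevents the overlap between downloads from producing repeated rows.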
answered by Ronak Shah
  • Hi Ronak, thanks for your help. I have tried your suggestion, but when combining the datasets, the newest is just added to the bottom of the first, with all data repeated, not just the newest data appended. – prezzo Aug 26 '20 at 04:30
  • That's weird. What are `class(new_data)` and `class(initial_data)`? Or instead of `bind_rows` try with `rbind`, i.e. `combined_data <- rbind(new_data, initial_data)` – Ronak Shah Aug 26 '20 at 04:44
  • new_data and initial_data are both data.frame. I just tried with rbind and got the same result. – prezzo Aug 26 '20 at 04:50
  • What is present in `ten_days_data`? It should contain the data from the last 10 days up to today. – Ronak Shah Aug 26 '20 at 04:53
  • ten_days_data did have only the last 10 days of data (with a timezone problem) but also had duplicates due to the same data being present in the combined_data – prezzo Aug 26 '20 at 05:15
  • You can set your timezone by `ymd_hms(DateTime, tz = 'time_zone')`. To remove duplicates try `unique(ten_days_data)` or `ten_days_data %>% distinct`. – Ronak Shah Aug 26 '20 at 05:17
  • I have got it working from this post https://stackoverflow.com/questions/39452952/how-to-rbind-new-rows-from-one-data-frame-to-an-existing-data-frame-in-r but it seems a weird way of doing it to me. ```combined_data <- unique(rbindlist(list(initial_data, new_data)), by = "DateTime")``` – prezzo Aug 26 '20 at 05:27
  • You can use `combined_data <- bind_rows(new_data, initial_data) %>% distinct(DateTime, .keep_all = TRUE)` in `dplyr`. – Ronak Shah Aug 26 '20 at 05:39
  • I tried that and it does work, but took 11 seconds as opposed to 2 seconds for the rbindlist approach. Thanks – prezzo Aug 26 '20 at 06:05

I'll try to answer this by combining the help from Ronak. I am still hopeful that a better solution can be found where I can simply append the new data to the old data.

There were multiple parts to my question and Ronak provided a solution for the last 10 days problem:

ten_days_data <- combined_data %>% 
                   filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))

The second part, combining the data without duplicates, I found in another post: How to rbind new rows from one data frame to an existing data frame in R

combined_data <- unique(rbindlist(list(initial_data, new_data)), by = "DateTime")
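Putting the pieces together, one half-hourly update cycle might look like the sketch below. The rows are made-up stand-ins for the master file and a freshly downloaded CSV, and trimming to the newest 480 records is an alternative to the date filter (480 half-hour intervals = 10 days):

```r
library(data.table)

# Toy stand-ins for the master data and a freshly downloaded CSV
initial_data <- data.table(
  DateTime = c("2020-08-26T02:31:59.999+10:00",
               "2020-08-26T03:01:59.999+10:00"),
  Data1    = c(10.00, 11.11)
)
new_data <- data.table(
  DateTime = c("2020-08-26T03:01:59.999+10:00",
               "2020-08-26T03:31:59.999+10:00"),
  Data1    = c(11.11, 2.63)
)

# Stack both tables, keeping the first row for each timestamp
combined_data <- unique(rbindlist(list(initial_data, new_data)), by = "DateTime")

# Sort chronologically and keep only the newest 480 records (10 days)
setorder(combined_data, DateTime)
combined_data <- tail(combined_data, 480)

# fwrite(combined_data, "master.csv")  # persist for the next half-hourly run
```

Sorting by the ISO-8601 strings works here because all timestamps share the same UTC offset, so lexicographic and chronological order coincide.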
answered by prezzo