I have a huge dataset and I would like to insert a dummy variable column based on a set of conditions:

My main df (A) has 5 million rows and 10 columns, 4 of which are date, hour, minute and second; the data run from 2020 to 2023.

(image of the data, not reproduced here)

On the other df (B) I have the same columns but I have only 30 rows.

I want A to look at B and put a 1 in every row where date, hour, minute and second match those of a row in B, and a 0 in all the rest. So, in the end, I should end up with a column containing 30 ones and 4,999,970 zeros.

Even better would be to have date, hour and minute match exactly, and second match "more or less" (say, within +/- 5 seconds).

Can you help please?

I thought a solution could have been:

A$dummy <- for (i in A){

ifelse("A$date"=="B$date"&"A$hour"=="B$hour"&
"A$minute"=="B$minute"&or("A$second">="B$second"-5,"A$second"<="B$second"+5),1,0)
}
Mining
  • Convert your date/hour/minute/second columns to a POSIX timestamp, and then join/merge both data frames, using the answers in https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right – Wimpel Aug 25 '23 at 08:31
  • 1) `A$dummy <- for` is meaningless, `for` always returns `NULL`; 2) `i in A` gives you all columns of `A`, one at a time; 3) `"A$date"`, `"B$date"` are strings not df columns and `"A$date"=="B$date"` always compares to false. – Rui Barradas Aug 25 '23 at 11:18
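
The two comments above can be sketched in base R as follows. This is only a minimal illustration under assumptions: the toy data frames, the helper `make_ts()`, the column name `ts` and the UTC time zone are all hypothetical and would need adapting to the real data. For a plain 0/1 flag, `%in%` on the timestamps gives the same result as a left join followed by recoding the NAs.

```r
# Toy data (hypothetical values standing in for the real A and B)
A <- data.frame(
  date   = as.Date(c("2023-08-25", "2023-08-25", "2023-08-26")),
  hour   = c(4L, 4L, 10L),
  minute = c(14L, 14L, 30L),
  second = c(5L, 9L, 0L)
)
B <- A[2, ]  # one event that matches the second row of A exactly

# Hypothetical helper: build one POSIXct timestamp from the four columns
make_ts <- function(df) {
  as.POSIXct(
    sprintf("%s %02d:%02d:%02d", format(df$date), df$hour, df$minute, df$second),
    tz = "UTC"  # assumption: adjust to the time zone of the real data
  )
}
A$ts <- make_ts(A)
B$ts <- make_ts(B)

# Vectorised comparison of columns (no loop, no quoted column names):
# 1 where A's timestamp occurs in B, 0 otherwise
A$dummy <- as.integer(A$ts %in% B$ts)
```

Unlike the loop in the question, this compares the actual columns rather than strings, and assigns a whole vector in one step.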

2 Answers

Here is a solution using tidyverse (including code that generates example data).

# Create data frame A
set.seed(42)  # For reproducibility
n_rows_A <- 100

df_A <- data.frame(
  date = rep(Sys.Date(), n_rows_A),
  sym = rep("EUR_ab6m_05y", n_rows_A),
  hour = sample(0:23, n_rows_A, replace = TRUE),
  minute = sample(0:59, n_rows_A, replace = TRUE),
  second = sample(0:59, n_rows_A, replace = TRUE)
)

# Create data frame B with 30 rows
n_rows_B <- 30

# Select 15 random rows from df_A
matching_indices <- sample(n_rows_A, n_rows_B / 2)
df_B_matching <- df_A[matching_indices, ]

# Create 15 rows with different 'second' values
non_matching_indices <- sample(setdiff(1:n_rows_A, matching_indices), n_rows_B / 2)
df_B_non_matching <- df_A[non_matching_indices, ]
df_B_non_matching$second <- sample(0:59, n_rows_B / 2, replace = TRUE)

# Combine matching and non-matching rows for data frame B
df_B <- rbind(df_B_matching, df_B_non_matching)

rownames(df_B) <- NULL

# EXACTLY MATCHING SECONDS
library(dplyr)  # provides %>%, mutate(), left_join() and distinct()

# Create dummy variable indicating whether there is an exact match between date,
# sym, hour, minute and **second** in A, and date, sym, hour, minute and
# **second** in B. distinct() guards against duplicated rows in B, which would
# otherwise duplicate the matching rows of A:
df_A <- dplyr::left_join( df_A
                        , df_B %>% distinct() %>% mutate( dummy = 1 )
                        , by = c( "date", "sym", "hour", "minute", "second")
) %>% 
  mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )

It is fully possible to use the same approach to create a dummy variable indicating whether a row in A has a matching row in B within a +/- 5 second margin. However, based on the data in the image you provided, some rows in A would probably have multiple "close-enough" matches in B. This could easily result in duplicates of rows from A.

Assuming you want to keep A at its original number of rows and add a column indicating whether one or more rows in B match within a +/- 5 second margin, you could do this:

# ALTERNATIVE 2 - SECONDS WITHIN +/- 5 

library(dplyr)
library(lubridate)

# If the exact-match step above has already been run, drop its dummy column
# first so that the final join does not produce dummy.x / dummy.y:
df_A <- df_A %>% select( -any_of("dummy") )

df_A <- dplyr::inner_join( df_A
                           , df_B 
                           , by = c( "date", "sym") # join on date and sym only
                           , suffix = c("", "_b")
) %>% 
  mutate( 
    datetime_a = ymd_hms( paste0( date, " "
                                  , sprintf("%02d", hour), ":"
                                  , sprintf("%02d", minute), ":"
                                  , sprintf("%02d", second)
    ))
    , datetime_b = ymd_hms( paste0( date, " "
                                    , sprintf("%02d", hour_b), ":"
                                    , sprintf("%02d", minute_b), ":"
                                    , sprintf("%02d", second_b)
    ) )
    , diff_seconds = abs( as.integer( difftime( datetime_a, datetime_b, units = "secs") ) )
    
  ) %>% 
  # Remove rows where the difference between datetime_a and datetime_b is
  # greater than 5 seconds:
  filter( diff_seconds <= 5 ) %>%
  # Then add the dummy variable:
  mutate( dummy = 1 ) %>% 
  # Remove any unnecessary columns:
  select( date, sym, hour, minute, second, dummy ) %>% 
  # Keep one row per key, so that a row in A with several close matches in B
  # is not duplicated by the join below:
  distinct() %>% 
  # Add the dummy column to A using a left_join:
  dplyr::left_join( df_A, ., by = c("date", "sym", "hour", "minute", "second")) %>% 
  # Rows without any matches have dummy = NA. Recode:
  mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )
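
Since B has only 30 rows, the +/- 5 second flag could also be computed without any join, which avoids the duplicate-row issue altogether. The following is a minimal base R sketch under assumptions, not a drop-in replacement: the toy data, the helper `make_ts()`, the names `ts_A`, `ts_B` and `near`, and the UTC time zone are illustrative only, and `sym` is left out for brevity.

```r
# Toy data (hypothetical values): three rows close together and one far away
A <- data.frame(
  date   = as.Date("2023-08-25"),
  hour   = c(4L, 4L, 4L, 10L),
  minute = c(14L, 14L, 14L, 30L),
  second = c(5L, 9L, 30L, 0L)
)
B <- data.frame(
  date = as.Date("2023-08-25"), hour = 4L, minute = 14L, second = 8L
)

# Hypothetical helper: timestamp as seconds since the epoch
make_ts <- function(df) {
  as.numeric(as.POSIXct(
    sprintf("%s %02d:%02d:%02d", format(df$date), df$hour, df$minute, df$second),
    tz = "UTC"  # assumption: adjust to the time zone of the real data
  ))
}
ts_A <- make_ts(A)
ts_B <- make_ts(B)

# One vectorised pass over A per row of B: a row of A is flagged if it lies
# within 5 seconds of at least one timestamp in B
near <- logical(nrow(A))
for (t in ts_B) {
  near <- near | abs(ts_A - t) <= 5
}
A$dummy <- as.integer(near)
```

Because the comparison is done on full timestamps, this also handles matches that straddle a minute or hour boundary, and A always keeps its original number of rows.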
E. Nygaard
  • This is a very nice solution, but I still get multiple results in dummy. My problem is that df_B contains certain "second" values that are not present in df_A. Ideally, wherever a "second" in df_B is missing from df_A, a new matching row would be created in df_A. Example: df_B has 04:14:08 and df_A has 04:14:05 and 04:14:09 instead. This code would give dummy = 1 to both rows in df_A. What I would like instead is to create a new row with time = 04:14:08, give it all the values present in 04:14:05, and give dummy = 1 only to the new 04:14:08 row. How could I do this? – Mining Aug 28 '23 at 14:03
  • @Mining You should edit your question so that it contains a sample of the actual data you are working with and expresses clearly what it is that you want to achieve. Some more context would also be helpful and make it easier to understand how the code should work. I don't understand the logic behind creating a new row and populating it with data from the row with timestamp 04:14:05, but it might make more sense if you provided more context. – E. Nygaard Aug 28 '23 at 22:11
  • Hey @E.Nygaard! Very simply: my df_B contains the date and time at which a certain event happened. I want to see how this event affected the different syms. The problem is that the syms don't have a row for every single second from 2020 to 2023, but rather a row every 2 or 3 seconds or so, because "if a second is skipped it means the values are the same as in the second before." Therefore, I need to give dummy = 1 to the right second, and if it is not there I need to create it. – Mining Aug 29 '23 at 06:23
  • @Mining Comments are meant for leaving constructive criticism or requesting clarification. You are now asking something that is significantly different from your original question. I suggest that you either 1) accept my answer as the best solution and post a new question, with code that generates example data, where your "new" question is clearly formulated, or 2) edit your original post so that it states what your question actually is, and also includes code that generates example data. – E. Nygaard Aug 31 '23 at 12:30
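
For what it is worth, the follow-up described in the comments (create a row at the event time, filled with the values of the last row before it) might be sketched in base R like this. It is only an illustration under assumptions: a single sym, a ready-made timestamp column `ts`, a made-up `value` column, and the names `events`, `missing_ev` and `A_filled` are all hypothetical. With several syms the same steps would have to be applied per sym.

```r
# Toy series for one sym, with gaps between rows (hypothetical values)
A <- data.frame(
  ts    = as.POSIXct(c("2023-08-25 04:14:05", "2023-08-25 04:14:09"), tz = "UTC"),
  value = c(1.10, 1.25)
)
# Event times from B: the first has no row in A, the second does
events <- as.POSIXct(c("2023-08-25 04:14:08", "2023-08-25 04:14:09"), tz = "UTC")

A <- A[order(A$ts), ]  # findInterval() below needs sorted timestamps

# Flag the rows of A that coincide exactly with an event
A$dummy <- as.integer(as.numeric(A$ts) %in% as.numeric(events))

# Events that have no row in A
missing_ev <- events[!(as.numeric(events) %in% as.numeric(A$ts))]

# Index of the last row of A at or before each missing event (0 if there is none)
idx  <- findInterval(as.numeric(missing_ev), as.numeric(A$ts))
keep <- idx > 0

# Copy that row, stamp it with the event time and flag it
new_rows       <- A[idx[keep], ]
new_rows$ts    <- missing_ev[keep]
new_rows$dummy <- 1L

A_filled <- rbind(A, new_rows)
A_filled <- A_filled[order(A_filled$ts), ]
rownames(A_filled) <- NULL
```

In this toy case the new 04:14:08 row carries the values of the 04:14:05 row and is the only inserted row; the existing 04:14:09 row is flagged because it matches an event exactly.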

I think you could use dplyr and lubridate functions. Try this:

## load the libraries
library(lubridate)
library(dplyr)
library(stringr)  # provides str_glue(), used below

## create df A
dfA <- data.frame(
  date = seq(as.Date("2023-08-25"), by = "days", length.out = 100),
  sym = rep("EUR_ab6m_05y", 100),
  hour = sprintf("%02d", rep(0:23, length.out = 100)),
  minute = sprintf("%02d", rep(0:59, length.out = 100)),
  second = sprintf("%02d", rep(0:59, length.out = 100))
)

## create data.frame B (just a selected part of df A)
dfB <- dfA[c(1:10), ]

## Transform variables to datetime
dfA %>% 
  mutate(new_date = str_glue('{date} {hour}:{minute}:{second}') %>% ymd_hms()) -> dfA

dfB %>% 
  mutate(new_date = str_glue('{date} {hour}:{minute}:{second}') %>% ymd_hms()) -> dfB
dfB$new_date -> list_dates

## Verify the condition
dfA %>% 
  mutate(dummy = ifelse(new_date %in% list_dates, 1, 0)) -> dfA

Most of the code is only there to create the example data frames.