Given your description of the problem, the following should work:
```r
library(dplyr)
library(stats)  # for setNames (attached by default in most sessions)

# df is the data.frame (see below)
df <- cbind(ID = seq_len(nrow(df)), df)
r.stolen <- which(df$is_stolen == 1)
r.not    <- which(df$is_stolen != 1)

print(df[rep(r.not, times = length(r.stolen)), ] %>%
        setNames(., paste0(names(.), "_not")) %>%
        bind_cols(df[rep(r.stolen, each = length(r.not)), ], .) %>%
        mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not &
                                     time_of_day == time_of_day_not &
                                     day_of_week == day_of_week_not &
                                     lat_dec >= lat_min_not & lat_dec <= lat_max_not &
                                     lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%
        group_by(ID) %>%
        summarise(count = sum(in_range)) %>%
        arrange(desc(count)))
```
The first line just adds a column named `ID` to `df` that identifies each row by its row number, which we can later `dplyr::group_by` to make the count.
The next two lines divide the rows into stolen and not-stolen cars. The key is to:

1. replicate each row of stolen cars N times, where N is the number of not-stolen car rows,
2. replicate the rows of not-stolen cars (as a block) M times, where M is the number of stolen car rows, and
3. append the result of (2) to (1) as new columns, changing the names of these new columns so that we can reference them in the condition.
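The pairing mechanics of steps (1) and (2) can be sketched on toy vectors (hypothetical labels standing in for rows, just to show how `each` and `times` line up):

```r
# Two "stolen" labels and three "not-stolen" labels stand in for rows.
stolen     <- c("a", "b")
not_stolen <- c("x", "y", "z")

# Step (1): each stolen label is repeated once per not-stolen label
rep(stolen, each = length(not_stolen))    # "a" "a" "a" "b" "b" "b"

# Step (2): the not-stolen block is repeated once per stolen label
rep(not_stolen, times = length(stolen))   # "x" "y" "z" "x" "y" "z"

# Read element-wise, the two vectors enumerate all 2 x 3 = 6 pairs.
```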
The result of (3) has rows that enumerate all pairs of stolen and not-stolen rows from the original data frame, so that your condition can be applied in an array fashion. The `dplyr` piped workflow that forms the fourth statement of the code (wrapped in a `print()`) does this:
1. the first command replicates the not-stolen car rows using `times`,
2. the second command appends `_not` to the column names to distinguish them from the stolen car columns when we bind the columns (thanks to this SO answer for that gem),
3. the third command replicates the stolen car rows using `each` and appends the previous result as new columns using `dplyr::bind_cols`,
4. the fourth command uses `dplyr::mutate` to create a new column named `in_range` that is the result of applying the condition; the boolean result is converted to `{0, 1}` to allow for easy accumulation,
5. the rest of the commands in the pipe count `in_range` grouped by `ID` and arrange the results in decreasing order of the count.

Note that `ID` now identifies the rows of the original data frame for which `is_stolen = 1`, whereas `ID_not` does so for the rows with `is_stolen = 0`.
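The tail of the pipe (coercing the logical result to 0/1, then grouping and summing) can be seen in isolation on a hypothetical toy frame:

```r
library(dplyr)

# Two pair-comparison results per ID; TRUE/FALSE coerced to 1/0 for summing
toy <- data.frame(ID       = c(1, 1, 2, 2),
                  in_range = as.numeric(c(TRUE, TRUE, FALSE, TRUE)))

toy %>%
  group_by(ID) %>%
  summarise(count = sum(in_range)) %>%
  arrange(desc(count))
# ID 1 gets count 2, ID 2 gets count 1
```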
This assumes that you want the count for each row with `is_stolen = 1` in the original data frame, which is what you said in your question. If instead you really want the count for each `telematic_trip_no` that is stolen, then you can use `group_by(telematic_trip_no)` in the pipe instead.
I've tested this using the following data snippet:

```r
df <- structure(list(position_time = structure(c(1L, 1L, 1L, 2L, 3L,
    4L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("2016-06-05 00:00:01",
    "2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19",
    "2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23",
    "2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"
    ), class = "factor"), telematic_trip_no = c(526132109L, 526028387L,
    526081476L, 526140512L, 526140518L, 526006880L, 526017880L, 526027880L,
    526006880L, 526006890L, 526106880L, 526005880L, 526007880L),
    lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531,
    -26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007,
    -26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263,
    27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493,
    27.87, 27.87), is_stolen = c(0L, 0L, 0L, 0L, 0L, 0L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L), hour_of_day = c(0L, 0L, 0L, 0L,
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), time_of_day = c(0L,
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 9L, 0L), day_of_week = structure(c(2L,
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Monday",
    "Sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412,
    -26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502,
    -26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631,
    -26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53,
    -26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743,
    27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714,
    27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723,
    27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694,
    27.848, 27.8694, 27.848, 27.8694, 27.8694)), .Names = c("position_time",
    "telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day",
    "time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max",
    "lon_min"), class = "data.frame", row.names = c(NA, -13L))
```
Here, I appended 7 new rows with `is_stolen = 1` to your original 6 rows that all have `is_stolen = 0`:
- the first added row with `telematic_trip_no = 526005880` violates the longitude condition for all not-stolen rows, so its count should be 0
- the second added row with `telematic_trip_no = 526006880` violates the latitude condition for all not-stolen rows, so its count should be 0
- the third added row with `telematic_trip_no = 526007880` violates the `telematic_trip_no` condition for all not-stolen rows, so its count should be 0
- the fourth added row with `telematic_trip_no = 526006890` satisfies the condition for not-stolen rows 4 and 5, so its count should be 2
- the fifth added row with `telematic_trip_no = 526106880` satisfies the condition for not-stolen row 6, so its count should be 1
- the sixth added row with `telematic_trip_no = 526017880` violates the `time_of_day` condition for all not-stolen rows, so its count should be 0
- the seventh added row with `telematic_trip_no = 526027880` violates the `day_of_week` condition for all not-stolen rows, so its count should be 0
Running the code on this data gives:

```
# A tibble: 7 x 2
      ID count
   <int> <dbl>
1     10     2
2     11     1
3      7     0
4      8     0
5      9     0
6     12     0
7     13     0
```
which is as expected, recalling that the appended rows with `is_stolen = 1` start at row 7 with `ID = 7`.
If one were to group by `telematic_trip_no` instead, we get the result:

```
# A tibble: 7 x 2
  telematic_trip_no count
              <int> <dbl>
1         526006890     2
2         526106880     1
3         526005880     0
4         526006880     0
5         526007880     0
6         526017880     0
7         526027880     0
```
As a caveat, the above approach does cost memory: in the worst case the number of rows grows to N^2/4, where N is the number of rows in the original data frame, and the number of columns doubles for the data frame that is used to evaluate the condition. As with most array-processing techniques, there is a trade-off between speed and memory.
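If memory becomes the bottleneck, one alternative is to loop over the stolen rows and evaluate the condition against the not-stolen block one stolen row at a time, so the full pair table is never materialised (at the cost of R-level iteration). A minimal sketch, using the question's column names but invented toy values:

```r
# Toy data: row 1 is not-stolen; rows 2-3 are stolen (values invented)
df <- data.frame(telematic_trip_no = c(1, 2, 3),
                 time_of_day = c(0, 0, 0),
                 day_of_week = c("Sunday", "Sunday", "Sunday"),
                 lat_dec = c(-26.53, -26.53, -26.50),
                 lon_dec = c(27.87, 27.87, 27.85),
                 lat_min = c(-26.54, -26.54, -26.51),
                 lat_max = c(-26.52, -26.52, -26.49),
                 lon_min = c(27.86, 27.86, 27.84),
                 lon_max = c(27.88, 27.88, 27.86),
                 is_stolen = c(0, 1, 1))
r.stolen <- which(df$is_stolen == 1)
not_df   <- df[df$is_stolen != 1, ]

# One count per stolen row; only an N-length logical vector lives at a time
counts <- vapply(r.stolen, function(i) {
  s <- df[i, ]
  sum(s$telematic_trip_no != not_df$telematic_trip_no &
      s$time_of_day      == not_df$time_of_day &
      s$day_of_week      == not_df$day_of_week &
      s$lat_dec >= not_df$lat_min & s$lat_dec <= not_df$lat_max &
      s$lon_dec >= not_df$lon_min & s$lon_dec <= not_df$lon_max)
}, numeric(1))
counts  # row 2 matches the not-stolen row (count 1); row 3 fails on latitude (count 0)
```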
Hope this helps.