Data wrangling problem with labelled sound files

Question

Let's say I have a large dataframe with a column for 'soundfile' and then 'start and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

## setup example dataframe
id <- c("soundfile_1","soundfile_2","soundfile_3")
sound_df<-data.frame(rep(id, each = 2), c("0","8.0","3.3","11.7","4.6","13.1"), c("3.2","14.1","3.8","12.8","5.9","14.8"))
names(sound_df)[1] <- "soundfile"
names(sound_df)[2] <- "sound_start" 
names(sound_df)[3] <- "sound_end"
print(sound_df)

    soundfile sound_start sound_end
1 soundfile_1           0       3.2
2 soundfile_1         8.0      14.1
3 soundfile_2         3.3       3.8
4 soundfile_2        11.7      12.8
5 soundfile_3         4.6       5.9
6 soundfile_3        13.1      14.8

I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.

The results produced for sound_df should be as follows:

     soundfile start end present
1  soundfile_1     0   3     yes
2  soundfile_1     3   6     yes
3  soundfile_1     6   9     yes
4  soundfile_1     9  12     yes
5  soundfile_1    12  15     yes
6  soundfile_2     0   3      no
7  soundfile_2     3   6     yes
8  soundfile_2     6   9      no
9  soundfile_2     9  12     yes
10 soundfile_2    12  15     yes
11 soundfile_3     0   3      no
12 soundfile_3     3   6     yes
13 soundfile_3     6   9      no
14 soundfile_3     9  12      no
15 soundfile_3    12  15     yes

score 1 · Accepted Answer · answered May 29 '23 at 13:07

Sounds like a case for a data.table non-equi join.

Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

library(data.table)
setDT(sound_df)
sound_df[, `:=`(
    sound_start_for_join = as.numeric(sound_start),
    sound_end_for_join = as.numeric(sound_end)
)]

Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

CHUNK_SECONDS <- 3
FILE_SECONDS <- 15
time_windows <- CJ(
    soundfile = sound_df$soundfile,
    start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
    unique = TRUE
)[, end := start + CHUNK_SECONDS][
    ,
    `:=`(
        start_for_join = start,
        end_for_join = end
    )
]

time_windows

#       soundfile start   end start_for_join end_for_join
#          <char> <num> <num>          <num>        <num>
#  1: soundfile_1     0     3              0            3
#  2: soundfile_1     3     6              3            6
#  3: soundfile_1     6     9              6            9
#  4: soundfile_1     9    12              9           12
#  5: soundfile_1    12    15             12           15
# <etc>

Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

sound_out <- sound_df[
    time_windows,
    on = .(
        soundfile,
        sound_start_for_join < end_for_join,
        sound_end_for_join > start_for_join
    )
][, .(
    soundfile, start, end,
    present = !is.na(sound_start)
)]

Which produces the desired output:

      soundfile start   end present
         <char> <num> <num>  <lgcl>
 1: soundfile_1     0     3    TRUE
 2: soundfile_1     3     6    TRUE
 3: soundfile_1     6     9    TRUE
 4: soundfile_1     9    12    TRUE
 5: soundfile_1    12    15    TRUE
 6: soundfile_2     0     3   FALSE
 7: soundfile_2     3     6    TRUE
 8: soundfile_2     6     9   FALSE
 9: soundfile_2     9    12    TRUE
10: soundfile_2    12    15    TRUE
11: soundfile_3     0     3   FALSE
12: soundfile_3     3     6    TRUE
13: soundfile_3     6     9   FALSE
14: soundfile_3     9    12   FALSE
15: soundfile_3    12    15    TRUE

Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of "yes" and "no" you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), "yes", "no").

That's brilliant, thanks - yes of course a logical vector is easier to work with. — davidj444, May 29 '23 at 13:32

Data wrangling problem with labelled sound files

1 Answers1