How to clean large datasets using for loops

Question

I've been banging my head against the wall with this problem, and I would love your help. To illustrate, I've created two dummy datasets (data A and data B).

dataA <- data.frame(TimeofLife = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                    TimeofDeath = c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18))
dataB <- data.frame(Time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                    Value = c(500, 510, 520, 530, 540, 500, 510, 520, 530, 540, 500, 510, 520, 530, 540))

Below are the first 6 values of dataA

  TimeofLife TimeofDeath
1          1           4
2          2           5
3          3           6
4          4           7
5          5           8
6          6           9

and dataB

   Time Value
1    1   500
2    2   510
3    3   520
4    4   530
5    5   540
6    6   500

I want to use the information I have in dataA to create a subset in dataB. For example, the first row of dataA has the values (1, 4). Using these as the range of values, I want to create a subset for dataB that looks like this:

  Time Value
1    1   500
2    2   510
3    3   520
4    4   530

Now, the issue that I am having is that in my real dataset, I need to create many subsets (thousands+), and I cannot do this manually. That is, I cannot hard-code for each subset because that'd be a nightmare and error-prone.

What I'd like to do is create a for loop which I can use to extract information that I need from dataA and use it to create subsets in dataB.

The logic of the code I want looks something like this (this is nonsense, bc I don't know how to code it):

for(row i in dataA)
  {find the values of TimeofLife and Time of Death in row dataA[i],
  then use those values to filter a subset of dataB, 
  extract and save this subset as a dataframe}

Thank you so much.

Please see how to provide a [reproducible minimal example](https://stackoverflow.com/q/5963269/8107362). Especially, it is much easier to help if you provide some ready to use sample data, e.g. with `dput()`. Or, evem better. try to completely abstract your problem on a dummy data set. More often than not, you will find a solution if you try to abstract your problem — mnist, May 22 '20 at 23:42
Okay, rewrote the question. Hopefully that suffices? Thanks for letting me know btw. I'm new to this community. — Drew, May 23 '20 at 00:23

score 2 · Accepted Answer · answered May 23 '20 at 00:43

2

Consider element wise looping with mapply or its wrapper Map to build a list of data frames:

proc_subset <- function(x, y) {
   dataB[dataB$Time >= x & dataB$Time <= y,]
}

# BUILD LIST OF SUBSET DATA FRAMES
subset_list <- Map(proc_subset, dataA$TimeofLife, dataA$TimeofDeath)

# NAME EACH ELEMENT
data_names <- paste("data", dataA$TimeofLife, dataA$TimeofDeath, sep="_")
subset_list <- setNames(subset_list, data_names)


# OUTPUT EACH DF ELEMENT
subset_list$data_1_4
subset_list$data_2_5
subset_list$data_3_6
...

answered May 23 '20 at 00:43

Parfait

104,375
17
94
125

This has been so incredibly helpful. You've saved me from a lot of grief. Thank you so much! – Drew May 25 '20 at 19:03
No problem. Great to hear and glad to help! – Parfait May 26 '20 at 00:23

score 1 · Answer 2 · answered May 23 '20 at 01:16

If you allow packages, data.table allows non-equi joins. This utilizes by = .EACHI to group by each row as indicated.

library(data.table)
setDT(dataA)
setDT(dataB)

dataB[dataA,
      on = .(Time >= TimeofLife,
             Time <= TimeofDeath),
      j = .(Values = .(Value)),
      by = .EACHI,
      allow.cartesian = TRUE
      ]

#     Time  Time          Values
#    <num> <num>          <list>
# 1:     1     4 500,510,520,530
# 2:     2     5 510,520,530,540
# 3:     3     6 520,530,540,500
# 4:     4     7 530,540,500,510
# 5:     5     8 540,500,510,520
# 6:     6     9 500,510,520,530
# 7:     7    10 510,520,530,540
# 8:     8    11 520,530,540,500
# 9:     9    12 530,540,500,510
#10:    10    13 540,500,510,520
#11:    11    14 500,510,520,530
#12:    12    15 510,520,530,540
#13:    13    16     520,530,540
#14:    14    17         530,540
#15:    15    18             540

How to clean large datasets using **for** loops

2 Answers2

How to clean large datasets using for loops