For loop for selecting certain data to form a new data frame

Question

First of all, I am using the ukpolice library in R and have extracted data into a new data frame called crimes. Now I am running into a new problem: I am trying to extract certain data into a new, empty data frame called df.shoplifting. If the category of the crime is equal to "shoplifting", it needs to add the id, month and street name to the new data frame. I need to use a loop and an if statement together.

EDIT: Currently I have this working, but it lacks the IF statement:

for (i in crimes$category) {
 shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
 names(shoplifting) <- c("ID", "Month", "Street_Name")
}

What I am trying to do:

for (i in crimes$category) {
if(crimes$category == "shoplifting"){
data1 <- subset(crimes, category == i, select = c(id, month, street_name))
  }
}

It does run and creates the new data frame data1. But the data that it extracts is wrong and does not include only items with the shoplifting category.

score 0 · Answer 1 · edited Jan 24 '21 at 15:30

Try:

df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),]

Using a for loop in this instance will work, but when working in R you want to stick to vectorized operations if you can.

This operation subsets the crimes dataframe and selects rows where the category column is equal to shoplifting. It is not necessary to convert the category column into a factor - you can match the string with the == operator.

Note the comma after the which(...) call, inside the square brackets. The which function returns the indices (row numbers) that meet the criteria. The comma separates the row selection from the column selection, and leaving the column part empty tells R that you want all of the columns. If you wanted to select only a few columns you could do:

df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c("id","month","street_name")]

OR you could select the columns by their position (I don't have your data so I don't know the positions... but if id, month and street_name were the first three columns, you could use 1, 2, 3).

df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c(1,2,3)]
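
To try the row/column indexing without the ukpolice data, here is a minimal sketch on a made-up crimes frame. The values are invented; only the column names (category, id, month, street_name) are taken from the question, and their order here is an assumption.

```r
# Toy stand-in for the real crimes data; all values are invented.
crimes <- data.frame(
  category    = c("shoplifting", "burglary", "shoplifting", "robbery"),
  id          = c(101, 102, 103, 104),
  month       = c("2020-11", "2020-11", "2020-12", "2020-12"),
  street_name = c("High Street", "Mill Lane", "Park Road", "High Street"),
  stringsAsFactors = FALSE
)

# Rows: where category matches. Columns: picked by name.
df.shoplifting <- crimes[which(crimes$category == "shoplifting"),
                         c("id", "month", "street_name")]

# Same rows, columns picked by position. In this toy frame category is
# column 1, so id, month and street_name are columns 2, 3 and 4.
by_position <- crimes[which(crimes$category == "shoplifting"), c(2, 3, 4)]

print(df.shoplifting)
```

Selecting by name is usually the safer choice, since positions change as soon as a column is added or reordered.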

Thanks, I understand... and this does work. But the problem is that I HAVE to use a loop and an if statement. — Django, Jan 25 '21 at 13:22

score 0 · Accepted Answer · answered Jan 24 '21 at 15:25

I'll guess, and update if needed based on your question edits.

- rbind works only on data.frame and matrix objects, not on vectors. If you want to extend a vector (N.B., one that is not part of a frame or a column/row of a matrix), you can merely extend it with c(somevec, newvals) ... but I think that this is not what you want here.
- You are iterating through each value of crimes$category, but if one category matches, then you are appending all data within crimes. I suspect you mean to subset crimes when adding. We'll address this in the next bullet.
- One cannot extend a single column of a multi-column frame in the absence of the others. A data.frame has a restriction that all columns must always have the same length, and extending one column defeats that. (And doing all columns immediately-sequentially does not satisfy that restriction.)

One way to work around this is to rbind a just-created data.frame:
```
# i = "shoplifting"
newframe <- subset(crimes, category == i, select = c(id, month, street_name))
names(newframe) <- c("ID", "Month", "Street_Name") # match df.shoplifting names
df.shoplifting <- rbind(df.shoplifting, newframe)
```
I don't have the data, but if crimes$category ever has repeats, you will re-add all of the same-category rows to df.shoplifting. This might be a problem with my assumptions, but is likely not what you really need.

If you really just need to do it once for a category, then do this without the need for a for loop:
```
df.shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
# optional
names(df.shoplifting) <- c("ID", "Month", "Street_Name")
```
Iteratively adding rows to a frame is a bad idea: while it works okay for smaller datasets, as your data scales, the performance worsens. Why? Because each time you add rows to a data.frame, the entire frame is copied into a new object. It's generally better to form a list of frames and then concatenate them all later (c.f., https://stackoverflow.com/a/24376207/3358227).
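
As a rough sketch of that list-then-concatenate pattern (the toy crimes data and the `wanted` vector of categories are made up for illustration, not taken from the question):

```r
# Toy stand-in for the real crimes data; all values are invented.
crimes <- data.frame(
  category    = c("shoplifting", "burglary", "shoplifting", "robbery"),
  id          = c(101, 102, 103, 104),
  month       = c("2020-11", "2020-11", "2020-12", "2020-12"),
  street_name = c("High Street", "Mill Lane", "Park Road", "High Street"),
  stringsAsFactors = FALSE
)

# Hypothetical set of categories we want to collect.
wanted <- c("shoplifting", "robbery")

# Store one small frame per category in a pre-allocated list ...
pieces <- vector("list", length(wanted))
for (k in seq_along(wanted)) {
  pieces[[k]] <- subset(crimes, category == wanted[k],
                        select = c(id, month, street_name))
}

# ... and bind them together once, at the end.
combined <- do.call(rbind, pieces)
print(combined)
```

The frame is only built once here, instead of being copied on every pass through the loop.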

On this note, if you need one frame per category, you can get that simply with:
```
df_split <- split(df, df$category)
```
and then operate on each category as its own frame by working on a specific element within the df_split named list (e.g., df_split[["shoplifting"]]).
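
A small sketch of what that looks like in practice, again on invented data (only the column names come from the question):

```r
# Toy stand-in for the real crimes data; all values are invented.
crimes <- data.frame(
  category    = c("shoplifting", "burglary", "shoplifting", "robbery"),
  id          = c(101, 102, 103, 104),
  month       = c("2020-11", "2020-11", "2020-12", "2020-12"),
  street_name = c("High Street", "Mill Lane", "Park Road", "High Street"),
  stringsAsFactors = FALSE
)

# One frame per category, stored in a named list.
df_split <- split(crimes, crimes$category)

# The list names are the distinct categories.
print(names(df_split))

# Pull out a single category and keep only the columns of interest.
shop <- df_split[["shoplifting"]][, c("id", "month", "street_name")]
print(shop)
```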
And lastly, depending on the analysis you're doing, it might still make sense to keep it all together. Both the dplyr and data.table dialects of R make doing calculations on data within groups very intuitive and efficient.

Thanks, one thing I forgot to say is that I need to use a FOR loop with an IF statement together. Bullet point 3 did indeed work with minor tweaks but now I lack the IF statement, I will edit my question and add my current code. — Django, Jan 25 '21 at 13:14
I neither understand the premise of using a `for` loop to iteratively filter/reassign multiple rows of limited categories, nor the need for a `for` loop. Is this homework? If so, I have answered with a canonical-R way of doing it; using a `for` loop is broken in several ways, so possibly I don't understand what you need, *you* don't understand what you need, the assignment is poorly worded, or the instructor either mis-framed the problem or does not understand R. (I'm not trying to be blasphemous to the instructor.) — r2evans, Jan 25 '21 at 13:28
That's what I was afraid of. In *your* `for` loop, you are adding "**all** matching rows", not "*this* matching row". Consider iterating over the row number (`seq_len(nrow(crimes))`) instead of the `$category`, and testing each `crimes$category[ind]` individually. When you found a match, create a new (temporary) dataframe, then `rbind` that 1-row temp frame to the `shoplifting` frame. Hope that helps. — r2evans, Jan 25 '21 at 13:53
Thanks man, I appreciate your help. Iterating over the rows makes sense now! I just used rbind after the if statement and it works for me, the only problem is that it does duplicate after I run it again. — Django, Jan 25 '21 at 14:27
*"run it again"* ... are you thinking that it will only add a row if that row has not been previously added to `shoplifting`? — r2evans, Jan 25 '21 at 14:29
You have two different methods of logic going on here: (1) Find rows that are "shoplifting" and add to another; I think we've resolved this by iterating per-row and using `rbind` (albeit inefficient); (2a) Only add that row if missing; **or** (2b) Remove duplicate rows. #2a can be done by checking `!crimes$id[ind] %in% shoplifting$ID`. #2b can be done (even less efficiently) with `shoplifting <- shoplifting[!duplicated(shoplifting),]`. — r2evans, Jan 25 '21 at 17:25
But at this point we're heading into "spiral question" territory. I suggest you accept one of the provided answers, then work on your code with this last discussion, and come back (new question) when needed. — r2evans, Jan 25 '21 at 17:26
Yeah I understand. Thanks for all your help, you resolved all of my problems here! — Django, Jan 25 '21 at 17:55
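
For reference, the per-row loop worked out in the comments above might look roughly like this. It is only a sketch under assumptions: the crimes data is invented, the target frame is called shoplifting with the renamed columns ID, Month and Street_Name, and id is assumed to identify a crime uniquely.

```r
# Toy stand-in for the real crimes data; all values are invented.
crimes <- data.frame(
  category    = c("shoplifting", "burglary", "shoplifting", "robbery"),
  id          = c(101, 102, 103, 104),
  month       = c("2020-11", "2020-11", "2020-12", "2020-12"),
  street_name = c("High Street", "Mill Lane", "Park Road", "High Street"),
  stringsAsFactors = FALSE
)

# Start from an empty frame that already has the target column names.
shoplifting <- data.frame(ID = numeric(0), Month = character(0),
                          Street_Name = character(0),
                          stringsAsFactors = FALSE)

# Iterate over row numbers, not over the category values.
for (ind in seq_len(nrow(crimes))) {
  # Test one row at a time; skip ids that were already added (option 2a).
  if (crimes$category[ind] == "shoplifting" &&
      !(crimes$id[ind] %in% shoplifting$ID)) {
    newrow <- data.frame(ID = crimes$id[ind],
                         Month = crimes$month[ind],
                         Street_Name = crimes$street_name[ind],
                         stringsAsFactors = FALSE)
    shoplifting <- rbind(shoplifting, newrow)  # append this one row only
  }
}

print(shoplifting)
```

Because of the `%in%` check, running the loop a second time against the same shoplifting frame should not add the rows again. As noted in the answer, this is still much slower than a single subset call on larger data.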
