-3

I have a huge data set with 500 000 entries like the one below.

Country Region      Code Product name Year Value
Sweden  Stockholm   123  Apple        1991 244   
Sweden  Kirruna     123  Apple        1987 100
Japan   Kyoto       543  Pie          1987 544
Denmark Copenhagen  123  Apple        1998 787
Denmark Copenhagen  123  Apple        1987 100
Denmark Copenhagen  543  Pie          1991 320
Denmark Copenhagen  126  Candy        1999 200
Sweden  Gothenburg  126  Candy        2013 300
Sweden  Gothenburg  157  Tomato       1987 150
Sweden  Stockholm   125  Juice        1987 250
Sweden  Kirruna     187  Banana       1998 310
Japan   Kyoto       198  Ham          1987 157
Japan   Kyoto       125  Juice        1987 550
Japan   Tokyo       125  Juice        1991 100

From that I want to conditionally all the rows that have a code corresponding to the values in a vector with a bunch of code. Like this one:

A <- c(123, 157, 543)

That is, the rows with Apple, Tomato, and Pie should all be removed based on the code in the vector.

I found this solution here to remove a row based on a specific column value: Conditionally Remove Dataframe Rows with R But what I want to do now is to remove rows based on not just one single value, but a vector with several values.

Sotos
  • 51,121
  • 6
  • 32
  • 66
KGB91
  • 630
  • 2
  • 6
  • 24

2 Answers2

3

Using sqldf:

sqldf("select * from df where Code NOT IN (123,157,543)")
Saurabh Chauhan
  • 3,161
  • 2
  • 19
  • 46
  • Ah, you can use SQL-code in R, that was new to me (I started with R this week). Thanks! – KGB91 Sep 19 '18 at 07:29
1

We can use %in% to create a logical vector and negate (!) it within subset (without using any packages)

subset(df1, !Code %in% A)

Or using filter from dplyr

library(dplyr)
df1 %>%
    filter(!Code %in% A)

If we want to work with the new data frame, we can either write the value back over df1 (which will completely replace the previous (unfiltered) version of the data frame) or write it to a new data frame (df2) and work with that. It would be better to opt for writing the values to a new data frame just in case we do something unintended and lose the data

Like this:

library(dplyr)

df2 <- df1 %>%
    filter(!Code %in% A)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! That seems to work just fine. If I then want to work with that new subset data to do other stuff, like summarizing values, do I have to use the `subset` code every time, or is there some way to manipulate the imported data in R so that `df1` itself does no longer contain the values I want to remove? – KGB91 Sep 19 '18 at 07:25
  • @KGB91 I've added some further information to this answer that should help answer your question. – Mus Sep 19 '18 at 09:47
  • 1
    @MusTheDataGuy Thanks a lot! Also sry for double posting, did not find that other post from 2013. – KGB91 Sep 19 '18 at 11:26