I'm an R beginner and would really appreciate your help with a piece of code I'm struggling with...

I have been working with a data set for a while now and after finishing a large chunk of new code I wanted to re-run the script. It all seemed to work fine until I noticed that R no longer recognised the variable names of the data sets I imported (even though none of the code changed and it used to work absolutely fine!).

Here is an overview of the data set I'm using, which I imported from an Excel file:

glimpse(ELFS2)
Rows: 227,727
Columns: 18
Groups: ID [5,208]
$ Cohort        <chr> "Study 2 - Condition 0", "Study 2 - Condition 0", "Study 2 - Condition 0", "Study …
$ ID            <chr> "ID0103", "ID0103", "ID0103", "ID0103", "ID0103", "ID0103", "ID0103", "ID0103", "I…
$ Action        <chr> "AddToTrolley", "AddToTrolley", "AddToTrolley", "AddToTrolley", "AddToTrolley", "A…
$ Quantity      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Product       <chr> "Strawberries 300G", "Organic British Semi Skimmed Milk 1.136L, 2 Pint", "Tilda Ba…
$ Price         <dbl> 2.50, 0.89, 2.00, 3.30, 4.00, 0.70, 0.70, 0.85, 2.50, 1.60, 1.90, 1.00, 20.54, 2.5…
$ EnergyKCAL    <dbl> 125.52, 209.20, 1491.00, 1111.00, 2558.00, 2400.00, 2400.00, 2140.00, 1075.00, 654…
$ EnergyKJ      <dbl> 125.52, 209.20, 1491.00, 1111.00, 2558.00, 2400.00, 2400.00, 2140.00, 1075.00, 654…
$ Fat           <dbl> 0.1, 1.8, 0.8, 20.0, 49.3, 36.0, 36.0, 26.1, 13.0, 4.4, 33.7, 7.1, NA, 0.1, 1.8, 0…
$ SaturatedFat  <dbl> 0.1, 1.1, 0.2, 3.3, 9.8, 21.0, 21.0, 15.6, 4.5, 1.2, 22.2, 1.7, NA, 0.1, 1.1, 0.2,…
$ Carbohydrates <dbl> 6.0, 4.8, 77.7, 0.5, 20.5, 53.0, 53.0, 63.4, 25.0, 21.9, 2.9, 83.0, NA, 6.0, 4.8, …
$ Sugar         <dbl> 6.0, 4.8, 0.5, 0.5, 5.4, 49.0, 49.0, 47.5, 2.1, 4.1, 0.4, 5.2, NA, 6.0, 4.8, 0.5, …
$ Fibre         <dbl> 1.1, 0.0, 1.0, 0.5, 6.1, 0.0, 0.0, 0.0, 0.0, 1.1, 0.6, 2.5, NA, 1.1, 0.0, 1.0, 0.5…
$ Protein       <dbl> 0.8, 3.6, 7.8, 21.5, 19.8, 8.2, 8.2, 6.6, 10.0, 6.5, 21.9, 6.5, NA, 0.8, 3.6, 7.8,…
$ Salt          <dbl> 0.01, 0.10, 0.03, 0.33, 0.86, 0.19, 0.19, 0.59, 0.98, 0.49, 1.40, 2.00, NA, 0.01, …
$ ProductWeight <chr> "300g", "1136ml", "500g", "240g", "350g", "30g", "30g", "43g", "335g", "400g", "15…
$ Approval      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ Approval1     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

I've noticed that when I type a variable name, R still suggests the variable as usual in a drop-down menu. However, when I select the variable from the suggestions, R enters it in quotation marks, as if it were a character string:

ELFS2[, "Approval"]

For the following piece of code it doesn't return any error, but it doesn't perform the task. The code used to create a new variable called Approval1 which would have a '1' whenever there was a '1' in the variable Approval in any of the rows for each participant. Now, it creates the new variable Approval1, but this variable contains only NAs:

ELFS2 <- ELFS2 %>%
  group_by(ID) %>%
  mutate(Approval1 = ifelse(sum(Approval)>0, 1, 0))

The following code should remove all rows for which the variable 'Fat' is not equal to 1. However, when I run the code it returns an error message telling me that the variable is not found at all:

ELFS2.1 <- ELFS2[Fat == 1]
Error in `[.tbl_df`(ELFS2, Fat == 1) : object 'Fat' not found

I thought the variables might not be correctly classified but it all seems correct to me?

The problem relates to all variables as far as I can see. Can anyone make sense of this? I would really appreciate some help! Many many thanks in advance!

Ke_Fr
  • Thanks @r2evans for the quick comment. Sorry if my description wasn't very clear. The data set contains all of the variables; Approval1 is not a separate vector (as you can see in the glimpse output above). The data set was imported from Excel in one go; the only new variable is Approval1. With regard to the data class, I had previously changed it to a data table using data.table(). When I look up the class of the data frame ELFS2, I get the following: `> class(ELFS2) [1] "grouped_df" "tbl_df" "tbl" "data.frame"` – Ke_Fr Sep 16 '20 at 18:21
  • (1) A `glimpse` of a `data.frame` tells us nothing of other variables in the environment. (2) On your use of `ELFS2[Fat == 1]`: in my answer, replace all references to `Approval1` with `Fat` and it still applies perfectly. The error `object 'Fat' not found` supports this. Again (but adapted), try `ELFS2[ELFS2$Fat == 1,]` or `dplyr::filter(ELFS2, Fat == 1)`. – r2evans Sep 16 '20 at 18:45
  • BTW, it looks like you're doing an equality test with floating-point numbers. Try `(1-1e-6) == sqrt(1-1e-6)^2` and read https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal. You may consider using *tolerance*, something akin to `abs(Fat - 1) < 1e-6` instead of `Fat == 1`. (Floating-point equality works perfectly until it doesn't ... but it'll never tell you that it is acting unintuitively, it just ... isn't equal when you think it should be, over-filtering or such.) – r2evans Sep 16 '20 at 19:05
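
A minimal sketch of the tolerance idea from the comment above, using base R only. The vector `fat` and the tolerance `1e-6` are illustrative assumptions, not values taken from the actual data set; the second element is built so that it prints as 1 but is not stored as exactly 1.

```r
# Hypothetical stand-in for the Fat column (not the real data).
# sqrt(2)^2 is stored as a value slightly above 2, so subtracting 1
# gives a number that is very close to, but not exactly, 1.
fat <- c(1, sqrt(2)^2 - 1, 0.5)

# Exact comparison: misses the second element.
exact_match <- fat == 1

# Tolerance comparison: treats anything within 1e-6 of 1 as equal.
tol <- 1e-6
near_match <- abs(fat - 1) < tol

print(exact_match)
print(near_match)
```

The right tolerance depends on how the values were produced, so `1e-6` here is only a placeholder.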

1 Answer

ELFS2[Approval1 == 1] will only work in one of two situations:

  1. It is a data.table with a column named Approval1. Seems unlikely given your question (and the error's reference to `[.tbl_df`).

  2. There exists both ELFS2 (which is your frame) and a variable named Approval1 in the environment or search path. It seems likely that Approval1 is a vector defined so that you could then create your ELFS2 frame, as in

    Approval1 <- c(1, 0, 0, 1)
    ELFS2 <- data.frame(Approval1)
    

    This non-framed variable is not updated anywhere else, so if/when ELFS2 changes (filters, mutates, reorders), Approval1 references are no longer paired correctly. For instance,

    ELFS2 %>%
      filter(...) %>%
       ***
    

    On line 1, all(ELFS2$Approval1 == Approval1). However, since line 2 changes the effective frame, by the time we get to line 3 the effective frame may no longer have the same number of rows as the original ELFS2. This is equivalent to trying to pair c(1,2,4) with c(1,2,3,4): they have unequal lengths, so the pairing no longer holds. The same is true if there has been any reordering or if the frame's Approval1 content changes.
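
The mismatch described in situation 2 can be sketched with toy data. The values below and the name `filtered` are made up for illustration and assume dplyr is installed; only the names ELFS2, Approval1, and Fat come from the question.

```r
library(dplyr)

# Free-standing vector in the environment (hypothetical values).
Approval1 <- c(1, 0, 0, 1)

# Frame built from that vector, plus a made-up Fat column.
ELFS2 <- tibble(Approval1 = Approval1, Fat = c(1, 2, 1, 3))

# Filtering keeps rows 1 and 3 of the frame only.
filtered <- ELFS2 %>% filter(Fat == 1)

# The frame shrank, but the free-standing vector did not.
n_rows_filtered <- nrow(filtered)    # rows left in the frame
n_vector        <- length(Approval1) # still the original length

# The column inside the frame stays aligned with its rows.
col_in_frame <- filtered$Approval1

print(n_rows_filtered)
print(n_vector)
print(col_in_frame)
```

Referring to the column through the frame (as `filtered$Approval1`, or by bare name inside a dplyr verb) avoids the mismatch, because the column is filtered together with its rows.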

Perhaps you meant either ELFS2[ELFS2$Approval1 == 1,] or dplyr::filter(ELFS2, Approval1 == 1)? Both work with tibbles and data.frame-class objects that have a column named Approval1.
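
As a rough sketch of the two suggested forms, here adapted to the Fat column from the question: the toy tibble below is invented for illustration and assumes dplyr is installed. Wrapping the index in `which()` is an addition of mine so that rows where Fat is NA are dropped, matching what `filter()` does.

```r
library(dplyr)

# Hypothetical miniature of the data set (values are made up).
ELFS2 <- tibble(
  ID  = c("ID0103", "ID0103", "ID0104", "ID0104"),
  Fat = c(1, 0.1, NA, 1)
)

# Bare column name inside [ ]: Fat is looked up in the environment,
# not in the frame, so this fails when no such object exists.
bare_name <- tryCatch(ELFS2[Fat == 1], error = function(e) e)

# Base-style subsetting: refer to the column through the frame.
# which() keeps only positions where the comparison is TRUE.
by_index <- ELFS2[which(ELFS2$Fat == 1), ]

# dplyr: bare column names are resolved inside the frame.
by_filter <- filter(ELFS2, Fat == 1)

print(inherits(bare_name, "error"))
print(by_index)
print(by_filter)
```

Both working forms return the same two rows here; the bare-name form only succeeds when the object is a data.table or when a separate object called Fat happens to exist.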

r2evans