I have a string that fails to evaluate as a match with itself. I am trying to do a simple subset based on one of 8 possible values in a column:

out <- df[df$`Var name` == "string",] 

I've had it work multiple times with different strings, but for some reason this string fails. Thinking there might be a character-encoding issue, I tried to get the exact string from the source using the four avenues below, with no success. Even when I explicitly print a cell I know contains that string and copy the output into a comparison, it fails:

> df[i,j]
[1] "string"
df[i,j]=="string"  # pasted from above line

I don't understand how I can paste the exact output I was just given and have it not match.

## attempts to get exact string to paste into subset statement    
# from dput 
"IF APPLICABLE – Which of the following best characterizes the expectations with"

# from calling a specific row/col (df[i, j])
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"

# from the source pane of rstudio
IF APPLICABLE – Which of the following best characterizes the expectations with

# from the source excel file
IF APPLICABLE – Which of the following best characterizes the expectations with

I don't have a clue what could be going on here. I am explicitly drawing the string straight from the data and yet it still fails to evaluate as true. Is there something going on in the background that I'm not seeing? Am I overlooking something ridiculously simple?

edit:

I subset the data another way; below is a dput and an actual example of what I'm doing:

> dput(temp)
structure(list(`Item Stem` = "IF APPLICABLE – Which of the following best characterizes the expectations with", 
    `Item Response` = "It was required.", orgchar_group = "locale", 
    `Org Characteristic` = "Rural", N = 487, percent = 34.5145287030475, 
    `Graphs note` = NA_character_, `Report note` = NA_character_, 
    `Other note` = NA_character_, subsig = 1, overall = 0, varname = NA_character_, 
    statsig = NA_real_, use = NA_real_, difference = 9.16044821292665), .Names = c("Item Stem", 
"Item Response", "orgchar_group", "Org Characteristic", "N", 
"percent", "Graphs note", "Report note", "Other note", "subsig", 
"overall", "varname", "statsig", "use", "difference"), row.names = 288L, class = "data.frame")
> temp[1,1]
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
> temp[1,1] == "IF APPLICABLE – Which of the following best characterizes the expectations with"
[1] FALSE
cparmstrong
  • Maybe the original has non-printable characters in it. – Rui Barradas Jan 22 '18 at 18:13
  • It must be system specific. I ran your code on my windows machine and on tio.run and it evaluates as TRUE. – Mark Jan 22 '18 at 18:16
  • Works for me. You'll have to come up with an example that actually fails, I guess. – Roman Luštrik Jan 22 '18 at 18:16
  • https://tio.run/##vVSxbtswEN3zFQdNDiAIliLbypDBDdrCgOEG7pAhKGyGPkkEGFE9UnGQqf/QP@yPOEfJcGMkCjK04UDoHY/3nu4eSLudw7saLsA6aqRrCAdaWTdYzzgO33lb82Ew@wLTq6v57HL6af4Z/vz6DdelkiWYHFyJkButzVZVBdyidSBLQUI6JPWItk3AhxqlE06ZysJWuTII4QReXR3zEm3NudixO9gKC4Q/G0W4ifiyocKzrAoyjZcfaCOFxjfKfqMCLg/CrFOyrb1sSGi@tmCQZpMQaiSJlWN4lkajOB0l2WR4Nkwno/7aX0nUpYXKuFbwYro6tGAVwnqJtSHXd9wrmBtHfZdsc2tVwfGYe3GP/A@awTCEe0GVuMN301ieSleJ8wmF5tTG4hHeqDxHwkr68HkUj4dpmiVxcp6Mx6PTEKIFM1o@k4Pg4Jv@UfzjFRwZhmmDI3P4wMvZ@@ji4yTuTeVZn3nFw2feaJUehu5RN@WPk7l3kqfe@6hV0VnEfzZdg/8aIuDxk9lG1d4BSZbNQ5BaWI@CjXAiyskXOj3xL81NHMY/4OK/PCm73RM – Mark Jan 22 '18 at 18:17
  • Reading up on non-printable characters. Based on the fact that it works for two of you as pasted above, I've got a feeling that may be it. Will update if/when I figure out that's it. – cparmstrong Jan 22 '18 at 18:20

1 Answer

Turns out it was in fact a non-printable character. Shoutout to the commenters for helping me figure it out by 1) suggesting it and 2) showing that it worked for them.

I was able to figure it out using insights from here (& here) and here.

I used a grepl command (from @Tyler Rinker) to determine that there was in fact a non-ASCII character in my string, and a stringi command (from @hadley) to determine its encoding. I then used a base R solution from @Josh O'Brien to remove it. Turns out it was the hyphen (the "–" in the string).

# working in the temp df
> x <- temp[1,1]
> grepl("[^ -~]", x)
[1] TRUE
> stringi::stri_enc_mark(x)
[1] "UTF-8"
> iconv(x, "UTF-8", "ASCII", sub="")  
[1] "IF APPLICABLE  Which of the following best characterizes the expectations with"

# set x as df$`Var name` and reassign it to fix
df$`Var name` <- iconv(df$`Var name`, "UTF-8", "ASCII", sub="")
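
For anyone who wants to see exactly which character is the culprit instead of just stripping it, something like the following might help. This is only a sketch: `find_non_ascii` is a hypothetical helper name, and the example string is rebuilt with a `\u2013` escape on the assumption that the dash in the data is an en dash (U+2013). It also swaps the dash for a plain ASCII hyphen, so the text keeps its separator rather than ending up with a double space as the `iconv(..., sub="")` call does.

```r
# Assumption: the dash in the data is an en dash (U+2013). It is written as an
# escape so the example does not depend on how this script file is encoded.
x <- "IF APPLICABLE \u2013 Which of the following best characterizes the expectations with"

# Hypothetical helper: list every non-ASCII character in a single string,
# with its position and Unicode code point.
find_non_ascii <- function(s) {
  codes <- utf8ToInt(enc2utf8(s))   # one integer code point per character
  keep  <- codes > 127L             # anything above 127 is outside ASCII
  data.frame(
    position  = which(keep),
    char      = intToUtf8(codes[keep], multiple = TRUE),
    codepoint = sprintf("U+%04X", codes[keep]),
    stringsAsFactors = FALSE
  )
}

hits <- find_non_ascii(x)
print(hits)

# Replace the dash with a plain ASCII hyphen instead of dropping it
x_ascii <- gsub("\u2013", "-", x, fixed = TRUE)
print(x_ascii)
```

If the code point reported for your data is something other than U+2013, substitute that character in the `gsub()` call.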

Still don't understand it enough to explain why it happened but it's fixed now.
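
A possible way to sidestep this kind of problem without touching the data is to never paste the string at all: pull the target value out of the column itself and compare against that. The sketch below uses a made-up two-row `df` (the second stem and the `value` column are placeholders, not from the real data) and assumes the wanted stem can be picked out by its ASCII-only prefix.

```r
# Toy stand-in for the real data; only the first stem comes from the question.
df <- data.frame(
  `Var name` = c(
    "IF APPLICABLE \u2013 Which of the following best characterizes the expectations with",
    "Some other item stem"          # placeholder value
  ),
  value = c(1, 2),                  # placeholder column
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Take the candidate values straight from the data ...
stems <- unique(df$`Var name`)

# ... and select the wanted one by a prefix that contains only ASCII characters
target <- stems[grepl("^IF APPLICABLE", stems)]

# %in% is used rather than == so this still behaves if several stems match
out <- df[df$`Var name` %in% target, ]
print(out)
```

Because `target` comes from the same vector it is compared against, whatever bytes are stored for the dash are matched exactly.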

cparmstrong