0

I have dataframe like this

situations <- c("{17: '', 80: '', 55: '693', 29: '', 103: '19', 233: '872', 6: '', 20: '', 230: '99.3', 215: '', 102: '47.7', 56: 'Center', 146: '85.1', 147: '40.6', 23: '', 231: '47.8'}", "{103: '1.9', 18: '', 154: '', 147: '48.6', 22: '', 233: '879', 76: '', 459: '', 55: '719', 29: '', 102: '54.2', 56: 'Center', 328: '', 146: '94.7', 20: ''}", "{215: '', 22: '', 56: 'Center', 233: '731', 103: '19', 78: '', 230: '97.7', 146: '78.2', 20: '', 102: '50.4', 29: '', 18: '', 55: '899', 147: '43.3', 82: '', 231: '48.7'}")

events <- c("A", "B", "C")

df <- data.frame(situations, events)

And I want to filter out rows containing 6 (or 6: ''). Could you help me please?

I've tried grepl to select rows, but output is not desired.

df$filter <- as.integer(grepl('6:', df$situations))

It leaves all values containing 6. Like 76, 146, 56 etc

In Python I use such simple code for this task, but cannot find something similar in R.

df['is_ok'] = df['situations'].apply(lambda x: True if 6 in x else False)

Solution

I've found the desired output thanks to @MrFlick

as.integer(grepl('\\b6:', df$situations))

Thanks everyone

Delopera
  • 75
  • 6
  • 2
    Is that JSON? You need to parse it first and then filter. Have a look at [this](https://stackoverflow.com/questions/2061897/parse-json-with-r) – Sotos Mar 20 '23 at 14:31
  • 1
    Do you want to filter the rows out or do you want to create a new column with an indicator value? There is no `df$qualifiers` in your example. Did you mean `df$situations `? It seem the data in python must be in a different format. Like you have actually parsed the string into a dict but in R you just have a string. You could use `as.integer(grepl('\\b6:', df$situations))` to make sure the "6:" is proceeded by a word boundary – MrFlick Mar 20 '23 at 14:33
  • @MrFlick ahhh yes. That looks more like a python dictionary rather than JSON. In that case read it in a pandas and save it as csv? – Sotos Mar 20 '23 at 14:36
  • @Sotos I don't think it is JSON, otherwise the keys would have to be strings? I think MrFlick is correct that it's a dict from Python that hasn't been converted to a native R data type like a list. – joran Mar 20 '23 at 14:36
  • It's not legal JSON, but it can be coerced easily. What is your expected output? It's not really a frame-like object; at best it's a list of frames, each with 1 row and 15-16 columns. – r2evans Mar 20 '23 at 14:37
  • @MrFlick yes, I mean df$situations (already edited). Sorry. Actually I want to filter the rows. I've tried to create new column with an indicator value as first step of filtering. Maybe there is more simple way. I've tried your suggestion but it returns with 0 rows with indicator 1. – Delopera Mar 20 '23 at 14:44
  • @Sotos no, it's not JSON. I have large csv with thousand of rows and 15 columns. "situations" – just one of this columns. – Delopera Mar 20 '23 at 14:45
  • @r2evans I want to exclude rows containing 6 (or 6: '' to be precise) – Delopera Mar 20 '23 at 14:48
  • @Delopera fixed, see the edit to my answer – r2evans Mar 20 '23 at 14:55
  • So something like this? `df %>% separate_rows(situations, sep = ', ') %>% separate(situations, into = c('v1', 'v2'), sep = ':') %>% filter(v1 != 6 & v2 != '') ` ... (Unpolished) – Sotos Mar 20 '23 at 15:01
  • @Delopera When I use `as.integer(grepl('\\b6:', df$situations))` on the sample in your question I get `1 0 0` as output. Are you sure you don't get any 1's? Did you copy/paste the input above? – MrFlick Mar 20 '23 at 15:09
  • @MrFlick sorry, my fault. It works perfect. Thank you. – Delopera Mar 20 '23 at 15:18

2 Answers2

1

It's not truly JSON, but we can coerce it with a little regex.

situations |>
  gsub("'", '"', x = _) |>
  gsub("([^ {,]+):", '"\\1":', x = _) |>
  lapply(jsonlite::fromJSON) |>
  Filter(function(z) !"6" %in% names(z), x = _) |>
  str()
# List of 2
#  $ :List of 15
#   ..$ 103: chr "1.9"
#   ..$ 18 : chr ""
#   ..$ 154: chr ""
#   ..$ 147: chr "48.6"
#   ..$ 22 : chr ""
#   ..$ 233: chr "879"
#   ..$ 76 : chr ""
#   ..$ 459: chr ""
#   ..$ 55 : chr "719"
#   ..$ 29 : chr ""
#   ..$ 102: chr "54.2"
#   ..$ 56 : chr "Center"
#   ..$ 328: chr ""
#   ..$ 146: chr "94.7"
#   ..$ 20 : chr ""
#  $ :List of 16
#   ..$ 215: chr ""
#   ..$ 22 : chr ""
#   ..$ 56 : chr "Center"
#   ..$ 233: chr "731"
#   ..$ 103: chr "19"
#   ..$ 78 : chr ""
#   ..$ 230: chr "97.7"
#   ..$ 146: chr "78.2"
#   ..$ 20 : chr ""
#   ..$ 102: chr "50.4"
#   ..$ 29 : chr ""
#   ..$ 18 : chr ""
#   ..$ 55 : chr "899"
#   ..$ 147: chr "43.3"
#   ..$ 82 : chr ""
#   ..$ 231: chr "48.7"

(Requires R >= 4.2 for the _ special argument.)

If you need numbers to be numbers, we can use type.convert:

situations |>
  gsub("'", '"', x = _) |>
  gsub("([^ {,]+):", '"\\1":', x = _) |>
  lapply(function(z) type.convert(jsonlite::fromJSON(z), as.is=TRUE)) |>
  Filter(function(z) !"6" %in% names(z), x = _) |>
  str()
# List of 2
#  $ :List of 15
#   ..$ 103: num 1.9
#   ..$ 18 : logi NA
#   ..$ 154: logi NA
#   ..$ 147: num 48.6
#   ..$ 22 : logi NA
#   ..$ 233: int 879
#   ..$ 76 : logi NA
#   ..$ 459: logi NA
#   ..$ 55 : int 719
#   ..$ 29 : logi NA
#   ..$ 102: num 54.2
#   ..$ 56 : chr "Center"
#   ..$ 328: logi NA
#   ..$ 146: num 94.7
#   ..$ 20 : logi NA
#  $ :List of 16
#   ..$ 215: logi NA
#   ..$ 22 : logi NA
#   ..$ 56 : chr "Center"
#   ..$ 233: int 731
#   ..$ 103: int 19
#   ..$ 78 : logi NA
#   ..$ 230: num 97.7
#   ..$ 146: num 78.2
#   ..$ 20 : logi NA
#   ..$ 102: num 50.4
#   ..$ 29 : logi NA
#   ..$ 18 : logi NA
#   ..$ 55 : int 899
#   ..$ 147: num 43.3
#   ..$ 82 : logi NA
#   ..$ 231: num 48.7
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Sorry for not clarifying this in the question but I have another columns in my dataframe. And I would not like to add another 15-16. Because of 6 can be on different positions. My expected output is dataframe without rows with 6 in "situations". – Delopera Mar 20 '23 at 15:33
0

Probably you can try py_eval from reticulate to parse the string vector situations, e.g.,

library(reticulate)
lapply(
  situations,
  function(x) {
    d <- py_eval(x)
    d[endsWith(names(d), "6")]
  }
)

and you will obtain

[[1]]
[[1]]$`6`
[1] ""

[[1]]$`56`
[1] "Center"

[[1]]$`146`
[1] "85.1"


[[2]]
[[2]]$`76`
[1] ""

[[2]]$`56`
[1] "Center"

[[2]]$`146`
[1] "94.7"


[[3]]
[[3]]$`56`
[1] "Center"

[[3]]$`146`
[1] "78.2"
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81