This appears to be malformed ndjson, since it uses single-quotes. It is otherwise usable, though, so we can parse it as JSON (after a small fix); trying to parse JSON with regular expressions will likely bring pain, frustration, and regret.
Data:
vec <- c("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",
"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",
"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]")
Brute-force with Regular Expressions
I'm not even going to attempt it; see Regex for parsing single key: values out of JSON in Javascript. Other Q&As exist for other languages, and they all say the same thing: do not attempt. You can contrive a regex that handles "perfect" and "identically-structured" JSON, but as soon as valid-but-differently-structured JSON comes along, you're done.
Avoid this.
(Using any string-split function is effectively attempting vanilla/fixed regular-expression work. I'm not trying to start a flame-war over the semantics of str_split_fixed, strsplit, ... versus complex regex operations; I think the end result is that there are great JSON parsers out there, and they are much better/faster/more robust than any string-splitter we can concoct.)
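To illustrate the fragility (not to endorse the approach), here is a sketch of a regex that happens to pull the names out of this exact sample; the pattern and its assumptions (exactly one space after the colon, no apostrophes or escapes inside values, no nesting) are mine:
regmatches(vec, gregexpr("(?<='name': ')[^']+", vec, perl = TRUE))
# [[1]]
# [1] "Animation" "Comedy"    "Family"
# (and so on for the other three strings)
Reorder the keys, change the whitespace, or add an escaped quote, and it silently returns the wrong thing or nothing at all.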
Brute-force with jsonlite::fromJSON
The first cut at the code just processes the vector as-is. It's a little inefficient in that it calls fromJSON independently for each element. If your vector is short then this might not be a problem; if it is taking a while, you might want to proceed to the use of stream_in. (Note that this is not treating it like ndjson.)
ret <- lapply(gsub("'",'"',vec), jsonlite::fromJSON)
str(ret)
# List of 4
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ id : int [1:3] 16 35 10751
# ..$ name: chr [1:3] "Animation" "Comedy" "Family"
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ id : int [1:3] 12 14 10751
# ..$ name: chr [1:3] "Adventure" "Fantasy" "Family"
# $ :'data.frame': 2 obs. of 2 variables:
# ..$ id : int [1:2] 10749 35
# ..$ name: chr [1:2] "Romance" "Comedy"
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ id : int [1:3] 35 18 10749
# ..$ name: chr [1:3] "Comedy" "Drama" "Romance"
To extract just the names, this works:
lapply(ret, `[[`, "name")
# [[1]]
# [1] "Animation" "Comedy" "Family"
# [[2]]
# [1] "Adventure" "Fantasy" "Family"
# [[3]]
# [1] "Romance" "Comedy"
# [[4]]
# [1] "Comedy" "Drama" "Romance"
Note that your columnar output format won't work with the sample you gave me without first extending each vector to the same length (for one way to do that, see https://stackoverflow.com/a/34570893/3358272).
ret2 <- lapply(ret, `[[`, "name")
ret2 <- lapply(ret2, `length<-`, max(lengths(ret2)))
do.call(rbind, ret2)
# [,1] [,2] [,3]
# [1,] "Animation" "Comedy" "Family"
# [2,] "Adventure" "Fantasy" "Family"
# [3,] "Romance" "Comedy" NA
# [4,] "Comedy" "Drama" "Romance"
(This is a matrix; you can take it to the next level from here.)
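For instance, one possible next level (purely illustrative) is to wrap the matrix in a frame, which gets default V1..V3 column names:
# the short row has already been padded with NA, so this binds cleanly
as.data.frame(do.call(rbind, ret2))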
Slightly faster with jsonlite::stream_in
(This is treating your data like NDJSON. It's a nuance, not critical.)
If you need to speed this up a little, you can use jsonlite::stream_in either on the original raw file (preferred) or, if you don't have it in a raw file, then we'll textConnection-ize things and fake it (slightly less efficient than a raw file, but should still be faster than jsonlite::fromJSON).
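For reference, the file route would look something like this sketch, where "genres.ndjson" is a hypothetical file that already contains one double-quoted JSON array per line:
ret <- jsonlite::stream_in(file("genres.ndjson"), simplifyDataFrame = FALSE)
Lacking such a file, the textConnection fake-out: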
jsonlite::stream_in(textConnection(paste(gsub("'",'"',vec), collapse="\n")))
# Imported 4 records. Simplifying...
# Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
# arguments imply differing number of rows: 3, 2
I kept this error here to demonstrate that jsonlite's default is to try to convert simple lists into frames. It doesn't always work; in this case it is trying to rbind them, and the third element (with one fewer entry) is causing the problem. We can mitigate this by turning off frame-ification.
str(
ret <- jsonlite::stream_in(textConnection(paste(gsub("'",'"',vec), collapse="\n")),
simplifyDataFrame=FALSE)
)
# Imported 4 records. Simplifying...
# List of 4
# $ :List of 3
# ..$ :List of 2
# .. ..$ id : int 16
# .. ..$ name: chr "Animation"
# ..$ :List of 2
# .. ..$ id : int 35
# .. ..$ name: chr "Comedy"
# ..$ :List of 2
# .. ..$ id : int 10751
# .. ..$ name: chr "Family"
# $ :List of 3
# ..$ :List of 2
# .. ..$ id : int 12
# .. ..$ name: chr "Adventure"
# ..$ :List of 2
# .. ..$ id : int 14
# .. ..$ name: chr "Fantasy"
# ..$ :List of 2
# .. ..$ id : int 10751
# .. ..$ name: chr "Family"
# $ :List of 2
# ..$ :List of 2
# .. ..$ id : int 10749
# .. ..$ name: chr "Romance"
# ..$ :List of 2
# .. ..$ id : int 35
# .. ..$ name: chr "Comedy"
# $ :List of 3
# ..$ :List of 2
# .. ..$ id : int 35
# .. ..$ name: chr "Comedy"
# ..$ :List of 2
# .. ..$ id : int 18
# .. ..$ name: chr "Drama"
# ..$ :List of 2
# .. ..$ id : int 10749
# .. ..$ name: chr "Romance"
This is a little different from the first attempt, but the names are not at all hard to extract.
lapply(ret, sapply, `[[`, "name")
# [[1]]
# [1] "Animation" "Comedy" "Family"
# [[2]]
# [1] "Adventure" "Fantasy" "Family"
# [[3]]
# [1] "Romance" "Comedy"
# [[4]]
# [1] "Comedy" "Drama" "Romance"
(Use the same steps as above to column-ize things; spelled out below.)
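That is, the same trio of steps from the fromJSON section, producing the identical matrix:
ret2 <- lapply(ret, sapply, `[[`, "name")
ret2 <- lapply(ret2, `length<-`, max(lengths(ret2)))
do.call(rbind, ret2)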
This lapply(ret, sapply, ...) call may seem odd in that it is a double-*apply call. It is equivalent to (but shorter than):
lapply(ret, function(x) sapply(x, `[[`, "name"))