This appears to be malformed ndjson, since it uses single-quotes. It is otherwise usable, though, so we can parse it as JSON (after a small fix); trying to parse JSON with regular expressions will likely bring pain, frustration, and regret.
Data:
vec <- c("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",
"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",
"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]")
Brute-force with Regular Expressions
I'm not even going to attempt it; see Regex for parsing single key: values out of JSON in Javascript. Other Q&As exist for other languages, and they all say the same thing: do not attempt. You can contrive a regex that handles "perfect" and "identically-structured" JSON, but as soon as valid-but-differently-structured JSON comes along, you're done.
Avoid this.
(Using any string-split function is effectively attempting vanilla/fixed regular-expression work. I'm not trying to start a flame-war over the semantics of str_split_fixed, strsplit, ... versus complex regex operations; I think the end result is that there are great JSON parsers out there, and they are much better/faster/more robust than any string-splitter we can concoct.)
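To illustrate the fragility (not to endorse the approach), here is a sketch of a regex that happens to pull the names out of this exact sample; the pattern and its assumptions (exactly one space after the colon, no apostrophes or escapes inside values, no nesting) are mine:
regmatches(vec, gregexpr("(?<='name': ')[^']+", vec, perl = TRUE))
# [[1]]
# [1] "Animation" "Comedy"    "Family"
# (and so on for the other three strings)
Reorder the keys, change the whitespace, or add an escaped quote, and it silently returns the wrong thing or nothing at all.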
Brute-force with jsonlite::fromJSON
The first cut at the code just processes the vector as-is. It's a little inefficient in that it calls fromJSON independently for each element. If your vector is short then this might not be a problem; if it is taking a while, you might want to proceed to the use of stream_in. (Note that this is not treating it like ndjson.)
ret <- lapply(gsub("'",'"',vec), jsonlite::fromJSON)
str(ret)
# List of 4
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ id : int [1:3] 16 35 10751
# ..$ name: chr [1:3] "Animation" "Comedy" "Family"
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ id : int [1:3] 12 14 10751
# ..$ name: chr [1:3] "Adventure" "Fantasy" "Family"
# $ :'data.frame': 2 obs. of 2 variables:
# ..$ id : int [1:2] 10749 35
# ..$ name: chr [1:2] "Romance" "Comedy"
# $ :'data.frame': 3 obs. of 2 variables:
# ..$ id : int [1:3] 35 18 10749
# ..$ name: chr [1:3] "Comedy" "Drama" "Romance"
To extract just the names, this works:
lapply(ret, `[[`, "name")
# [[1]]
# [1] "Animation" "Comedy" "Family"
# [[2]]
# [1] "Adventure" "Fantasy" "Family"
# [[3]]
# [1] "Romance" "Comedy"
# [[4]]
# [1] "Comedy" "Drama" "Romance"
Note that your columnar output format won't work with the sample you gave me without first extending each vector to the same length (for one way to do that, see https://stackoverflow.com/a/34570893/3358272).
ret2 <- lapply(ret, `[[`, "name")
ret2 <- lapply(ret2, `length<-`, max(lengths(ret2)))
do.call(rbind, ret2)
# [,1] [,2] [,3]
# [1,] "Animation" "Comedy" "Family"
# [2,] "Adventure" "Fantasy" "Family"
# [3,] "Romance" "Comedy" NA
# [4,] "Comedy" "Drama" "Romance"
(This is a matrix; you can take it to the next level from here.)
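For instance, one possible next level (purely illustrative) is to wrap the matrix in a frame, which gets default V1..V3 column names:
# the short row has already been padded with NA, so this binds cleanly
as.data.frame(do.call(rbind, ret2))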
Slightly faster with jsonlite::stream_in
(This is treating your data like NDJSON. It's a nuance, not critical.)
If you need to speed this up a little, you can use jsonlite::stream_in either on the original raw file (preferred) or, if you don't have it in a raw file, then we'll textConnection-ize things and fake it (slightly less efficient than a raw file, but should still be faster than jsonlite::fromJSON).
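For reference, the file route would look something like this sketch, where "genres.ndjson" is a hypothetical file that already contains one double-quoted JSON array per line:
ret <- jsonlite::stream_in(file("genres.ndjson"), simplifyDataFrame = FALSE)
Lacking such a file, the textConnection fake-out: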
jsonlite::stream_in(textConnection(paste(gsub("'",'"',vec), collapse="\n")))
# Imported 4 records. Simplifying...
# Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
# arguments imply differing number of rows: 3, 2
I kept this error here to demonstrate that jsonlite's default is to try to convert simple lists into frames. It doesn't always work; in this case it is trying to rbind them, and the third element (with one fewer entry) is causing the problem. We can mitigate this by turning off frame-ification.
str(
ret <- jsonlite::stream_in(textConnection(paste(gsub("'",'"',vec), collapse="\n")),
simplifyDataFrame=FALSE)
)
# Imported 4 records. Simplifying...
# List of 4
# $ :List of 3
# ..$ :List of 2
# .. ..$ id : int 16
# .. ..$ name: chr "Animation"
# ..$ :List of 2
# .. ..$ id : int 35
# .. ..$ name: chr "Comedy"
# ..$ :List of 2
# .. ..$ id : int 10751
# .. ..$ name: chr "Family"
# $ :List of 3
# ..$ :List of 2
# .. ..$ id : int 12
# .. ..$ name: chr "Adventure"
# ..$ :List of 2
# .. ..$ id : int 14
# .. ..$ name: chr "Fantasy"
# ..$ :List of 2
# .. ..$ id : int 10751
# .. ..$ name: chr "Family"
# $ :List of 2
# ..$ :List of 2
# .. ..$ id : int 10749
# .. ..$ name: chr "Romance"
# ..$ :List of 2
# .. ..$ id : int 35
# .. ..$ name: chr "Comedy"
# $ :List of 3
# ..$ :List of 2
# .. ..$ id : int 35
# .. ..$ name: chr "Comedy"
# ..$ :List of 2
# .. ..$ id : int 18
# .. ..$ name: chr "Drama"
# ..$ :List of 2
# .. ..$ id : int 10749
# .. ..$ name: chr "Romance"
This is a little different from the first attempt, but the names are not at all hard to extract.
lapply(ret, sapply, `[[`, "name")
# [[1]]
# [1] "Animation" "Comedy" "Family"
# [[2]]
# [1] "Adventure" "Fantasy" "Family"
# [[3]]
# [1] "Romance" "Comedy"
# [[4]]
# [1] "Comedy" "Drama" "Romance"
(Use the same steps as above to column-ize things; spelled out below.)
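That is, the same trio of steps from the fromJSON section, producing the identical matrix:
ret2 <- lapply(ret, sapply, `[[`, "name")
ret2 <- lapply(ret2, `length<-`, max(lengths(ret2)))
do.call(rbind, ret2)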
This lapply(ret, sapply, ...) call may seem odd in that it is a double-*apply call. It is equivalent to (but shorter than):
lapply(ret, function(x) sapply(x, `[[`, "name"))