I am loading one of the 5-core datasets from

http://jmcauley.ucsd.edu/data/amazon/

using

library(sparklyr)
library(dplyr)

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "2G"
sc <- spark_connect(master = "local", config = config)
df <- spark_read_json(sc = sc, name = "videos", path = "Path/to/reviews_Office_Products_5.json")

where one of the variables is a column of text reviews, like so:

select(df, reviewText)

# Source: lazy query [?? x 1]
# Database: spark_connection
   reviewText
 1 I bought my first HP12C in about 1984 or so, and it served me faithfully until 2002 wh
 2 "WHY THIS BELATED REVIEW? I feel very obliged to share my views about this old workhor
 3 I have an HP 48GX that has been kicking for more than twenty years and an HP 11 that i
 4 I've started doing more finance stuff recently and went looking for a good time-value-
 5 For simple calculations and discounted cash flows, this one is still the best. I used
 6 While I don't have an MBA, it's hard to believe that a calculator I learned how to use
 7 I've had an HP 12C ever since they were first available, roughly twenty years ago. I'
 8 Bought this for my boss because he lost his. He loves this calculator & would not be
 9 This is a well-designed, simple calculator that handles typical four-function math. La
10 I love this calculator, big numbers and calculate excellent so easy to use and make my
# ... with more rows

I want to split the reviews into tokens, one word per row, but that has proven difficult. When I try to use the function `unnest_tokens`, I get the following error message:

library(stringr)
library(tidytext) 

Word_by_Word <- df %>% unnest_tokens(word, reviewText)

Error in unnest_tokens_.default(., word, reviewText) : unnest_tokens expects all columns of input to be atomic vectors (not lists)

What is happening? How do I fix this without using the command `pull` and coercing the data into the requested format? I cannot pull the data as suggested in "Extract a dplyr tbl column as a vector" or convert the data to a tibble, because if the database is too big and I do either of those, the computer runs out of memory, even after increasing the 2G limit and running the program on a machine with a lot of memory (that's the whole point of using dplyr in the first place).

AngryR11

1 Answer

It appears that you already have the data frame in memory. If so, the error message is pointing the way: each entry in `reviewText` is a list, and `unnest_tokens()` expects every column of its input to be an atomic vector.

Try using `unlist()` to flatten the `reviewText` column, either in place or via `mutate()`.
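For a plain local data frame (not a Spark table), a minimal sketch of that fix; the toy data and column names here are my own, chosen to reproduce the list-column shape that triggers the error:

```r
library(dplyr)
library(tidytext)

# Toy data frame whose text column is a list, the shape unnest_tokens() rejects
df_local <- tibble::tibble(
  id = 1:2,
  reviewText = list("great product", "works well")
)

# Flatten the list column to a character vector, then tokenize
tokens <- df_local %>%
  mutate(reviewText = unlist(reviewText)) %>%
  unnest_tokens(word, reviewText)

tokens$word
#> [1] "great"   "product" "works"   "well"
```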

TTNK
  • When I try using `df %>% mutate(rt=unlist(reviewText))`, I get an error message saying that the function does not exist. Could you please be more specific as to how to use unlist? – AngryR11 Aug 23 '17 at 23:08
  • `unlist()` is a base R function, so is the error in response to `mutate()`? – TTNK Aug 23 '17 at 23:17
  • the error starts like this: `Error: org.apache.spark.sql.AnalysisException: Undefined function: 'UNLIST'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.;` I think the error might be caused because unlist is not a valid function for mutate. – AngryR11 Aug 23 '17 at 23:20
  • Ok, that makes sense. The object `df` is not actually a dataframe in memory, but `spark_tbl` object? Can you use `sdf_mutate()` instead of `mutate()`? – TTNK Aug 23 '17 at 23:27
  • Yes, `df` is a spark_tbl. I got the following error `Error in UseMethod("sdf_register") : no applicable method for 'sdf_register' applied to an object of class "character"` after running `df %>% sdf_mutate(rt=unlist(reviewText))` – AngryR11 Aug 23 '17 at 23:35
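The thread stalls here because `df` is a remote Spark table: dplyr verbs on it are translated to Spark SQL, which has no `UNLIST` function, and `sdf_mutate()` is not a drop-in replacement for `mutate()`. A sketch of a Spark-side alternative that tokenizes without collecting the data into R, assuming sparklyr's `ft_tokenizer()` feature transformer and Hive's `explode()` (the column names `word_list` and `word` are my own):

```r
library(sparklyr)
library(dplyr)

# Tokenize inside Spark: ft_tokenizer() appends an array column of
# lowercased words; explode() then unnests that array into one row
# per word, all translated to Spark SQL rather than run in R
words_tbl <- df %>%
  ft_tokenizer(input_col = "reviewText", output_col = "word_list") %>%
  mutate(word = explode(word_list)) %>%
  select(word)
```

Because every step stays lazy and executes on the Spark side, this should sidestep the driver-memory problem that ruled out `pull()` in the first place.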