I'm trying out Spark and working with a JSON file.
I have this code so far:
uri = "s3://sparta-data/nyt2.json"
nyt = spark.read.json(uri).rdd
nyt
nyt.first().asDict()
It outputs the following:
(1) Spark Jobs
Out[18]: {'_id': Row($oid='5b4aa4ead3089013507db18b'),
'amazon_product_url': 'http://www.amazon.com/Odd-Hours-Dean-Koontz/dp/0553807056?tag=NYTBS-20',
'author': 'Dean R Koontz',
'bestsellers_date': Row($date=Row($numberLong='1211587200000')),
'description': 'Odd Thomas, who can communicate with the dead, confronts evil forces in a California coastal town.',
'price': Row($numberDouble=None, $numberInt='27'),
'published_date': Row($date=Row($numberLong='1212883200000')),
'publisher': 'Bantam',
'rank': Row($numberInt='1'),
'rank_last_week': Row($numberInt='0'),
'title': 'ODD HOURS',
'weeks_on_list': Row($numberInt='1')}
I want to list the distinct publishers and also count how many distinct publishers there are.
I'm trying to use this:
nyt.select('publisher').distinct().rdd
It looks like I need to convert to a DataFrame, but the nested fields have $ in their names.
Can you please direct me?
Kind regards, Anna