
I'm trying out Spark and working with a JSON file.

I have this code so far:

    uri = "s3://sparta-data/nyt2.json"
    nyt = spark.read.json(uri).rdd
    nyt

    nyt.first().asDict()

It outputs the following:

    Out[18]: {'_id': Row($oid='5b4aa4ead3089013507db18b'),
 'amazon_product_url': 'http://www.amazon.com/Odd-Hours-Dean-Koontz/dp/0553807056?tag=NYTBS-20',
 'author': 'Dean R Koontz',
 'bestsellers_date': Row($date=Row($numberLong='1211587200000')),
 'description': 'Odd Thomas, who can communicate with the dead, confronts evil forces in a California coastal town.',
 'price': Row($numberDouble=None, $numberInt='27'),
 'published_date': Row($date=Row($numberLong='1212883200000')),
 'publisher': 'Bantam',
 'rank': Row($numberInt='1'),
 'rank_last_week': Row($numberInt='0'),
 'title': 'ODD HOURS',
 'weeks_on_list': Row($numberInt='1')}

I want to output the distinct publishers and also count how many distinct publishers there are.

I'm trying to use this:

    nyt.select('publisher').distinct().rdd

It looks like I need to convert it back to a DataFrame, but my fields contain `$`.

Can you please direct me?

Kind regards, Anna

  • please try to remove the `.rdd` when reading the json file – werner Sep 01 '21 at 18:03
  • Thank you. Much better now. – Anna Sep 01 '21 at 18:37
  • I ran this code: `nyt.groupBy('publisher').count().orderBy('count').show()`. It sorts ascending; how can I sort descending to see the publisher with the most books? – Anna Sep 01 '21 at 18:38
  • please have look at [this answer](https://stackoverflow.com/a/34514782/2129801) – werner Sep 01 '21 at 18:41
  • Thank you, awesome. We are covering Spark SQL tomorrow. But I managed to make it work for my task already. Many thanks – Anna Sep 01 '21 at 19:28

0 Answers