
I've been searching for some time and keep finding broken examples and outdated links. I have a 2 GB file of JSON data that I need to process line by line, run a significant amount of code against each line, and save the reformatted data back to the cluster.

I've been trying to do this in Spark 2.0/PySpark, but I'm not having much luck. I can do it on a smaller file, but on my actual file the driver runs out of heap memory.

When I try to break the file up, I get the error described here (Spark __getnewargs__ error), though evidently for different reasons, since I'm not referencing columns.

I'm on CentOS 6 with Hortonworks, on a single-machine cluster for now. I'm really looking more for "what I should be doing" than just how to do it. I know that Spark can do this, but if there's a better way, I'm happy to explore that as well.
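
Roughly, the pipeline I have in mind looks like this (a minimal sketch; the path, field names, and selection criterion are placeholders, not my actual code):

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-line-reformat").getOrCreate()

# Read the file as plain text so each JSON record arrives as one string per line.
lines = spark.sparkContext.textFile("hdfs:///data/raw/big_file.json")

def keep(line):
    """Placeholder for the selection criteria applied to each record."""
    return json.loads(line).get("score") is not None

def reformat(line):
    """Parse one JSON record, run the heavy per-record logic, emit a CSV line."""
    record = json.loads(line)
    # ...significant per-record processing would go here...
    return ",".join(str(record.get(field, "")) for field in ("id", "score", "label"))

# Filter to the records of interest and write the flattened CSV lines back out.
lines.filter(keep).map(reformat).saveAsTextFile("hdfs:///data/processed/reformatted_csv")
```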

  • You have only 2 GB? Then why do you need Hadoop? Note: A single node is not "a cluster" by definition. How much memory are you giving to the Spark task? Default is only 2GB, so that explains exactly why it runs out of memory – OneCricketeer Mar 07 '18 at 02:09
  • A single file that is 2GB. There is quite a bit more stored on there. I've tried allocating as much as 20GB to the task, but it's running out of heap space, so that doesn't help. – AHamilton Mar 07 '18 at 08:47
  • What are you trying to achieve with JSON file? – Pradeep Bhadani Mar 07 '18 at 09:54
  • A number of things, but the problem I'm having at the moment is that I can't even iterate the file when I split it up into chunks that would be more manageable. But ultimately, I'm reading the Json, doing some calculations and evaluation, then, for certain criteria I'm turning that row into a CSV flat file. – AHamilton Mar 07 '18 at 10:55
  • 1
    Spark should work fine, but it's overkill for a relatively medium sized file that fits on a $10 flash drive. There's a difference between the driver memory and the executor memory. Which are you setting? You have 20GB available on a single node? Have you seen https://shapeshed.com/jq-json/ ? Or even this https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html – OneCricketeer Mar 08 '18 at 01:08
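
To illustrate the driver-versus-executor distinction raised in the comments: executor memory can be requested when the session is created, but in client mode the driver heap has to be sized before the driver JVM starts, e.g. with `--driver-memory` on spark-submit or in spark-defaults.conf. A minimal sketch (the 8g figures are arbitrary):

```python
from pyspark.sql import SparkSession

# Executor heap can be requested from the application itself.
# Driver heap (spark.driver.memory) generally cannot be changed here in client
# mode, because the driver JVM is already running; pass --driver-memory to
# spark-submit or set it in spark-defaults.conf instead.
spark = (SparkSession.builder
         .appName("json-line-reformat")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

# Confirm what the running application actually got.
print(spark.sparkContext.getConf().get("spark.executor.memory"))
```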

1 Answer


You can define a Hive table on top of your JSON file using a JSON SerDe, and then do the analysis with Hive or Spark.
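
A minimal sketch of that approach from PySpark, assuming the JSON is one object per line, Hive support is enabled, and hive-hcatalog-core (which provides the JsonSerDe) is on the classpath; the table name, columns, and path are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Illustrative schema; the columns must match the keys in the JSON records.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_json (
        id    STRING,
        score DOUBLE,
        label STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 'hdfs:///data/raw/'
""")

# The same table is then visible to Hive and to Spark SQL.
spark.sql("SELECT label, AVG(score) FROM events_json GROUP BY label").show()
```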

– Pradeep Bhadani