
I've been searching for some time and keep finding broken examples and outdated links. I have a 2 GB file of JSON data that I need to process line by line, run a significant amount of code against each line, and save the reformatted data back to the cluster.

I've been trying to do this in Spark 2.0/PySpark, but I'm not having much luck. I can do it on a smaller file, but on my actual file the driver runs out of heap memory.

When I try to break the file up, I get the error described here (Spark __getnewargs__ error), though evidently for different reasons, since I'm not referencing columns.

I'm on CentOS 6 with Hortonworks, on a single-machine cluster for now. I'm really looking more for "what I should be doing" than just how to do it. I know that Spark can do this, but if there's a better way, I'm happy to explore that as well.
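
Roughly, the pipeline I have in mind looks like this (a minimal sketch; the path, field names, and selection criterion are placeholders, not my actual code):

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-line-reformat").getOrCreate()

# Read the file as plain text so each JSON record arrives as one string per line.
lines = spark.sparkContext.textFile("hdfs:///data/raw/big_file.json")

def keep(line):
    """Placeholder for the selection criteria applied to each record."""
    return json.loads(line).get("score") is not None

def reformat(line):
    """Parse one JSON record, run the heavy per-record logic, emit a CSV line."""
    record = json.loads(line)
    # ...significant per-record processing would go here...
    return ",".join(str(record.get(field, "")) for field in ("id", "score", "label"))

# Filter to the records of interest and write the flattened CSV lines back out.
lines.filter(keep).map(reformat).saveAsTextFile("hdfs:///data/processed/reformatted_csv")
```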

  • You have only 2 GB? Then why do you need Hadoop? Note: A single node is not "a cluster" by definition. How much memory are you giving to the Spark task? Default is only 2GB, so that explains exactly why it runs out of memory – OneCricketeer Mar 07 '18 at 02:09
  • A single file that is 2GB. There is quite a bit more stored on there. I've tried allocating as much as 20GB to the task, but it's running out of heap space, so that doesn't help. – AHamilton Mar 07 '18 at 08:47
  • What are you trying to achieve with JSON file? – Pradeep Bhadani Mar 07 '18 at 09:54
  • A number of things, but the problem I'm having at the moment is that I can't even iterate the file when I split it up into chunks that would be more manageable. But ultimately, I'm reading the Json, doing some calculations and evaluation, then, for certain criteria I'm turning that row into a CSV flat file. – AHamilton Mar 07 '18 at 10:55
  • 1
    Spark should work fine, but it's overkill for a relatively medium sized file that fits on a $10 flash drive. There's a difference between the driver memory and the executor memory. Which are you setting? You have 20GB available on a single node? Have you seen https://shapeshed.com/jq-json/ ? Or even this https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html – OneCricketeer Mar 08 '18 at 01:08
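
To illustrate the driver-versus-executor distinction raised in the comments: executor memory can be requested when the session is created, but in client mode the driver heap has to be sized before the driver JVM starts, e.g. with `--driver-memory` on spark-submit or in spark-defaults.conf. A minimal sketch (the 8g figures are arbitrary):

```python
from pyspark.sql import SparkSession

# Executor heap can be requested from the application itself.
# Driver heap (spark.driver.memory) generally cannot be changed here in client
# mode, because the driver JVM is already running; pass --driver-memory to
# spark-submit or set it in spark-defaults.conf instead.
spark = (SparkSession.builder
         .appName("json-line-reformat")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

# Confirm what the running application actually got.
print(spark.sparkContext.getConf().get("spark.executor.memory"))
```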

1 Answer


You can define a Hive table on top of your JSON file using a JSON SerDe, and then do the analysis with Hive or Spark.
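
A minimal sketch of that approach from PySpark, assuming the JSON is one object per line, Hive support is enabled, and hive-hcatalog-core (which provides the JsonSerDe) is on the classpath; the table name, columns, and path are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Illustrative schema; the columns must match the keys in the JSON records.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_json (
        id    STRING,
        score DOUBLE,
        label STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 'hdfs:///data/raw/'
""")

# The same table is then visible to Hive and to Spark SQL.
spark.sql("SELECT label, AVG(score) FROM events_json GROUP BY label").show()
```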

– Pradeep Bhadani