I've been searching for a while now and mostly finding broken examples and outdated links. I have a 2 GB file of JSON data that I need to process line by line, run a significant amount of code on each line, and save the reformatted data back out to the cluster.
I've been trying to do this in Spark 2.0/PySpark, but am not having much luck. I can do it on a smaller file, but on my actual file the driver runs out of heap memory.
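For context, here is a simplified sketch of the kind of pipeline I'm attempting (the paths and `process_line` are just placeholders for my actual per-record logic):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reformat-json").getOrCreate()
sc = spark.sparkContext

def process_line(line):
    # placeholder for the heavy per-record reformatting work
    record = json.loads(line)
    # ... significant processing on `record` happens here ...
    return json.dumps(record)

# read the newline-delimited JSON as plain text so each record stays on one line,
# then map the processing function over it and write the results back out
lines = sc.textFile("hdfs:///path/to/input.json")
result = lines.map(process_line)
result.saveAsTextFile("hdfs:///path/to/output")
```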
When I try to break up the file, I get the error listed here (Spark __getnewargs__ error), though obviously for different reasons, since I'm not referencing columns.
I'm on CentOS 6 with Hortonworks, on a single-machine cluster for now. I'm really asking more about what I should be doing than just how to do it. I know Spark can do this, but if there's a better way, I'm happy to explore that as well.