
I'm trying to extract WET files from the public CommonCrawl data hosted on S3, working from my EMR cluster. CommonCrawl provides a cc-pyspark repo with examples and instructions, but I don't understand the instructions well enough to get things going. How do I deploy this repo to my cluster? Should it be part of my bootstrap script?

The end goal is to process the text in the WET files via a Spark job. So far I've been using the hosted notebooks to try to download WET files with boto3, without success.
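For reference, a minimal sketch of what fetching a single WET file with boto3 can look like, iterating its records with warcio (one of cc-pyspark's dependencies). This is illustrative, not the poster's code: the bucket name is the public `commoncrawl` bucket, unsigned requests are assumed to be allowed (on EMR the instance role credentials would also work), and the WET key is a placeholder that would need to come from the crawl's `wet.paths.gz` listing.

```python
from io import BytesIO

import boto3
from botocore import UNSIGNED
from botocore.client import Config
from warcio.archiveiterator import ArchiveIterator

# Placeholder key -- substitute a real path from the crawl's wet.paths listing.
WET_KEY = "crawl-data/CC-MAIN-2020-40/segments/.../wet/...warc.wet.gz"

# The commoncrawl bucket is publicly readable, so unsigned requests are enough here.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
body = s3.get_object(Bucket="commoncrawl", Key=WET_KEY)["Body"].read()

for record in ArchiveIterator(BytesIO(body)):
    if record.rec_type == "conversion":  # WET files store extracted text as 'conversion' records
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(text[:200])
        break
```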

Here is the code I used to bootstrap EMR with the additional Python packages.

  • Does your bootstrap script install all the dependencies? `pip install -r requirements.txt` will not work as a bootstrap action. You need to copy the requirements file, or explicitly install the dependencies (they are not that many, I just opened the file). – SQL.injection Sep 28 '20 at 11:49
  • Yes, I believe I captured all the dependencies of that repo in my bootstrap script. – willwrighteng Sep 28 '20 at 18:08
  • The cc-pyspark project was developed to be run via the command line, without the need to install cc-pyspark on the cluster beforehand. Just run a script and "deploy" sparkcc.py via --py-files: `spark-submit --py-files sparkcc.py word_count.py`. I haven't ever tried to run it from a notebook - some modifications or workarounds are likely required, mostly because cc-pyspark relies on argparse (see the notebook-oriented sketch after these comments). – Sebastian Nagel Sep 28 '20 at 20:27
  • Wow, I didn't expect you to respond to this personally. Thank you so much! I only discovered CommonCrawl recently and find it fascinating. I'll give your suggestion a try. Thanks again! – willwrighteng Sep 29 '20 at 02:17
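For the notebook case mentioned in the comments, here is a rough sketch of one possible workaround: skip cc-pyspark's argparse-driven entry point entirely and process WET files with plain PySpark plus warcio. Everything here is an assumption, not cc-pyspark's API: the notebook's SparkSession is assumed to be available as `spark`, boto3 and warcio are assumed to be installed on the executors (e.g. via the bootstrap action), and the WET key is a placeholder that would come from the crawl's wet.paths listing.

```python
from io import BytesIO

import boto3
from botocore import UNSIGNED
from botocore.client import Config
from warcio.archiveiterator import ArchiveIterator

def wet_texts(wet_key):
    """Fetch one WET file from the public commoncrawl bucket and yield its text records."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    body = s3.get_object(Bucket="commoncrawl", Key=wet_key)["Body"].read()
    for record in ArchiveIterator(BytesIO(body)):
        if record.rec_type == "conversion":  # plain-text extraction records
            yield record.content_stream().read().decode("utf-8", errors="replace")

# Placeholder list -- in practice, read keys from the crawl's wet.paths file.
wet_keys = ["crawl-data/CC-MAIN-2020-40/segments/.../wet/...warc.wet.gz"]

# `spark` is the SparkSession provided by the EMR notebook.
sc = spark.sparkContext
word_counts = (sc.parallelize(wet_keys, numSlices=len(wet_keys))
                 .flatMap(wet_texts)            # one text blob per WET record
                 .flatMap(lambda t: t.split())
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))

print(word_counts.take(10))
```

This mirrors what cc-pyspark's word_count.py example does at a much smaller scale; for a real job, the `spark-submit --py-files sparkcc.py` route from the comment above is the intended path.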

0 Answers