
I am attempting to perform some entity extraction, using a custom spaCy NER model. The extraction will be done over a Spark DataFrame, and everything is being orchestrated in a Dataproc cluster (using a Jupyter Notebook, available in the "Workbench"). The code I am using looks as follows:

# IMPORTANT: NOTICE THIS CODE WAS RUN FROM A JUPYTER NOTEBOOK (!)

import pandas as pd
import numpy as np
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName('SpacyOverPySpark') \
                    .getOrCreate()


# FUNCTIONS DEFINITION

def load_spacy_model():
    import spacy
    print("Loading spacy model...")
    return spacy.load("./spacy_model")  # This model exists locally


@pandas_udf(ArrayType(StringType()))
def entities(list_of_text: pd.Series) -> pd.Series:
    # retrieving the shared nlp object
    nlp = broadcasted_nlp.value
    # batch processing our list of text
    docs = nlp.pipe(list_of_text)
    # entity extraction (`ents` is a list[list[str]])
    ents = [
        [ent.text for ent in doc.ents]
        for doc in docs
    ]
    return pd.Series(ents)


# DUMMY DATA FOR THIS TEST

pdf = pd.DataFrame(
    [
        "Pyhton and Pandas are very important for Automation",
        "Tony Stark is a Electrical Engineer",
        "Pipe welding is a very dangerous task in Oil mining",
        "Nursing is often underwhelmed, but it's very interesting",
        "Software Engineering now opens a lot of doors for you",
        "Civil Engineering can get exiting, as you travel very often",
        "I am a Java Programmer, and I think I'm quite good at what I do",
        "Diane is never bored of doing the same thing all day",
        "My father is a Doctor, and he supports people in condition of poverty",
        "A janitor is required as soon as possible"
    ],
    columns=['postings']
)
sdf = spark.createDataFrame(pdf)


# MAIN CODE

# loading spaCy model and broadcasting it
broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
# Extracting entities
df_new = sdf.withColumn('skills', entities('postings'))
# Displaying results
df_new.show(10, truncate=20)

The error I am getting looks similar to this one, but that answer does not apply to my case, because it deals with "executing a Pyspark job in Yarn", which is different (or so I think, feel free to correct me). Plus, I have also found this, but the answer is rather vague (I gotta be honest here: the only thing I have done to "restart the Spark session" is to run spark.stop() in the last cell of my Jupyter Notebook, and then run the cells above again; feel free to correct me here too).
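
For completeness, this is the full extent of what I mean by "restarting the Spark session" (a minimal sketch of the two notebook cells involved, nothing more):

# Last cell of the notebook: tear the current session down...
spark.stop()

# ...then re-run the first cell, which builds it again:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SpacyOverPySpark') \
                    .getOrCreate()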

The code I used was heavily inspired by "Answer 2 of 2" in this forum, which makes me wonder if some missing setting is still eluding me (BTW, "Answer 1 of 2" was already tested, but did not work). As for my specific software versions, they can be found here.

Thank you.

CLARIFICATIONS:

Because some queries or hints generated in the comment section can be lengthy, I have decided to include them here:

  • No. 1: "Which command did you use to create your cluster?": I used this method, so the command was not visible "in plain sight"; however, I have just realized that, when you are about to create the cluster, there is an "EQUIVALENT COMMAND LINE" button that grants access to such a command:

[Screenshot: the "EQUIVALENT COMMAND LINE" button on the Dataproc cluster creation page]

In my case, the Dataproc cluster creation command (automatically generated by GCP) is:

gcloud dataproc clusters create my-cluster \
--enable-component-gateway \
--region us-central1 \
--zone us-central1-c \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 2.0-debian10 \
--optional-components JUPYTER \
--metadata PIP_PACKAGES=spacy==3.2.1 \
--project hidden-project-name

Notice how spaCy is installed via the cluster metadata (following these recommendations); however, running the pip freeze | grep spacy command right after the Dataproc cluster creation does not display any result (i.e., spaCy does NOT get installed successfully). To enable it, the official method is used afterwards.
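
As an extra check (my own sketch, not part of any official procedure), the following notebook cell confirms whether spaCy is importable on the driver and on the executors; find_spec returns None when a module is missing:

# Sanity check: is spaCy importable on the driver AND on the executors?
import importlib.util

def has_spacy(_):
    import importlib.util
    return importlib.util.find_spec("spacy") is not None

print("driver has spacy    :", importlib.util.find_spec("spacy") is not None)
print("executors have spacy:",
      spark.sparkContext.parallelize(range(2), 2).map(has_spacy).collect())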

  • No. 2: "Wrong path as possible cause": Not my case; it actually looks similar to this case (even though I can't say the root cause is the same for both). A quick driver-vs-executor comparison is sketched right after this list:
    • Running which python shows /opt/conda/miniconda3/bin/python as a result.
    • Running which spacy (read "Clarification No. 1") shows /opt/conda/miniconda3/bin/spacy as a result.
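
Since a mismatch between the driver's interpreter and the executors' interpreter is a classic cause of ModuleNotFoundError in PySpark, here is a small sketch (my own check, run from the notebook) to compare both:

# Compare the Python interpreter used by the driver with the one used by
# the executors; both should point at the same environment.
import sys

print("driver python  :", sys.executable)
print("executor python:",
      spark.sparkContext.parallelize([0], 1)
           .map(lambda _: __import__("sys").executable)
           .collect())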

COMMENTS:
  • Can you try `pip list` to check if it's included in your packages? – Poala Astrid Aug 25 '22 at 05:56
  • Hello @PoalaAstrid, not sure if you want to double check if the spaCy library is installed or not, please let me know in the comments if so (or if you want the whole list of packages installed). I will assume you do, in which case the answer is yes, I do have it installed in my environment: `spacy==3.2.1`, `spacy-legacy==3.0.9`, `spacy-loggers==1.0.3`. – David Espinosa Aug 25 '22 at 18:02
  • @PoalaAstrid, BTW I have also updated reference [6] on my original post, so you can have a glimpse at the whole list of packages and libraries. Thanks – David Espinosa Aug 25 '22 at 18:08
  • Could you add more details about how you installed `spacy`? Did you use the approach described in https://cloud.google.com/dataproc/docs/tutorials/python-configuration? – Dagang Aug 27 '22 at 17:36
  • Hi @DavidEspinosa, correct me if I'm wrong, but is this what your error message says "ModuleNotFoundError: No module named 'spacy'"? I got it from the link you provided since you said you got a similar error. This error could also occur when the path is wrong, you might want to check it again. – Poala Astrid Aug 29 '22 at 03:38
  • Hello @PoalaAstrid. Not the case, mine is similar to this one https://stackoverflow.com/q/69716018/16706763 (sadly still unsolved there), but the path is `/opt/conda/anaconda/bin` – David Espinosa Aug 29 '22 at 17:32
  • `--metadata PIP_PACKAGES=spacy==3.2.1 ` is an outdated way to install packages which works with Dataproc 1.3, and it requires `--initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh`. Dataproc 2.0 uses https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20 – Dagang Aug 30 '22 at 01:15
  • @DavidEspinosa your model here `spacy.load("./spacy_model")` that exists in your local, I replaced it with `spacy.load("en_core_web_sm")` since I was replicating your case and I don't have the same model as you. I have successfully run it and I will post what I got as an answer. Kindly comment under that if it will work for you. – Poala Astrid Aug 30 '22 at 03:10
  • Hello @Dagang, "Clarification" section updated, please check updated comments. – David Espinosa Aug 30 '22 at 16:52
  • I see what you did, just wondering which doc misguided you to think metadata PIP_PACKAGES is the way to install pip packages? – Dagang Aug 30 '22 at 16:58
  • @Dagang the same than you provided me a couple of days ago: https://cloud.google.com/dataproc/docs/tutorials/python-configuration . In general and unfortunately, I have not found any Google Doc about "additional packages installation during Dataproc cluster creation, using Dataproc Console" , which makes cluster creation a bit more difficult. – David Espinosa Aug 30 '22 at 17:03
  • The doc is actually correct, different versions of Dataproc have different ways to install, in your case, it is 2.0 https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20 – Dagang Aug 30 '22 at 17:25
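
To make @Dagang's comment about the legacy mechanism concrete, this is roughly what the metadata-based install looks like when it does apply, i.e., on older image versions such as 1.3 (a sketch assembled from his comment; NOT what should be used on 2.0):

gcloud dataproc clusters create my-cluster \
    --region ${REGION} \
    --metadata PIP_PACKAGES=spacy==3.2.1 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh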

1 Answer


I managed to solve this issue by combining 2 pieces of information:

  • "Configure Dataproc Python environment", "Dataproc image version 2.0" (as that is the version I am using): available here (special thanks to @Dagang in the comment section).
  • "Create a (Dataproc) cluster": available here.

Specifically, during the Dataproc cluster setup via the Google Console, I "installed" spaCy as shown below (a command-line equivalent is sketched right after the screenshot):

[Screenshot: the Dataproc cluster creation form in the Google Console, with spacy==3.2.1 added via the cluster properties described in the image 2.0 doc]
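
For anyone who prefers the command line over the Console, the equivalent change to my original cluster-creation command should look roughly like this (a sketch based on the image 2.0 doc referenced above; the dataproc:pip.packages property replaces the --metadata flag):

gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --image-version 2.0-debian10 \
    --optional-components JUPYTER \
    --enable-component-gateway \
    --properties 'dataproc:pip.packages=spacy==3.2.1' \
    --project hidden-project-name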

And when the cluster was already created, I ran the code mentioned in my original post (with NO modifications), with the following result:

[Screenshot: output of df_new.show(10, truncate=20), with the extracted entities in the new "skills" column]

That solves my original question. I am planning to apply my solution to a larger dataset, but I think whatever happens there is the subject of a different thread.
