I was dealing with a previous error while trying to run some Named Entity Recognition with spaCy, relying on Dataproc + PySpark. To address the "insufficient local disk space" issue mentioned in the comments of that case, I created a brand-new cluster:

gcloud dataproc clusters create spacy_tests \
--autoscaling-policy policy-dbeb \
--enable-component-gateway \
--region us-central1 \
--zone us-central1-c \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--num-secondary-workers 2 \
--secondary-worker-boot-disk-size 500 \
--num-secondary-worker-local-ssds 0 \
--image-version 1.5-debian10 \
--properties dataproc:pip.packages=spacy==3.2.1,numpy==1.19.5,dataproc:efm.spark.shuffle=primary-worker \
--optional-components ANACONDA,JUPYTER,DOCKER \
--project mentor-pilot-project

Nevertheless, I have stumbled upon a new error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-9a88da6b7731> in <module>
     39 
     40 # loading spaCy model and broadcasting it
---> 41 broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
     42 
     43 print("DATA READING (OR MANUAL DATA GENERATION)...")

<ipython-input-1-9a88da6b7731> in load_spacy_model()
     20 
     21 def load_spacy_model():
---> 22     import spacy
     23     print("\tLoading spacy model...")
     24     return spacy.load("./spacy_model")  # This model exists locally

/opt/conda/anaconda/lib/python3.7/site-packages/spacy/__init__.py in <module>
      9 
     10 # These are imported as part of the API
---> 11 from thinc.api import prefer_gpu, require_gpu, require_cpu  # noqa: F401
     12 from thinc.api import Config
     13 

/opt/conda/anaconda/lib/python3.7/site-packages/thinc/api.py in <module>
      1 from .config import Config, registry, ConfigValidationError
----> 2 from .initializers import normal_init, uniform_init, glorot_uniform_init, zero_init
      3 from .initializers import configure_normal_init
      4 from .loss import CategoricalCrossentropy, L2Distance, CosineDistance
      5 from .loss import SequenceCategoricalCrossentropy

/opt/conda/anaconda/lib/python3.7/site-packages/thinc/initializers.py in <module>
      2 import numpy
      3 
----> 4 from .backends import Ops
      5 from .config import registry
      6 from .types import FloatsXd, Shape

/opt/conda/anaconda/lib/python3.7/site-packages/thinc/backends/__init__.py in <module>
      6 
      7 from .ops import Ops
----> 8 from .cupy_ops import CupyOps, has_cupy
      9 from .numpy_ops import NumpyOps
     10 from ._cupy_allocators import cupy_tensorflow_allocator, cupy_pytorch_allocator

/opt/conda/anaconda/lib/python3.7/site-packages/thinc/backends/cupy_ops.py in <module>
     17 from .. import registry
     18 from .ops import Ops
---> 19 from .numpy_ops import NumpyOps
     20 from . import _custom_kernels
     21 from ..types import DeviceTypes

/opt/conda/anaconda/lib/python3.7/site-packages/thinc/backends/numpy_ops.pyx in init thinc.backends.numpy_ops()

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

The ValueError I am getting is most likely related to "the package sources being different" (quote), although I am not sure whether that also applies to my case. In summary, after googling the error, it seems to go away by "uninstalling a certain package and installing it again" or by "upgrading a certain package"; needless to say, I don't know which package that would be in my case, if any. I would also like to stress that (as far as I know) the default packages shipped with Dataproc cannot be controlled from my side.
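For reference, the generic workaround suggested in those search results looks like the following; note that targeting numpy is only my guess, since it is the package named in the error message, and I have not confirmed it is the culprit here:

```shell
# Generic workaround found online (unverified for my case):
# upgrade the package suspected of the binary mismatch.
# numpy is assumed here only because it appears in the error message.
pip install --upgrade "numpy>=1.20"
```

Even if this worked, I am unsure how to apply it to a Dataproc cluster where the preinstalled packages are managed for me.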

NOTES:

  • I have tried to include the output of pip list and conda list, however it was too lengthy; if you need some specific package version, please leave the command in the comment section. Leave any other requests there too; I will keep this section updated.
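  • In the meantime, this is the command I am running on the cluster to get just the versions of the packages that appear in the traceback (instead of the full listings):

```shell
# Print only name and version of the packages named in the traceback,
# instead of pasting the full (lengthy) pip list / conda list output.
pip show numpy thinc spacy | grep -E "^(Name|Version):"
```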

What can be happening here?

David Espinosa