
How can I install offline Spark NLP packages without an internet connection? I've downloaded the package (recognizee_entities_dl) and uploaded it to the cluster.

I've installed Spark NLP using `pip install spark-nlp==2.5.5`. I'm using PySpark, and from the cluster I'm unable to download the packages.

Already tried:

pipeline = PretrainedPipeline.from_disk('/path/to/recognize_entities_dl')
pipeline = PretrainedPipeline.load('/path/to/recognize_entities_dl')

Errors:

'PretrainedPipeline' has no attribute 'load'

Input path does not exist:
    hdfs://...../recognize_entities_dl_en_2.4.3_2.4_1584626752821/metatdata
John Doe
  • Please add more details: are you using Scala Spark or PySpark? If you are using PySpark, you can always run `pip install package_name_downloaded` before the start of your application on each node. Ideally, you should install it when the cluster is created; installing through Docker images is another option. – Karan Sharma Aug 17 '20 at 10:10
  • So, you have installed everything already but are having trouble loading the pretrained recognize entities pipeline from disk? Do you get any error? – Shaido Aug 20 '20 at 08:20
  • What Spark version are you using? Also, you can check whether the file exists using `hdfs dfs -ls /path/to/...` (a PySpark equivalent is sketched after these comments). – Shaido Aug 20 '20 at 09:49
  • Why do you load the model that way? Is your Apache Spark version < 2.4.x? For 2.4.x the code should be: `pipeline = PretrainedPipeline('/path/to/recognize_entities_dl')` – SvitlanaGA...supportsUkraine Aug 24 '20 at 08:48
  • https://stackoverflow.com/questions/58522742/unable-to-download-the-pipeline-provided-by-spark-nlp-library – this might help you. – Rahul Raut Aug 26 '20 at 05:41
  • The path may not exist because the package name does not match the name you have given in your path. Your downloaded package name is 'recognizee_entities_dl' with a double e, but your path has only one e: '/path/to/recognize_entities_dl'. Can you check and confirm? – rmb Aug 26 '20 at 11:59
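
As a PySpark equivalent of Shaido's `hdfs dfs -ls` suggestion (and to catch the name mismatch rmb points out), here is a minimal sketch that checks whether the extracted folder actually exists on HDFS and lists its contents; the path is an assumption, adjust it to wherever you uploaded the package:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical location of the extracted package -- adjust to your cluster
    path = "/path/to/recognize_entities_dl"

    # Reach the Hadoop FileSystem through the JVM gateway (standard PySpark trick)
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    hdfs_path = jvm.org.apache.hadoop.fs.Path(path)

    if fs.exists(hdfs_path):
        # A correctly extracted pipeline should contain a 'metadata' folder and the stages
        for status in fs.listStatus(hdfs_path):
            print(status.getPath().getName())
    else:
        print("Path not found: " + path + " -- check that the folder name matches exactly")

If the folder name printed here differs from the one in your load call (for example recognizee_entities_dl vs recognize_entities_dl), that alone would explain the "Input path does not exist" error.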

1 Answer


Looking at your error:

 hdfs://...../recognize_entities_dl_en_2.4.3_2.4_1584626752821/metatdata

You should change metatdata to metadata by removing the extra "t".

Also, note the 2.4.3 in "recognize_entities_dl_en_2.4.3_2.4_1584626752821".

This indicates the pipeline was built for Spark NLP 2.4.3.

But in the question you mentioned you are using

spark-nlp==2.5.5

which is fine as long as

2.5.5 >= 2.4.3

But sometimes it causes issues.

Also, the 2.4 in "recognize_entities_dl_en_2.4.3_2.4_1584626752821" indicates it is for Apache Spark 2.4.

The Spark NLP library is built and compiled against Apache Spark 2.4.x. That is why the models and pipelines are only available for the 2.4.x version.
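
Putting that together, here is a minimal sketch of loading the extracted pipeline offline with the standard Spark ML loader; it assumes the Spark NLP jar is already available on the cluster's classpath and that the folder was uploaded to the (hypothetical) HDFS path below:

    from pyspark.sql import SparkSession
    from pyspark.ml import PipelineModel
    from sparknlp.base import LightPipeline

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical path to the extracted pipeline folder -- adjust to your cluster
    path = "hdfs:///models/recognize_entities_dl_en_2.4.3_2.4_1584626752821"

    # PipelineModel.load() reads the saved stages and their metadata/ folders
    model = PipelineModel.load(path)

    # LightPipeline lets you annotate plain Python strings without building a DataFrame
    light = LightPipeline(model)
    print(light.annotate("Google has announced the release of a beta version."))

If `PipelineModel.load()` also complains about the path, double-check the folder name and the metadata directory as described above.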

Manish