
I have a text classification problem.

I'm particularly interested in this embedding model in Spark NLP because my dataset comes from Wikipedia in the 'sq' (Albanian) language. I need to convert the sentences of my dataset into embeddings.

I do so with WordEmbeddingsModel. However, once the embeddings are generated, I don't know how to prepare them as input for an RNN model built with Keras and TensorFlow.

My dataset has two columns, 'text' and 'label'. So far I have been able to do the following steps:

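# imports used by the steps below
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel

# start spark session
spark = sparknlp.start(gpu=True)

# convert train df into spark df
spark_train_df = spark.createDataFrame(train)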

+--------------------+-----+
|                text|label|
+--------------------+-----+
|Joy Adowaa Buolam...|    0|
|Ajo themeloi "Alg...|    1|
|Buolamwini lindi ...|    1|
|Kur ishte 9 vjeç,...|    0|
|Si një studente u...|    1|
+--------------------+-----+

# define sparknlp pipeline

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel \
    .pretrained("w2v_cc_300d", "sq") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, tokenizer, embeddings])

# fit the pipeline to the training data

model = pipeline.fit(spark_train_df)

# apply the pipeline to the training data

result = model.transform(spark_train_df)
result.show()


+--------------------+-----+--------------------+--------------------+--------------------+
|                text|label|            document|               token|          embeddings|
+--------------------+-----+--------------------+--------------------+--------------------+
|Joy Adowaa Buolam...|    0|[{document, 0, 13...|[{token, 0, 2, Jo...|[{word_embeddings...|
|Ajo themeloi "Alg...|    1|[{document, 0, 13...|[{token, 0, 2, Aj...|[{word_embeddings...|
|Buolamwini lindi ...|    1|[{document, 0, 94...|[{token, 0, 9, Bu...|[{word_embeddings...|
|Kur ishte 9 vjeç,...|    0|[{document, 0, 12...|[{token, 0, 2, Ku...|[{word_embeddings...|
|Si një studente u...|    1|[{document, 0, 15...|[{token, 0, 1, Si...|[{word_embeddings...|
|Buolamwini diplom...|    1|[{document, 0, 11...|[{token, 0, 9, Bu...|[{word_embeddings...|
+--------------------+-----+--------------------+--------------------+--------------------+

The schema of result is:

result.printSchema()



root
 |-- text: string (nullable = true)
 |-- label: long (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

The output of result.schema["embeddings"].dataType is:

ArrayType(StructType([StructField('annotatorType', StringType(), True), StructField('begin', IntegerType(), False), StructField('end', IntegerType(), False), StructField('result', StringType(), True), StructField('metadata', MapType(StringType(), StringType(), True), True), StructField('embeddings', ArrayType(FloatType(), False), True)]), True)
Aiha

1 Answer


To feed the embeddings produced by Spark NLP's WordEmbeddingsModel into an RNN built with Keras and TensorFlow, first convert the Spark DataFrame to a pandas DataFrame, pull the per-token vectors out of the embeddings column (for example with iloc), and convert them into a NumPy array. Then split the data into training and test sets, define the RNN model, train it on the training set, and evaluate it on the test set.
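As a rough illustration of those steps, here is a minimal sketch. It assumes the per-token vectors have already been collected to the driver as plain Python lists, which is only practical if the dataset fits in memory. The helper names pad_token_vectors and build_rnn, the max_len value, and the LSTM size are made up for this example and are not part of Spark NLP or Keras. Since every text has a different number of tokens, the vectors are zero-padded (or truncated) to a fixed length so they form one 3-D array of shape (samples, timesteps, features), which is the shape a Keras recurrent layer expects.

```python
import numpy as np


def pad_token_vectors(docs, max_len, dim=300):
    """Zero-pad or truncate each document's token vectors to max_len steps.

    docs is a list with one entry per text; each entry is a list of
    token vectors of length dim. Returns a float32 array of shape
    (len(docs), max_len, dim).
    """
    batch = np.zeros((len(docs), max_len, dim), dtype="float32")
    for i, vectors in enumerate(docs):
        for t, vec in enumerate(vectors[:max_len]):
            batch[i, t] = vec
    return batch


def build_rnn(max_len, dim=300):
    """Small binary classifier; sizes are arbitrary, tune for your data."""
    # Imported here so the padding helper can be used without TensorFlow.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len, dim)),
        # Skip timesteps that are all zeros, i.e. the padding added above.
        tf.keras.layers.Masking(mask_value=0.0),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


# Toy stand-in for the collected Spark output (3-dimensional vectors).
# With the real pipeline the input could be obtained roughly like this:
#   pdf = result.select("label", "embeddings.embeddings").toPandas()
#   docs = pdf["embeddings"].tolist()
#   y = pdf["label"].to_numpy()
toy_docs = [
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],  # two tokens
    [[3.0, 3.0, 3.0]],                   # one token, gets one padded step
]
X = pad_token_vectors(toy_docs, max_len=2, dim=3)
print(X.shape)
```

With the real data you would call pad_token_vectors(docs, max_len=..., dim=300), split X and y into training and test sets (for example with scikit-learn's train_test_split), and then call build_rnn(max_len).fit(X_train, y_train, ...) followed by evaluate on the test split.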

Goran