
I have a dataframe that has text in it. There are some words like isn't, couldn't, etc., which need to be expanded.

For example:

I'd -> I would
I'd -> I had

Below is the dataframe

DataFrame:

temp = spark.createDataFrame([
    (0, "Julia isn't awesome"),
    (1, "I wish Java-DL couldn't use case-classes"),
    (2, "Data-science wasn't my subject"),
    (3, "Machine")
], ["id", "words"])

+---+----------------------------------------+
|id |words                                   |
+---+----------------------------------------+
|0  |Julia isn't awesome                     |
|1  |I wish Java-DL couldn't use case-classes|
|2  |Data-science wasn't my subject          |
|3  |Machine                                 |
+---+----------------------------------------+

I tried searching for a library in PySpark but haven't found one. How can I achieve this?

Output:

+---+-----------------------------------------+
|id |words                                    |
+---+-----------------------------------------+
|0  |Julia is not awesome                     |
|1  |I wish Java-DL could not use case-classes|
|2  |Data-science was not my subject          |
|3  |Machine                                  |
+---+-----------------------------------------+

1 Answer


There may not be a PySpark library for this, but you can use any Python library. There are several options; for example, with the pycontractions library you can write a function and apply it to the dataframe column through a UDF.

from pycontractions import Contractions
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Load your favorite word2vec model - it needs to be downloaded first,
# available at the pycontractions link
cont = Contractions('GoogleNews-vectors-negative300.bin')
# optional, prevents loading on first expand_texts call
cont.load_models()

def expand_contractions(text):
    out = list(cont.expand_texts([text], precise=True))
    return out[0]

# Spark columns have no apply(); wrap the function in a udf instead
expand_udf = udf(expand_contractions, StringType())
temp = temp.withColumn('expanded_words', expand_udf(temp['words']))
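If downloading a word2vec model is too heavyweight for your use case, a simpler (if less precise) alternative is a plain dictionary of contractions applied with a regular expression. This is a minimal sketch, not a complete mapping, and a lookup table cannot disambiguate forms like "I'd"; the resulting function can be wrapped in a udf the same way as above.

```python
import re

# Small illustrative sample of contractions; extend as needed.
CONTRACTIONS = {
    "isn't": "is not",
    "wasn't": "was not",
    "couldn't": "could not",
    "can't": "cannot",
    "won't": "will not",
}

# Build one alternation pattern over all keys, matched case-insensitively
# at word boundaries.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_simple(text):
    # Replace each matched contraction with its dictionary expansion.
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
```

This avoids loading any model, at the cost of missing contractions that are not in the table.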