34

Say I have a dataset, like

import pandas as pd
import seaborn as sns

iris = pd.DataFrame(sns.load_dataset('iris'))

I can use Spacy with .apply to parse a string column into tokens (my real dataset has more than one word/token per entry, of course):

import spacy # (I have version 1.8.2)
nlp = spacy.load('en')
iris['species_parsed'] = iris['species'].apply(nlp)

result:

   sepal_length   ... species    species_parsed
0           5.1   ... setosa          (setosa)
1           4.9   ... setosa          (setosa)
2           4.7   ... setosa          (setosa)

I can also use this convenient multiprocessing function (thanks to this blog post) to run fairly arbitrary apply functions on a dataframe in parallel:

import numpy as np
import pandas as pd
from multiprocessing import Pool, cpu_count

def parallelize_dataframe(df, func, num_partitions):
    # split the dataframe into chunks, run func on each chunk in a
    # separate worker process, then stitch the results back together
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

for example:

def my_func(df):
    df['length_of_word'] = df['species'].apply(lambda x: len(x))
    return df

num_cores = cpu_count()
iris = parallelize_dataframe(iris, my_func, num_cores)

result:

   sepal_length species  length_of_word
0           5.1  setosa               6
1           4.9  setosa               6
2           4.7  setosa               6

...But for some reason, I can't apply the Spacy parser to a dataframe using multiprocessing this way.

def add_parsed(df):
    df['species_parsed'] = df['species'].apply(nlp)
    return df

iris = parallelize_dataframe(iris, add_parsed, num_cores)

result:

   sepal_length species  length_of_word species_parsed
0           5.1  setosa               6             ()
1           4.9  setosa               6             ()
2           4.7  setosa               6             ()

Is there some other way to do this? I'm loving Spacy for NLP, but I have a lot of text data and would like to parallelize some processing functions, which is where I ran into this issue.

Max Power

1 Answer

42

Spacy is highly optimised and does the multiprocessing for you. As a result, I think your best bet is to take the data out of the DataFrame and pass it to the Spacy pipeline as a list, rather than trying to use .apply directly.

You then need to collate the results of the parse and put them back into the DataFrame.

So, in your example, you could use something like:

tokens = []
lemma = []
pos = []

for doc in nlp.pipe(df['species'].astype('unicode').values, batch_size=50,
                    n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want the lists of parsed results to have the same number of
        # entries as the original DataFrame, so add blanks if a parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['species_tokens'] = tokens
df['species_lemma'] = lemma
df['species_pos'] = pos

This approach works fine on small datasets, but it eats up your memory, so it's not great if you want to process huge amounts of text.
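
If memory is the constraint, one workaround (not from the original answer, just a sketch) is to stream the column through nlp.pipe in chunks and append each chunk's results to a file on disk, so only one chunk of parses is ever held in memory. The chunk size and the output filename below are arbitrary placeholders:

import pandas as pd

chunk_size = 1000  # arbitrary; tune to your available memory

# hypothetical output file; each chunk of results is appended as it is
# parsed, so memory use stays roughly constant
with open('species_parsed.csv', 'w') as out:
    for start in range(0, len(df), chunk_size):
        chunk = df['species'].iloc[start:start + chunk_size]
        rows = []
        for doc in nlp.pipe(chunk.astype('unicode').values, batch_size=50):
            rows.append({
                'tokens': ' '.join(t.text for t in doc),
                'lemma': ' '.join(t.lemma_ for t in doc),
                'pos': ' '.join(t.pos_ for t in doc),
            })
        pd.DataFrame(rows, index=chunk.index).to_csv(out, header=(start == 0))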

Ed Rushton
  • so what is the recommended approach for big dataframes? – ℕʘʘḆḽḘ Oct 20 '18 at 01:36
  • If you have a sequential index, I would enumerate the pipe and add the values inplace using `loc` – Mathew Savage Jan 22 '19 at 14:41
  • To avoid using up the cache memory, you could write to a file on disk one line at a time. In later stages of your analysis, you may use those newly created artifacts. – mabounassif Dec 07 '20 at 19:24
  • The `n_threads` parameter of `pipe()` was [deprecated in 2019](https://github.com/explosion/spaCy/issues/2075#issuecomment-465966796). `n_process` would be a reasonable substitute. – Cold Fish Sep 12 '22 at 08:55
  • The `doc.is_parsed` attribute is [deprecated](https://spacy.io/usage/v3#section-incompat) as of spacy v3.0. The documentation suggests using `doc.has_annotation("DEP")` instead. – scign Oct 13 '22 at 17:52
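
Following the deprecation notes in the comments above, a minimal sketch of the same loop on spaCy v3 might look like this (assuming the en_core_web_sm model is installed; n_process=2 is an arbitrary choice):

import spacy

nlp = spacy.load('en_core_web_sm')  # v3 models are named, not just 'en'

tokens, lemma, pos = [], [], []
# n_process replaces the removed n_threads argument
for doc in nlp.pipe(df['species'].astype(str).values, batch_size=50, n_process=2):
    if doc.has_annotation("DEP"):  # replaces the deprecated doc.is_parsed
        tokens.append([t.text for t in doc])
        lemma.append([t.lemma_ for t in doc])
        pos.append([t.pos_ for t in doc])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['species_tokens'] = tokens
df['species_lemma'] = lemma
df['species_pos'] = pos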