Use Spacy with Pandas

Question

I'm trying to build a multi-class text classifier using spaCy. I have built the model, but I'm having trouble applying it to my full dataset. The model I have built so far is in the screenshot:

Screenshot

Below is the code I used to apply the model to my full dataset using Pandas:


Messages = pd.read_csv('Messages.csv', encoding='cp1252')
    
Messages['Body'] = Messages['Body'].astype(str)

Messages['NLP_Result'] = nlp(Messages['Body'])._.cats

But it gives me the error:

ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>

The reason I wanted to use Pandas in this case is the dataset has 2 columns: ID and Body. I want to apply the NLP model only to the Body column, but I want the final dataset to have 3 columns: ID, Body and the NLP result like in the screenshot above.

Thanks so much

I tried Pandas apply method too, but had no luck. Code used:

Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats

The error I got: AttributeError: 'Series' object has no attribute '_'

Expectation is to generate 3 columns as described above

`I tried Pandas apply method too, but had no luck` The first suggestion I would make is apply(), so it would be helpful to know why it didn't work, and what specific code you used. If the issue is that you need to return multiple columns, you can see how to do that here: https://stackoverflow.com/questions/23586510/return-multiple-columns-from-pandas-apply — Nick ODell, Dec 02 '22 at 03:51
When using the apply method, I had the code written as follows: `Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats`. I get the error: `AttributeError: 'Series' object has no attribute '_'`. I think the `._.cats` part is essential, as it's the text categorizer, and when using apply I'm not sure where it should go. — Sang, Dec 02 '22 at 04:10
What if you tried something like `Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)` ? — Nick ODell, Dec 02 '22 at 04:27

Wiktor Stribiżew · Accepted Answer · 2022-12-02T08:55:38.283

You should pass a callable to Series.apply:

Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)

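Here, each value in the Body column is passed to the lambda as x, one at a time.

nlp(x) then creates a spaCy Doc object for that single string, and nlp(x)._.cats returns the category scores stored on that Doc. The returned values are collected into the new NLP_Result column. Your original attempt failed because Messages['Body'].apply(nlp) returns a Series of Doc objects, and ._.cats exists on each Doc, not on the Series.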
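To see the mechanics without loading a model, here is a minimal sketch. `fake_nlp` is a hypothetical stand-in (not spaCy) that accepts a single string and returns an object exposing `._.cats`, the way a Doc would; the sample rows and the keyword rule are made up for illustration.

```python
from types import SimpleNamespace

import pandas as pd


def fake_nlp(text):
    """Hypothetical stand-in for a spaCy pipeline (an assumption, not spaCy).

    Accepts one string and returns an object with scores under `._.cats`.
    """
    if not isinstance(text, str):
        raise ValueError(f"Expected a string, but got: {type(text)}")
    # Made-up scoring rule, only so the example produces some output
    score = 1.0 if "parcel" in text.lower() else 0.0
    cats = {"Deliveries": score, "Not_Spam": 1.0 - score}
    return SimpleNamespace(_=SimpleNamespace(cats=cats))


messages = pd.DataFrame(
    {"ID": [1, 2], "Body": ["Your parcel has shipped", "Lunch tomorrow?"]}
)

# The lambda runs once per cell, so `._.cats` is read from each
# per-row result rather than from the Series as a whole.
messages["NLP_Result"] = messages["Body"].apply(lambda x: fake_nlp(x)._.cats)

print(messages.columns.tolist())
```

The resulting frame keeps ID and Body and gains an NLP_Result column holding one dict of scores per row. Swapping `fake_nlp` for a real `nlp` pipeline should leave the `apply` line unchanged.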

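import spacy
import classy_classification  # importing registers the pipeline component with spaCy
import csv
import pandas as pd

# Training examples: one text per line in each file
with open('Deliveries.txt', 'r') as d:
    Deliveries = d.read().splitlines()
with open("Not Spam.txt", "r") as n:
    Not_Spam = n.read().splitlines()

# Map each category label to its list of example texts
data = {}
data["Deliveries"] = Deliveries
data["Not_Spam"] = Not_Spam

# NLP model
nlp = spacy.blank("en")
nlp.add_pipe("text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

# Load the full dataset (as in the question) and make sure Body holds strings
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)

# Run the model on each Body value and keep the category scores
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)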
