TextRank with Scattertext Visualisation

Question

I recently tried to visualize TextRank using code, but I realized that the terms in the graph are not lemmatized. Is there a way to fix the following code so that all words in textrank_df['parse'] are lemmatized? I checked the pipeline components and all the required components are in place ('tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'), so I'm not sure what went wrong.

import pytextrank
import spacy
import scattertext as st
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("textrank", last=True)

convention_df = textrank_df.assign(
    parse=lambda textrank_df: textrank_df['Combined'].apply(nlp),
)

corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='Response Variable',
    parsed_col='parse',
    feats_from_spacy_doc=st.PyTextRankPhrases()).build()

I tried code 1 below, but it raises: AttributeError: module 'pytextrank' has no attribute 'TextRank'. I think it might have something to do with the format of the 'parse' column after this change.

code 1

convention_df = textrank_df.assign(
    parse=lambda textrank_df: textrank_df['Combined'].apply(
        lambda x: [token.lemma_ for token in nlp(x)]),
)

I also tried code 2, which adds use_lemmas=True to PyTextRankPhrases(), but that did not work either. The words are still shown in their original form.

code 2

corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='Response Variable',
    parsed_col='parse',
    feats_from_spacy_doc=st.PyTextRankPhrases(use_lemmas=True)).build()

StackOverflow doesn't have a `scattertext` tag yet, although this would be helpful to add since there are already 18 questions about that (awesome) library. I don't have enough reputation to create tags. — Paco, Jun 27 '23 at 03:52

score 1 · Answer 1 · answered Jun 27 '23 at 03:48

I'm one of the authors of PyTextRank and I've tried out the code shown above.

There are some issues with the usage of scattertext in that example. I don't think the lines

convention_df = textrank_df.assign(
    parse=lambda textrank_df: textrank_df['Combined'].apply(nlp),
)

would work correctly. There's no source text defined, from what I can see, and the textrank_df variable is undefined as far as Python is concerned.

Is this code based on the example in scattertext? https://github.com/JasonKessler/scattertext/blob/master/demo_pytextrank.py

My suggestion would be:

1. Start with a text source which can be used in a simple spaCy pipeline.
2. Get the PyTextRank pipeline for spaCy configured and running the way you want it to work.
3. Then integrate into the declarative pipeline in scattertext and debug that portion.
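
As a rough illustration of the first two steps, here is a minimal sketch that runs PyTextRank on a single string, with no scattertext involved, and prints each ranked phrase next to a lemma-joined version of it. It assumes spaCy 3, PyTextRank 3.x and the en_core_web_sm model are installed. The helper names lemmatized_phrase_text and inspect_phrases are hypothetical, made up for this example; they are not part of either library.

```python
def lemmatized_phrase_text(phrase):
    """Join the lemmas of the tokens in the phrase's first chunk.

    Hypothetical helper: `phrase` is expected to look like a PyTextRank
    phrase, i.e. to have a `chunks` list of spaCy spans.
    """
    # Iterating over a spaCy span yields tokens, each with a `lemma_` string.
    return " ".join(token.lemma_ for token in phrase.chunks[0])


def inspect_phrases(text, model="en_core_web_sm", limit=10):
    """Run spaCy + PyTextRank on one string and print the top phrases."""
    # Imported here so the helper above can be used without loading a model.
    import spacy
    import pytextrank  # noqa: F401  (importing registers the "textrank" factory)

    nlp = spacy.load(model)
    nlp.add_pipe("textrank", last=True)
    doc = nlp(text)

    # Each phrase carries its text, rank, count and the spans it came from.
    for phrase in doc._.phrases[:limit]:
        print(
            f"{phrase.rank:.4f}  {phrase.count:3d}  "
            f"{phrase.text!r} -> {lemmatized_phrase_text(phrase)!r}"
        )
    return doc
```

Calling inspect_phrases(...) on one or two strings taken from the 'Combined' column should show whether the phrases and lemmas look right at the spaCy level, before anything is handed to scattertext.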

It might also be good to ask Jason & co. from scattertext what they'd recommend.

Thank you so much for pointing out the incorrect code! To answer your question, textrank_df is my own dataset that I am working on. Yes, the code I showed follows the example you linked; I just changed some parts of it to use the variables in my dataset. — Rachel, Jun 27 '23 at 20:43
