Converting 'CategorizedPlaintextCorpusReader' into dataframe

Question

I want to convert the movie_reviews dataset from nltk.corpus into a pandas DataFrame so that I can use it for sentiment analysis. When I pass the corpus to pandas directly, I get an error:

    from nltk.corpus import movie_reviews
    import pandas as pd

    mr=movie_reviews
    movie=pd.DataFrame(mr)

ValueError: DataFrame constructor not properly called!

@alvas, now that you've shown how to do it, maybe you should now remove your "it's not possible" claim... — alexis, Sep 08 '17 at 09:42
Ah, it should be "I don't think it's possible to simply initialize it that way" =) — alvas, Sep 08 '17 at 09:54
"I don't think it's possible to simply initialize it that way". An NLTK `CategorizedPlaintextCorpusReader` object isn't a `dtype` for `pandas`. — alvas, Sep 08 '17 at 09:55

alvas · Accepted Answer · 2017-09-08T09:55:45.410

An NLTK CategorizedPlaintextCorpusReader object isn't a data type that the pandas DataFrame constructor can interpret, so it can't be passed to pd.DataFrame directly.
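As a rough illustration of that point, the sketch below passes an arbitrary object to the constructor and then passes a list of tuples. `NotTabular` is a hypothetical stand-in class, not anything from NLTK, and the sample rows are made up; no corpus download is needed.

```python
import pandas as pd


class NotTabular:
    """Hypothetical stand-in for an object pandas cannot interpret as data."""


message = None
try:
    # Expected to fail: the object is neither list-like, a dict, nor an array.
    pd.DataFrame(NotTabular())
except ValueError as err:
    message = str(err)

# A list of equal-length tuples, by contrast, maps directly onto rows.
rows = [
    ("cv000.txt", "neg", "some text"),
    ("cv001.txt", "pos", "other text"),
]
df = pd.DataFrame(rows, columns=["filename", "tag", "text"])
print(message)
print(df)
```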

That said, you can convert the movie reviews into a list of tuples and then populate a DataFrame from it:

import pandas as pd

from nltk.corpus import movie_reviews as mr

reviews = []
for fileid in mr.fileids():
    tag, filename = fileid.split('/')
    reviews.append((filename, tag, mr.raw(fileid)))

df = pd.DataFrame(reviews, columns=['filename', 'tag', 'text'])

[out]:

>>> df.head()
          filename  tag                                               text
0  cv000_29416.txt  neg  plot : two teen couples go to a church party ,...
1  cv001_19502.txt  neg  the happy bastard's quick movie review \ndamn ...
2  cv002_17424.txt  neg  it is movies like these that make a jaded movi...
3  cv003_12683.txt  neg   " quest for camelot " is warner bros . ' firs...
4  cv004_12641.txt  neg  synopsis : a mentally unstable man undergoing ...
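
The loop above depends on every file ID having the form tag/filename. A sketch of an alternative that asks the reader for each file's category instead follows; `corpus_to_dataframe` is a hypothetical helper name, and the reader is assumed to provide `fileids()`, `categories(fileid)` and `raw(fileid)` the way NLTK's categorized corpus readers do.

```python
import pandas as pd


def corpus_to_dataframe(reader):
    """Build one row per file from a categorized corpus reader.

    Assumes `reader` exposes fileids(), categories(fileid) and raw(fileid).
    """
    rows = []
    for fileid in reader.fileids():
        # categories() returns a list; join it in case a file has several labels.
        tag = ",".join(reader.categories(fileid))
        rows.append((fileid, tag, reader.raw(fileid)))
    return pd.DataFrame(rows, columns=["fileid", "tag", "text"])
```

With the corpus downloaded, `corpus_to_dataframe(movie_reviews)` should give the same tag and text values as above, with the full file ID (for example neg/cv000_29416.txt) in the first column.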

To process the text column, see the question "How to NLTK word_tokenize to a Pandas dataframe for Twitter data?"
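
A minimal sketch of tokenizing such a text column. It assumes `wordpunct_tokenize` as a stand-in tokenizer because it is regex-based and needs no downloaded models; `nltk.word_tokenize` could be substituted if its tokenizer data is installed. The two sample rows and the `tokens` and `n_tokens` column names are illustrative only.

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize  # regex-based, no extra data needed

# Two short sample rows standing in for the 'text' column built above.
df = pd.DataFrame({
    "text": [
        "plot : two teen couples go to a church party ,",
        "it is movies like these",
    ]
})

# Apply the tokenizer to each row; every cell becomes a list of tokens.
df["tokens"] = df["text"].apply(wordpunct_tokenize)

# Token counts are a cheap sanity check before building features.
df["n_tokens"] = df["tokens"].apply(len)
```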

ralhusban · Answer 2 · 2021-03-17T17:11:04.327

Try this simplified approach, shown here with the Reuters corpus:

import pandas as pd
from nltk.corpus import reuters  # Reuters corpus (requires nltk.download('reuters'))

docs = []
for cat in reuters.categories():  # iterate over every category label
    # fileids(cat) lists the documents filed under this category
    for fileid in reuters.fileids(cat):
        # store one (document text, category) tuple per document;
        # a document with several categories appears once per category
        docs.append((reuters.raw(fileid), cat))

# build the data frame from the list of tuples
reuters_df = pd.DataFrame(docs, columns=['document', 'category'])

reuters_df.head()

Apologies for not adding a dataframe head sample as I'm still new to stackoverflow
