I'm trying to do a textual analysis and have collected my data into a CSV file with three columns. I want to combine all the text from the second column into a single string to perform some word analysis (word cloud, frequency, etc.). I've imported the CSV file using pandas. In the code below, data is a DataFrame object.
# Extract words from comment column in data
words = " "
for msg in data["comment"]:
    msg = str(msg).lower()
    words = words + msg + " "
print("Length of words is:", len(words))
The resulting string is then passed to word_cloud:
wordcloud = WordCloud(
    width=3000,
    height=2000,
    random_state=1,
    collocations=False,
    stopwords=stopwordsTerrier.union(stopwordsExtra),
).generate(words)
CSV File
rating, comment, ID
5, It’s just soooo delicious but silly price and postage price, XXX1
5, Love this salad dressing... One my kids will estv😊, XXX2
...
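A minimal sketch of how a sample like the one above might be loaded, for anyone wanting to reproduce the setup. The in-memory sample_csv is a hypothetical stand-in for the real file, and skipinitialspace=True is an assumption based on the blank after each comma in the sample; without it the column would presumably be named " comment" rather than "comment".

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the real CSV file.
sample_csv = io.StringIO(
    "rating, comment, ID\n"
    "5, It’s just soooo delicious but silly price and postage price, XXX1\n"
    "5, Love this salad dressing... One my kids will estv😊, XXX2\n"
)

# skipinitialspace=True drops the blank that follows each comma,
# so the header is read as "comment" instead of " comment".
data = pd.read_csv(sample_csv, skipinitialspace=True)

print(data.columns.tolist())
print(len(data), "rows")
```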
The code works fine for smaller files (under 240 KB), but I am now working with a 50 MB file (179,697 rows) and the script has slowed down a lot. I'm not sure it will even finish computing. I am fairly sure this code is the bottleneck because I'm running the script in a Jupyter notebook and it is the only code in the cell I am executing.
My question is: Is there a more efficient way of doing this?
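For comparison, here is a minimal sketch of one commonly suggested alternative: building the string with a single " ".join call instead of repeated concatenation, which can end up copying the growing string on every iteration. The small DataFrame is a hypothetical stand-in for the real data, and this may not be the whole story, since WordCloud still has to process the full text afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from the CSV.
data = pd.DataFrame(
    {"comment": ["It’s just soooo delicious", "Love this salad dressing"]}
)

# Same per-row conversion as the loop (str(msg).lower()), but the pieces
# are joined once at the end instead of being appended one at a time.
words = " ".join(str(msg).lower() for msg in data["comment"])

print("Length of words is:", len(words))
```

Unlike the loop, this version has no leading or trailing space, which should not matter for word counting.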