As a linguist and Python beginner, I want to find word collocations in my own (German) tweet corpus. How can I convert the tweets from a pandas DataFrame (just one column = tweet) into a list of words so that I can then use the NLTK collocation finder? My version (below) produces a list of letters instead of a list of words and therefore only gives me letter collocations. Any advice would be great!
This is what I have so far:
import pandas as pd
import regex as re
import nltk

data = pd.read_csv("tweets.csv")
def cleaningTweets(twt):
    twt = re.sub(r'@[A-ZÜÄÖa-züäöß0-9]+', '', twt)  # remove @mentions
    twt = re.sub('#', '', twt)                       # drop the # sign but keep the hashtag word
    twt = re.sub(r'https?://\S+', '', twt)           # remove URLs
    return twt
df = pd.DataFrame(data)
df.tweet = df.tweet.apply(cleaningTweets)  # strip mentions, hashtag signs and URLs
df.tweet = df.tweet.str.lower()            # lowercase everything
from textblob_de import TextBlobDE as TextBlob
df["tweet_tok"] = df["tweet"].apply(lambda x: " ".join(TextBlob(x).words))  # tokenize with TextBlobDE, then join back into one string
all_words = ' '.join(df.tweet_tok)  # one long string containing all tweets
tweettext = nltk.Text(all_words)    # this is where it goes wrong: nltk.Text iterates over the string character by character
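For reference, this is roughly how I understand the NLTK collocation finder is used once I have a flat list of word tokens (adapted from the NLTK collocations how-to; word_list is just a placeholder for the list I am trying to build from the tweet column):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# word_list: the flat list of word tokens I still need to build,
# e.g. ["das", "wetter", "ist", "schön", ...]
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_list)
finder.apply_freq_filter(3)                    # ignore bigrams that occur fewer than 3 times
print(finder.nbest(bigram_measures.pmi, 10))   # ten bigrams with the highest PMI

So my question is really about the step in between: how do I get from the tweet column to that flat word_list (or to per-tweet token lists)?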