As a linguist and a Python beginner, I want to find word collocations in my own (German) tweet corpus. How can I convert the tweets from a pandas DataFrame (just one column, "tweet") into a list of words so that I can use the nltk collocation finder? My version (below) creates a list of letters instead of a list of words, so I only get letter collocations. Any advice would be great!

This is what I have so far:

import pandas as pd
data = pd.read_csv("tweets.csv")

import regex as re
def cleaningTweets(twt):
    twt = re.sub('@[A-ZÜÄÖa-züäöß0-9]+', '', twt)
    twt = re.sub('#', '', twt)
    twt = re.sub(r'https?://\S+', '', twt)
    return twt

df = pd.DataFrame(data)

df.tweet = df.tweet.apply(cleaningTweets)
df.tweet = df.tweet.str.lower()

import nltk
from textblob_de import TextBlobDE as TextBlob
df["tweet_tok"] = df["tweet"].apply(lambda x: " ".join(TextBlob(x).words))

all_words = ' '.join([text for text in df.tweet_tok])
tweettext = nltk.Text(all_words)
  • So, you're saying that `all_words` doesn't have the data you want it to have? Do you want it to be a list of words, but instead it's a string? The `join()` method is usually used to convert a list to a string, so maybe you've got something the wrong way round here? Please give me more information. If possible, try to exclude `nltk` from the problem, and just focus on the actual data manipulation that needs to be done. That way, not only are you helping yourself focus on the actual problem, but also increasing the chance that someone will actually give you a good answer. :) – BubbleMaster Apr 04 '21 at 10:16
  • @BubbleMaster: Thanks for the advice! I thought the pandas data frame would consist of strings (each tweet = a string) and that I would need a list of words to be able to apply the nltk collocation finder. I want all_words to be a list of words. I guess I can't hide that I'm a bit confused about the difference between a data frame, a string and a list. – Forest Runner Apr 04 '21 at 10:30
  • The output of `all_words = ' '.join([text for text in df.tweet_tok])` is a string, whereas you want `all_words` to be a list. You can always display the variable type with `print(type(variable_name))`. That way, you'll begin to familiarize yourself with the output types of various methods. I won't bother explaining the differences between data frames, strings and lists, as I'm sure you can google that yourself. – BubbleMaster Apr 04 '21 at 10:41
  • @BubbleMaster: Thanks for the print statement that displays the variable type. I tried this code and it seems to work: `all_words = ' '.join([i for i in df['tweet_tok']]).split()` Does it make sense? – Forest Runner Apr 04 '21 at 11:46
  • @BubbleMaster: And thanks for your suggestions about the variable types. – Forest Runner Apr 04 '21 at 11:56
  • You're welcome! :) Performing `join()` followed by `split()` makes sense only if the character by which the words were joined isn't the same as the character by which the words were split. Otherwise, it's redundant, as `join()` and `split()` are practically the opposite of one another. Since [`split()`](https://docs.python.org/3/library/stdtypes.html#str.split) splits by spaces by default, I'd say that the statement you wrote doesn't particularly make sense. In other words, the effect of `' '.join().split()` on the `all_words` variable is non-existent, so you might as well not have done it. – BubbleMaster Apr 04 '21 at 14:35
  • @BubbleMaster: I understand your suggestions. But it seems to work. If I leave out the `split()` part, then my output is a string, and when I add `split()` my output is a list. And I could apply the collocation finder to this list. The results even made sense. What would you do differently? – Forest Runner Apr 04 '21 at 17:32
  • If you know how to debug, I would set a breakpoint after the `all_words = ' '.join([i for i in df['tweet_tok']]).split()` line. Otherwise, I'd add a `print(all_words)` after that line, compare the output of `print()` with and without the `all_words` line commented out, and see if there's a difference. – BubbleMaster Apr 04 '21 at 17:40
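
The string-versus-list difference discussed in these comments can be checked with plain Python. This is only a minimal sketch: the two sample tweets are made up and stand in for the values of `df["tweet_tok"]`.

```python
# Hypothetical stand-ins for the tokenised tweets in df["tweet_tok"].
tweets = ["ich mag das meer", "ich auch"]

joined = " ".join(tweets)  # one long string
words = joined.split()     # a list of words, split on whitespace

print(type(joined).__name__)  # str
print(type(words).__name__)   # list

# Iterating over a string yields single characters,
# iterating over the list yields whole words.
first_from_string = list(joined)[:3]
first_from_list = words[:3]
print(first_from_string)
print(first_from_list)

# Adjacent word pairs (bigrams), built without any library.
# Note that pairs can cross tweet boundaries once everything is joined.
bigrams = list(zip(words, words[1:]))
print(bigrams[:2])
```

Anything that iterates over `joined` sees letters, while anything that iterates over `words` sees tokens, which would explain why the original code produced letter collocations.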

1 Answer

If all you are after is a list of words from a sentence, I think you are looking for the .split method on a Python string object. Pandas has a built-in method to apply string splitting to each row in a DataFrame (or Series), and expand out to individual columns if you need it.

For example, try this little piece of code and see if it does what you want:

import pandas as pd
strings_to_split = [
    "i like to be beside the sea",
    "me too"
]
pd.Series(strings_to_split).str.split(expand=True)

A couple of notes:

  • Simply calling .split() splits on whitespace, but you can pass any separator string to split on instead, e.g. .split('a')
  • Per the question in the comments below, pass expand=False (which is also the default) to keep the list in each row instead of expanding out to columns
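
To go from per-row lists to one flat list of words, something along these lines might work. It is a sketch under assumptions: the frame `df`, its `tweet` column and the sample strings are placeholders, and `explode()` requires pandas 0.25 or newer.

```python
import pandas as pd

# Hypothetical one-column frame standing in for the tweet corpus.
df = pd.DataFrame({"tweet": ["i like to be beside the sea", "me too"]})

# Without expand=True, each row holds a list of words.
tokens_per_tweet = df["tweet"].str.split()

# explode() gives every list element its own row;
# tolist() then returns an ordinary Python list.
all_words = tokens_per_tweet.explode().tolist()
print(all_words)
```

A flat list like `all_words` is a sequence of tokens, which is the kind of input that nltk's `BigramCollocationFinder.from_words` is meant to take.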
DaveB
  • Thanks! It creates a data frame in which every word is in a separate column. Do you have a suggestion for how to convert the data frame with only one column (tweets) into a list of words? I think I need to create a new variable to which I can apply the nltk collocation finder. – Forest Runner Apr 04 '21 at 10:58
  • Yes, you should pass `expand=False` to get what you're after :) Each row will now be a list of words. You can rename the column to "tweets" (or whatever) with `.to_frame('tweets')` – DaveB Apr 06 '21 at 09:03