3

I have an untidy DataFrame of Tweet objects. There are two columns that contain lists: hashtags and expanded_urls. I'm trying to follow tidy data principles by keeping only 1 value at a row/column index.

EDIT: This question was marked as a duplicate of this answer, which simply splits the list into more columns. That doesn't work for me because there could be a variable number of hashtags in 1 tweet.

Here's a sample of my tweet DataFrame:

-----------------------------------------------------------
tweet_id | hashtags       | expanded_urls
-----------------------------------------------------------
  123    | ['lol','bff']  | ['fakeurl.com']
  124    | []             | ['url1.io', 'url2.expanded.co']

There's two possible ways I could go about tidying this data.

1: Simply add new rows to the DataFrame with almost all row contents copied over:

---------------------------------------------
tweet_id | hashtag   | expanded_url
---------------------------------------------
  123    | 'lol'    | 'fakeurl.com'
  123    | 'bff'    | 'fakeurl.com'
  124    | ''       | 'url1.io'
  124    | ''       | 'url2.expanded.io'

I don't think this would be very efficient, especially because there would be many insert/append operations. However, having a single DataFrame to pass into a single scikit-learn model would make things very simple.

2: Create 2 new DataFrames:

The first would be hashtags with their corresponding tweet_ids:

------------------
tweet_id | hashtag
------------------
123      | `lol`
123      | `bff`

The other would be urls with their corresponding tweet_ids:

------------------
tweet_id | url
------------------
123      | `fakeurl.com`
124      | `url1.io`
124      | `url2.expanded.co`

This seems cleaner, but I'm not entirely sure how I would modify the original DataFrame; would I just drop the corresponding columns and keep 3 separate tables? Is there a good way of merging these 3 DataFrames into 1, or would I have to do a separate lookup every time I wanted to know which hashtags are associated with a tweet?

s_khillon
  • 81
  • 3
  • 7
  • What is your use case? Feeding data to a scikit model for prediction? What sort of processing will you perform on data? For instance, for each tag, would you perform any aggregation? – Espanta Mar 15 '18 at 04:34
  • Possible duplicate of [Pandas split column of lists into multiple columns](https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns) – Vaishali Mar 15 '18 at 04:34

2 Answers2

3

I reassign over df to turn empty lists into lists of a single empty string

Both columns together

from itertools import product

df = df.applymap(lambda x: x if x else [''])

pd.DataFrame([
    [t, h, e]
    for t, h_, e_ in df.values
    for h, e in product(h_, e_)
], columns=df.columns)

   tweet_id hashtags     expanded_urls
0       123      lol       fakeurl.com
1       123      bff       fakeurl.com
2       124                    url1.io
3       124           url2.expanded.co

Or without itertools

df = df.applymap(lambda x: x if x else [''])

pd.DataFrame([
    [t, h, e]
    for t, h_, e_ in df.values
    for h in h_ for e in e_
], columns=df.columns)

   tweet_id hashtags     expanded_urls
0       123      lol       fakeurl.com
1       123      bff       fakeurl.com
2       124                    url1.io
3       124           url2.expanded.co

Separately

pd.DataFrame(dict(
    tweet_id=df.tweet_id.values.repeat(df.hashtags.str.len()),
    hashtags=np.concatenate(df.hashtags.values)
), columns=['tweet_id', 'hashtags'])

   tweet_id hashtags
0       123      lol
1       123      bff

pd.DataFrame(dict(
    tweet_id=df.tweet_id.values.repeat(df.expanded_urls.str.len()),
    expanded_urls=np.concatenate(df.expanded_urls.values)
), columns=['tweet_id', 'expanded_urls'])

   tweet_id     expanded_urls
0       123       fakeurl.com
1       124           url1.io
2       124  url2.expanded.co
piRSquared
  • 285,575
  • 57
  • 475
  • 624
1

Assuming the index is on tweet_id (if not you can with .set_index() method), for approach 2, you can try:

df['hashtags'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('hashtag')

Result:
               hashtag
tweet_id             
123               lol
123               bff

Similarly for expanded_urls:

df['expanded_urls'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('url')

Result:

                  url
tweet_id                  
123            fakeurl.com
124                url1.io
124       url2.expanded.co
niraj
  • 17,498
  • 4
  • 33
  • 48