I have an untidy DataFrame of Tweet objects. There are two columns that contain lists: hashtags and expanded_urls. I'm trying to follow tidy data principles by keeping only 1 value at each row/column index.
EDIT: This question was marked as a duplicate of this answer, which simply splits the list into more columns. That doesn't work for me because there could be a variable number of hashtags in 1 tweet.
Here's a sample of my tweet DataFrame:
-----------------------------------------------------------
tweet_id | hashtags | expanded_urls
-----------------------------------------------------------
123 | ['lol','bff'] | ['fakeurl.com']
124 | [] | ['url1.io', 'url2.expanded.co']
There are two possible ways I could go about tidying this data.
1: Simply add new rows to the DataFrame with almost all row contents copied over:
---------------------------------------------
tweet_id | hashtag | expanded_url
---------------------------------------------
123 | 'lol' | 'fakeurl.com'
123 | 'bff' | 'fakeurl.com'
124 | '' | 'url1.io'
124 | '' | 'url2.expanded.co'
I don't think this would be very efficient, especially because there would be many insert/append operations. However, having a single DataFrame to pass into a single scikit-learn model would make things very simple.
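For reference, here is a minimal sketch of what option 1 might look like without manual inserts or appends. It assumes pandas 1.1 or newer (for `DataFrame.explode` with `ignore_index`) and rebuilds the sample data above; the variable names `tweets` and `tidy` are just placeholders. Exploding both list columns in turn gives every hashtag/URL combination per tweet, and an empty list becomes a missing value that is filled with an empty string here.

```python
import pandas as pd

# Sample data rebuilt from the table above.
tweets = pd.DataFrame({
    "tweet_id": [123, 124],
    "hashtags": [["lol", "bff"], []],
    "expanded_urls": [["fakeurl.com"], ["url1.io", "url2.expanded.co"]],
})

tidy = (
    tweets
    # One row per hashtag; an empty list becomes a single NaN row.
    .explode("hashtags", ignore_index=True)
    # One row per URL for each of the rows produced above.
    .explode("expanded_urls", ignore_index=True)
    .rename(columns={"hashtags": "hashtag", "expanded_urls": "expanded_url"})
    # Represent "no hashtag" as an empty string, as in the table above.
    .fillna({"hashtag": ""})
)

print(tidy)
```

Note that the row count grows as the product of the two list lengths per tweet, so this may get large if tweets carry many hashtags and URLs.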
2: Create 2 new DataFrames:
The first would be hashtags with their corresponding tweet_ids:
------------------
tweet_id | hashtag
------------------
123 | `lol`
123 | `bff`
The other would be urls with their corresponding tweet_ids:
------------------
tweet_id | url
------------------
123 | `fakeurl.com`
124 | `url1.io`
124 | `url2.expanded.co`
This seems cleaner, but I'm not entirely sure how I would modify the original DataFrame; would I just drop the corresponding columns and keep 3 separate tables? Is there a good way of merging these 3 DataFrames into 1, or would I have to do a separate lookup every time I wanted to know which hashtags are associated with a tweet?
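To make option 2 concrete, here is a rough sketch of how the three tables could be built and then joined back on demand. It again assumes pandas 1.1 or newer; `explode_column` is a hypothetical helper written for this example, not a pandas function, and the names `core`, `hashtags`, `urls` and `merged` are placeholders.

```python
import pandas as pd

# Sample data rebuilt from the table above.
tweets = pd.DataFrame({
    "tweet_id": [123, 124],
    "hashtags": [["lol", "bff"], []],
    "expanded_urls": [["fakeurl.com"], ["url1.io", "url2.expanded.co"]],
})


def explode_column(df, list_col, value_name):
    """Hypothetical helper: long table of (tweet_id, value), one row per list item."""
    long = df[["tweet_id", list_col]].explode(list_col, ignore_index=True)
    # Tweets with an empty list produce NaN; drop them from the long table.
    long = long.dropna(subset=[list_col])
    return long.rename(columns={list_col: value_name}).reset_index(drop=True)


hashtags = explode_column(tweets, "hashtags", "hashtag")
urls = explode_column(tweets, "expanded_urls", "url")

# The original table with the list columns dropped.
core = tweets.drop(columns=["hashtags", "expanded_urls"])

# Lookup: which hashtags belong to tweet 123?
tags_123 = hashtags.loc[hashtags["tweet_id"] == 123, "hashtag"].tolist()

# Merging all three back into one wide table when a single frame is needed.
merged = (
    core
    .merge(hashtags, on="tweet_id", how="left")
    .merge(urls, on="tweet_id", how="left")
)

print(merged)
```

With left joins, tweets that have no hashtags (or no URLs) are kept with a missing value in that column, so the merged result has the same rows as option 1.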