I have a large Twitter data stream, and I am interested in analyzing the relationships between hashtags within each tweet. For example, if hashtag A and hashtag B appear in the same tweet, I record this tweet as "A-B" together with the timestamp of the tweet.
Sample inputs look like this:
hashtags, timestamp
A-B, created_time: 2016-04-07T01:33:19Z
B-C, created_time: 2016-04-07T03:53:19Z
C, created_time: 2016-04-08T03:31:19Z
C-A, created_time: 2016-04-08T04:33:19Z
A-D, created_time: 2016-04-07T07:33:19Z # (Note: an example of out of order)
B-D, created_time: 2016-04-09T09:33:19Z
Note that the stream data might not be ordered by time.
Tasks:
1) Use the stream data to build a graph of hashtags (A, B, C, D, ...) and their relationships with one another.
2) Calculate the average degree of a vertex in the graph, and update it each time new stream data arrives (over a one-day sliding window).
The average degree of a vertex is defined as: average degree = (number of edges) / (number of nodes). For example, if the current graph is A-B, then the average degree = 1 (edge) / 2 (nodes) = 0.5.
Sample outputs:
1/2,
2/3,
1/2,
1/2,
2/3,
1/2
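For clarity, here is a minimal helper that matches this definition (my own sketch, assuming an undirected graph in which duplicate edges count once):

```python
def average_degree(edges):
    """Average degree = number of distinct edges / number of distinct nodes."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) / len(nodes) if nodes else 0.0

# average_degree({("A", "B")}) == 0.5, i.e. the "1/2" in the sample output
```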
What is the most efficient Python data structure for storing such timestamped data in order to calculate the average degree of a vertex over a one-day rolling window?
My intuition is to use a dictionary that stores the hashtag pairs as keys and the created_time values as values. To maintain the one-day window, I would then have to sort the dictionary entries by timestamp on every update, which takes a lot of time. Is there a more efficient way to keep the timestamp data ordered by time automatically (with no need to sort)?
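To make that concrete, the naive version I have in mind looks roughly like this (a hypothetical sketch; the full sort on every incoming tweet is exactly the part I want to avoid):

```python
from datetime import timedelta

WINDOW = timedelta(days=1)

# edge (a sorted hashtag pair) -> list of created_time values
records = {}

def on_new_tweet(edge, created_time):
    records.setdefault(edge, []).append(created_time)
    # Re-sort every stored timestamp on each update: O(n log n) per tweet.
    entries = sorted((ts, e) for e, times in records.items() for ts in times)
    newest = entries[-1][0]
    live_edges = {e for ts, e in entries if ts > newest - WINDOW}
    nodes = {node for e in live_edges for node in e}
    return len(live_edges) / len(nodes) if nodes else 0.0
```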
I found posts that use a Pandas DataFrame and its rolling functions for similar tasks, but in my case I am looking for the most efficient plain data structure for the job.
Update: After more research, I found a question that is a good match for mine: Ideal data structure with fast lookup, fast update and easy comparison/sorting.
The key idea is to use `heapq`.
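A minimal sketch of that idea as I understand it (the class and method names are my own, not from the linked question; out-of-order tweets are handled by evicting relative to the newest timestamp seen so far):

```python
import heapq
from datetime import datetime, timedelta

WINDOW = timedelta(days=1)

class RollingGraph:
    def __init__(self):
        self.heap = []      # min-heap of (timestamp, edge): oldest entry on top
        self.edges = {}     # edge -> multiplicity within the current window
        self.latest = None  # newest timestamp seen so far

    def add(self, tag_a, tag_b, created_time):
        edge = tuple(sorted((tag_a, tag_b)))  # undirected: A-B == B-A
        heapq.heappush(self.heap, (created_time, edge))
        self.edges[edge] = self.edges.get(edge, 0) + 1
        if self.latest is None or created_time > self.latest:
            self.latest = created_time
        # Pop everything older than one day relative to the newest tweet;
        # no full sort needed, since the heap always exposes the oldest entry.
        while self.heap and self.heap[0][0] <= self.latest - WINDOW:
            _, old_edge = heapq.heappop(self.heap)
            self.edges[old_edge] -= 1
            if self.edges[old_edge] == 0:
                del self.edges[old_edge]

    def average_degree(self):
        nodes = {node for edge in self.edges for node in edge}
        return len(self.edges) / len(nodes) if nodes else 0.0

g = RollingGraph()
g.add("A", "B", datetime(2016, 4, 7, 1, 33, 19))
print(g.average_degree())  # 0.5 -> the "1/2" in the sample output
```

Each tweet is pushed and popped at most once, so maintaining the window costs O(log n) per insert instead of re-sorting the whole store, and a tweet that arrives late but still falls inside the window is handled correctly.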