
I have a large Twitter data stream, and I am interested in analyzing the relationships between hashtags in each tweet. For example, if hashtag A and hashtag B appear in the same tweet, I record this as "A-B" together with the timestamp of the tweet.

Sample inputs:

hashtags,  timestamp
A-B,       created_time: 2016-04-07T01:33:19Z
B-C,       created_time: 2016-04-07T03:53:19Z
C,         created_time: 2016-04-08T03:31:19Z
C-A,       created_time: 2016-04-08T04:33:19Z
A-D,       created_time: 2016-04-07T07:33:19Z  # (note: an out-of-order example)
B-D,       created_time: 2016-04-09T09:33:19Z

Note that the stream data might not be ordered by time.

Tasks: 1) Use the stream data to build a graph of hashtags (A, B, C, D, ...) and their relationships with one another. 2) Calculate the average degree of a vertex in the graph and update it each time new stream data arrives (over a one-day sliding window).

The average degree of a vertex is defined as: degree = number of edges / number of nodes. For example, if the current graph is A-B, then the average degree = 1 (edge) / 2 (nodes) = 1/2.

Sample outputs:

1/2,
2/3,
1/2,
1/2,
2/3,
1/2

What is the most efficient Python data structure for storing such timestamped data, so that the average degree of a vertex can be recalculated over a one-day rolling window?

My intuition is to use a dictionary with the hashtag pairs as keys and the created_time values as values. But to maintain a one-day window I would have to re-sort the dictionary by time on every update, which is expensive. Is there a more efficient way to keep the timestamp data ordered as it arrives (with no need to sort)?

I found posts that use a Pandas DataFrame and its rolling functions for similar tasks, but in my case I am looking for the most efficient plain data structure for the job.

Update: after more research, I found this question to be a good match for mine: Ideal data structure with fast lookup, fast update and easy comparison/sorting

The key idea is to use heapq.
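As a rough sketch of that idea (the add_edge helper and its global state below are my own illustration, not from the linked question): keep a min-heap of (timestamp, node, node) entries, treat the newest timestamp seen so far as the leading edge of the window, and evict anything more than a day older. Out-of-order tweets that still fall inside the window are simply pushed onto the heap; parallel edges are counted separately.

import heapq
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=1)
heap = []                    # min-heap of (timestamp, node_a, node_b)
degree = defaultdict(int)    # node -> degree inside the current window
edge_count = 0
newest = datetime.min        # latest timestamp seen so far

def add_edge(ts, a, b):
    """Insert edge a-b seen at time ts, evict expired edges,
    and return the current average degree."""
    global edge_count, newest
    newest = max(newest, ts)
    cutoff = newest - WINDOW
    if ts >= cutoff:                       # skip tweets already too old
        heapq.heappush(heap, (ts, a, b))
        degree[a] += 1
        degree[b] += 1
        edge_count += 1
    while heap and heap[0][0] < cutoff:    # pop edges outside the window
        _, x, y = heapq.heappop(heap)
        for n in (x, y):
            degree[n] -= 1
            if not degree[n]:
                del degree[n]
        edge_count -= 1
    return edge_count / len(degree) if degree else 0.0

Each tweet then costs O(log n) for the push plus amortized O(log n) per eviction, instead of a full re-sort. Note that a single-hashtag row like the bare C above would still need to advance newest and trigger eviction (and add its node, if isolated nodes should count) in order to reproduce the sample outputs exactly.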

  • What will be consuming this data once sorted? One, two, multiple processes? Perhaps a database would be appropriate. – Copy and Paste Jul 06 '16 at 20:32
  • Pandas is very efficient if all your data fits into memory; it's not clear, though, why you need `rolling window` functions. Can you post the desired data set? – MaxU - stand with Ukraine Jul 06 '16 at 20:46
  • @MaxU and Copy and Paste: after sorting, I want to calculate the average degree of the graph within a one-day time window, defined as degree = total edges / total nodes. That is why I mentioned the rolling window functions. – enaJ Jul 06 '16 at 21:03

2 Answers


The tweets can be expected to arrive mostly sorted, so a sequence type maintained by insertion sort is a good way to keep them ordered; add a rolling window to evict the oldest entries once they fall more than 24 hours behind.

For efficient insertions, you'll want a sequence type with better insertion support than list. I'd give blist a try; in fact it provides a sortedlist type, so you could try that out and see what kind of performance it achieves.

This all assumes that your stream doesn't grow too fast to keep a whole day's tweets in memory. If it does, you'll have to delegate to some kind of database.
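A minimal sketch of that approach (assuming blist is installed, and that entries are (timestamp, edge) tuples so they sort by time first):

from blist import sortedlist      # third-party: pip install blist
from datetime import timedelta

window = sortedlist()             # (timestamp, edge) pairs kept in time order

def add(ts, edge, size=timedelta(days=1)):
    window.add((ts, edge))                # near-sorted input lands near the end
    cutoff = window[-1][0] - size         # newest timestamp anchors the window
    i = window.bisect_left((cutoff,))     # first entry still inside the window
    del window[:i]                        # drop older entries in one slice

Since the stream is only slightly out of order, most insertions happen near the tail of the list, a case blist's tree-backed layout handles in O(log n).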

  • You are right that this problem should be solved with a data structure that supports efficient insertion. But after an exhaustive search of the possible data structures that could do the job, I feel Python's heapq is the best one, since it handles inserts and deletes well for stream data. – enaJ Jul 08 '16 at 20:37
  • Then heapq it should be. You'll get no argument from me -- I just mentioned some options I've heard of. – alexis Jul 08 '16 at 21:29

I would use pandas. Here is an example implementation that filters out timestamps older than the window; you would need to copy your data into a DataFrame first.

import datetime
import dateutil.relativedelta
import pandas as pd

days_back = 1
dt_now = datetime.datetime.now(datetime.timezone.utc)
start_date = dt_now - dateutil.relativedelta.relativedelta(days=days_back)

# Assumes df already holds the stream with a 'time_stamp' column; parse
# the ISO strings (e.g. 2016-04-07T01:33:19Z) into UTC datetimes so the
# comparison is chronological rather than lexical.
df['time_stamp'] = pd.to_datetime(df['time_stamp'], utc=True)
df2 = df[df['time_stamp'] > start_date]
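
For completeness, one minimal way to get the sample stream into that shape (the column name time_stamp is just chosen to match the snippet above):

import pandas as pd

# Build the frame from (edge, timestamp) rows as in the question's sample.
rows = [("A-B", "2016-04-07T01:33:19Z"),
        ("B-C", "2016-04-07T03:53:19Z"),
        ("C",   "2016-04-08T03:31:19Z")]
df = pd.DataFrame(rows, columns=["hashtags", "time_stamp"])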