
I am analyzing data from the GDELT database of news documents on Google Cloud. The file contains a date column, a theme code column (one of roughly 300 possible codes), and a frequency value.

Here is my data. The sample data file has approximately 46,000 rows: https://docs.google.com/spreadsheets/d/11oUiznvFTKGAOz1QXavbiWH1sxgCJHbFfysu0F0MdKs/edit?usp=sharing

There are 284 unique themes, listed here:
https://docs.google.com/spreadsheets/d/1gN3Vc5W6rGekF8P_Rp73BL2YaO6WTDVp-DpP0Il22vk/edit?usp=sharing

Within each day, I need to create pairs of themes weighted by the product of their frequencies, then output an adjacency list (theme_A, theme_B, weight) so I can do network analysis on the themes over time. I am stuck at the point of computing the theme co-occurrences.

#Import packages
import pandas as pd
import numpy as np

#Read in data file
df = pd.read_csv(r'C:\Users\james\Desktop\Documents\Downloads\Cybersecurity\cybertime.csv')
df.head()

#Create bigrams of themes by days, based on cooccurrences weighted by frequencies.
#Iterate rows until new date is found, then compute weighted cooccurrences.
#Weights are products of theme A frequency (freq) and theme B frequency.


#Output the adjacency list.

2 Answers


You can try to use a custom function with groupby and apply on a pandas DataFrame (see the pandas groupby documentation); a short sketch of that route follows the one-liner below.

or do:

df.groupby(['date', 'theme'])['frequency'].apply(lambda x : x.astype(int).sum())
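
For the custom-function route, a minimal sketch (reusing the df read in the question; the total_frequency name is illustrative):

def total_frequency(x):
    # Sum the frequency values within one (date, theme) group
    return x.astype(int).sum()

daily_totals = (df.groupby(['date', 'theme'])['frequency']
                  .apply(total_frequency)
                  .reset_index())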
Tal Avissar
  • Thanks! This nicely groups data by date and by theme. But, how do I do the next step -- produce the cooccurrences, which for each date lists: theme_A theme_B cooccurrence frequency? E.G.: cooperate terror 61 – James Danowski Jun 03 '19 at 19:19
  • E.G.: cooperate terror 61 \n cooperate intelligence 14 \n intelligence shortage 77, etc. – James Danowski Jun 03 '19 at 19:28
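
For the per-date pair list asked about in the comments, a minimal sketch could look like this (the column names date, theme, freq and the pair_weights helper are assumptions, and freq is assumed numeric):

from itertools import combinations
import pandas as pd

df = pd.read_csv('cybertime.csv')              # columns assumed: date, theme, freq

def pair_weights(day):
    # Within one day, weight each theme pair by the product of the two frequencies.
    # If a theme can repeat within a day, sum its frequencies first (see the groupby above).
    rows = [(a.theme, b.theme, a.freq * b.freq)
            for a, b in combinations(day.itertuples(index=False), 2)]
    return pd.DataFrame(rows, columns=['theme_A', 'theme_B', 'weight'])

edges = (df.groupby('date')[['theme', 'freq']]
           .apply(pair_weights)
           .reset_index(level=0)                # bring the date back as a column
           .reset_index(drop=True))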

First, you can optionally filter out of the initial CSV file all rows whose themes are not in the GDELT-Global_Knowledge_Graph_CategoryList:

df = pd.read_csv('cybertime.csv')
gdelt = pd.read_csv('GDELT-Global_Knowledge_Graph_CategoryList.csv')
df.drop(df.loc[~df.theme.isin(gdelt.Name)].index, inplace=True)   # optional

Next, you could pivot your dataframe to get a matrix of 30 rows (one per day) and 194 columns (one per theme). If you do not filter, you will get a 30x1028 dataframe.

From that point, you can compute the matrix product of the transposed matrix with the original matrix: it will give you a 194x194 matrix containing the sums of the products of the frequencies for each pair of themes (194x194 becomes 1028x1028 if unfiltered).
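
As a sanity check on that step, a tiny made-up example (two days, three themes) shows how the transposed-by-original product sums the per-day frequency products:

import numpy as np

# Made-up frequencies: rows are days, columns are themes A, B, C
X = np.array([[2, 3, 0],      # day 1
              [1, 0, 4]])     # day 2

C = X.T @ X                   # C[i, j] = sum over days of freq_i * freq_j
print(C[0, 1])                # 2*3 + 1*0 = 6, the A-B weight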

You then just have to unpivot (melt) that matrix to get your adjacency list.

Code could be:

# Pivot: one row per day, one column per theme, frequencies as values
df2 = df.pivot(index='date', columns='theme', values='freq').fillna(0)

# Theme-by-theme co-occurrence matrix: sums over days of the products of frequencies
df3 = pd.DataFrame(np.transpose(df2.values) @ df2.values,
                   index=df2.columns, columns=df2.columns)

# Melt back to a long adjacency list: theme_A, theme_B, weight
df4 = df3.rename_axis('theme_A').reset_index().melt(
    id_vars=['theme_A'], var_name='theme_B', value_name='weight')
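
Since df3 is symmetric and its diagonal holds each theme's self-products, df4 lists every pair twice plus self-loops. If you want a single undirected edge per pair, a possible final step (the output file name is just an example):

df5 = df4[df4.theme_A < df4.theme_B]               # one orientation per pair, no self-loops
df5 = df5[df5.weight > 0]                          # optionally drop pairs that never co-occur
df5.to_csv('theme_adjacency.csv', index=False)     # adjacency list: theme_A, theme_B, weight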
Serge Ballesta