
I am analyzing data from the GDELT database of news documents on Google Cloud. The file contains a date column, a theme code column (one of roughly 300 possible codes), and a frequency value.

Here is my data. The sample data file has approximately 46,000 rows: https://docs.google.com/spreadsheets/d/11oUiznvFTKGAOz1QXavbiWH1sxgCJHbFfysu0F0MdKs/edit?usp=sharing

There are 284 unique themes, listed here:
https://docs.google.com/spreadsheets/d/1gN3Vc5W6rGekF8P_Rp73BL2YaO6WTDVp-DpP0Il22vk/edit?usp=sharing

Within each day, I need to create pairs of themes weighted by the product of their frequencies, then output an adjacency list (theme_A, theme_B, weight) so I can do network analysis on the themes over time. I am stuck at the point of computing the theme co-occurrences.

#Import packages
import pandas as pd
import numpy as np

#Read in data file
df = pd.read_csv(r'C:\Users\james\Desktop\Documents\Downloads\Cybersecurity\cybertime.csv')
df.head()

#Create bigrams of themes by days, based on cooccurrences weighted by frequencies.
#Iterate rows until new date is found, then compute weighted cooccurrences.
#Weights are products of theme A frequency (freq) and theme B frequency.


#Output the adjacency list.

2 Answers


You can try to use a custom function with groupby and apply on a pandas DataFrame (see the pandas groupby documentation); a short sketch of that route follows the one-liner below.

or do:

df.groupby(['date', 'theme'])['frequency'].apply(lambda x : x.astype(int).sum())
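
For the custom-function route, a minimal sketch (reusing the df read in the question; the total_frequency name is illustrative):

def total_frequency(x):
    # Sum the frequency values within one (date, theme) group
    return x.astype(int).sum()

daily_totals = (df.groupby(['date', 'theme'])['frequency']
                  .apply(total_frequency)
                  .reset_index())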
Tal Avissar
  • Thanks! This nicely groups data by date and by theme. But, how do I do the next step -- produce the cooccurrences, which for each date lists: theme_A theme_B cooccurrence frequency? E.G.: cooperate terror 61 – James Danowski Jun 03 '19 at 19:19
  • E.G.: cooperate terror 61 \n cooperate intelligence 14 \n intelligence shortage 77, etc. – James Danowski Jun 03 '19 at 19:28
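
For the per-date pair list asked about in the comments, a minimal sketch could look like this (the column names date, theme, freq and the pair_weights helper are assumptions, and freq is assumed numeric):

from itertools import combinations
import pandas as pd

df = pd.read_csv('cybertime.csv')              # columns assumed: date, theme, freq

def pair_weights(day):
    # Within one day, weight each theme pair by the product of the two frequencies.
    # If a theme can repeat within a day, sum its frequencies first (see the groupby above).
    rows = [(a.theme, b.theme, a.freq * b.freq)
            for a, b in combinations(day.itertuples(index=False), 2)]
    return pd.DataFrame(rows, columns=['theme_A', 'theme_B', 'weight'])

edges = (df.groupby('date')[['theme', 'freq']]
           .apply(pair_weights)
           .reset_index(level=0)                # bring the date back as a column
           .reset_index(drop=True))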

First, you can optionally filter out of the initial CSV file all rows whose themes are not in the GDELT-Global_Knowledge_Graph_CategoryList:

df = pd.read_csv('cybertime.csv')
gdelt = pd.read_csv('GDELT-Global_Knowledge_Graph_CategoryList.csv')
df.drop(df.loc[~df.theme.isin(gdelt.Name)].index, inplace=True)   # optional

Next, you could pivot your dataframe to get a matrix of 30 rows (one per day) and 194 columns (one per theme). If you do not filter, you will get a 30x1028 dataframe.

From that point, you can compute the matrix product of the transposed matrix with the original matrix: it will give you a 194x194 matrix containing the sums of the products of the frequencies for each pair of themes (194x194 becomes 1028x1028 if unfiltered).
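
As a sanity check on that step, a tiny made-up example (two days, three themes) shows how the transposed-by-original product sums the per-day frequency products:

import numpy as np

# Made-up frequencies: rows are days, columns are themes A, B, C
X = np.array([[2, 3, 0],      # day 1
              [1, 0, 4]])     # day 2

C = X.T @ X                   # C[i, j] = sum over days of freq_i * freq_j
print(C[0, 1])                # 2*3 + 1*0 = 6, the A-B weight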

You then just have to unpivot (melt) that matrix to get your adjacency list.

Code could be:

# Pivot: one row per day, one column per theme, frequencies as values
df2 = df.pivot(index='date', columns='theme', values='freq').fillna(0)

# Theme-by-theme co-occurrence matrix: sums over days of the products of frequencies
df3 = pd.DataFrame(np.transpose(df2.values) @ df2.values,
                   index=df2.columns, columns=df2.columns)

# Melt back to a long adjacency list: theme_A, theme_B, weight
df4 = df3.rename_axis('theme_A').reset_index().melt(
    id_vars=['theme_A'], var_name='theme_B', value_name='weight')
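
Since df3 is symmetric and its diagonal holds each theme's self-products, df4 lists every pair twice plus self-loops. If you want a single undirected edge per pair, a possible final step (the output file name is just an example):

df5 = df4[df4.theme_A < df4.theme_B]               # one orientation per pair, no self-loops
df5 = df5[df5.weight > 0]                          # optionally drop pairs that never co-occur
df5.to_csv('theme_adjacency.csv', index=False)     # adjacency list: theme_A, theme_B, weight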
Serge Ballesta