I am analyzing data from the GDELT database of news documents on Google Cloud. The file contains three columns: a date, a theme code (one of roughly 300), and a frequency value.
Here is my data. The sample data file has approximately 46,000 rows: https://docs.google.com/spreadsheets/d/11oUiznvFTKGAOz1QXavbiWH1sxgCJHbFfysu0F0MdKs/edit?usp=sharing
There are 284 unique themes, listed here:
https://docs.google.com/spreadsheets/d/1gN3Vc5W6rGekF8P_Rp73BL2YaO6WTDVp-DpP0Il22vk/edit?usp=sharing
Within each day, I need to create all pairs of themes, with each pair weighted by the product of the two themes' frequencies. For example, if on one day theme X has frequency 2 and theme Y has frequency 3, the pair (X, Y) gets weight 6. I then need to output an adjacency list (theme_A, theme_B, weight) so I can do network analysis on the themes over time. I am stuck at computing the theme co-occurrences.
#Import packages
import pandas as pd
import numpy as np
#Read in data file
df = pd.read_csv(r'C:\Users\james\Desktop\Documents\Downloads\Cybersecurity\cybertime.csv')
df.head()
#Create pairs of themes within each day, weighted by cooccurrence frequencies.
#Plan: iterate rows until a new date is found, then compute weighted cooccurrences,
#where each weight is the product of theme A's frequency (freq) and theme B's frequency.
#Finally, output the adjacency list: theme_A, theme_B, weight.
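Here is a minimal sketch of the co-occurrence step I have in mind, grouping by day and pairing themes with `itertools.combinations`. Note that the column names `date`, `theme`, and `freq` are assumptions; they would need to be adjusted to match the actual file.

```python
from itertools import combinations

import pandas as pd

def daily_cooccurrences(df):
    """Return an adjacency list (date, theme_A, theme_B, weight),
    where weight is the product of the two themes' daily frequencies."""
    rows = []
    # Group rows by day, then pair every theme with every other theme that day.
    for date, day in df.groupby("date"):
        recs = day[["theme", "freq"]].itertuples(index=False)
        for (t_a, f_a), (t_b, f_b) in combinations(recs, 2):
            rows.append((date, t_a, t_b, f_a * f_b))
    return pd.DataFrame(rows, columns=["date", "theme_A", "theme_B", "weight"])

# Toy example with three themes on a single day:
toy = pd.DataFrame({
    "date": ["2021-01-01"] * 3,
    "theme": ["CYBER", "HACK", "FRAUD"],
    "freq": [2, 3, 5],
})
edges = daily_cooccurrences(toy)
print(edges)
```

With three themes in one day this yields the three pairs (CYBER, HACK), (CYBER, FRAUD), and (HACK, FRAUD) with weights 6, 10, and 15. On ~46,000 rows this pure-Python loop should still be manageable, since the number of pairs per day is at most 284 choose 2.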