The problem is to save a dictionary of search results in a way that scales for data analysis. I run 10,000 searches and, based on the results, save a dictionary for every query. Finally, I end up with a dictionary like the following:
{
  'query_1': {'has_result': True (or False),
              'direct_result': True (or False),
              'title': "title_1",
              'summary': "summary_1",
              'infobox': {'header_11': "data_11",
                          'header_12': "data_12",
                          ...
                         }
             },
  'query_2': {'has_result': True (or False),
              'direct_result': True (or False),
              'title': "title_2",
              'summary': "summary_2",
              'infobox': {'header_21': "data_21",
                          'header_22': "data_22",
                          ...
                         }
             },
  ...
}
The problematic part is obviously 'infobox'. I have no idea how many key-value pairs I will get for each 'infobox' (usually not more than 50), and the keys are expected to differ from one infobox to the next.
Right now, the only way I can think of to save the data is as a CSV in long format, one row per infobox key-value pair:
+---------+------------+---------------+---------+-----------+----------------+--------------+
| query | has_result | direct_result | title | summary | infobox_header | infobox_data |
+---------+------------+---------------+---------+-----------+----------------+--------------+
| query_1 | TRUE | TRUE | title_1 | summary_1 | header_1 | data_1 |
| query_1 | TRUE | TRUE | title_1 | summary_1 | header_2 | data_2 |
| query_1 | TRUE | TRUE | title_1 | summary_1 | header_3 | data_3 |
| query_1 | TRUE | TRUE | title_1 | summary_1 | header_4 | data_4 |
| query_1 | TRUE | TRUE | title_1 | summary_1 | header_5 | data_5 |
| query_2 | TRUE | FALSE | title_2 | summary_2 | header_1 | data_1 |
| query_2 | TRUE | FALSE | title_2 | summary_2 | header_2 | data_2 |
| query_2 | TRUE | FALSE | title_2 | summary_2 | header_3 | data_3 |
| query_2 | TRUE | FALSE | title_2 | summary_2 | header_4 | data_4 |
+---------+------------+---------------+---------+-----------+----------------+--------------+
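For reference, here is a minimal sketch of how rows like the ones above could be produced. The dictionary `results`, the output filename `results.csv`, and the two example queries are placeholders for illustration, not my real data:

```python
import csv

# Toy version of the nested dictionary described above (two queries shown).
results = {
    'query_1': {'has_result': True, 'direct_result': True,
                'title': 'title_1', 'summary': 'summary_1',
                'infobox': {'header_1': 'data_1', 'header_2': 'data_2'}},
    'query_2': {'has_result': True, 'direct_result': False,
                'title': 'title_2', 'summary': 'summary_2',
                'infobox': {'header_1': 'data_1'}},
}

# Flatten to one row per infobox key-value pair, repeating the
# per-query fields (title, summary, ...) on every row of that query.
rows = []
for query, info in results.items():
    for header, data in info['infobox'].items():
        rows.append([query, info['has_result'], info['direct_result'],
                     info['title'], info['summary'], header, data])

with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['query', 'has_result', 'direct_result',
                     'title', 'summary', 'infobox_header', 'infobox_data'])
    writer.writerows(rows)
```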
The problem with my solution is that 'title' and 'summary' are strings, and they get duplicated on every infobox row of their query. For 10,000 queries this is not a big deal; I end up with roughly 200,000 rows. But I wonder whether, theoretically, this is the best way to save this dictionary for data-analysis purposes.
What if in the future I use 100,000 or 1,000,000 queries? How would you go about solving this problem? Would you use a different data structure from the beginning, and how would you make it ready for data analysis?