0

The problem is to save a dictionary for data analysis so that it will scale. I am performing 10000 search and based on the results I am saving a dictionary for every query. Finally, I end up with a dictionary like the following:

{
'query_1' : {'has_result': True (or False),
             'direct_result': True (or False),
             'title': "title_1",
             'summary': "summary_1",
             'infobox': {'header_11': "data_11",
                         'header_12': "data_12",
                          .
                          .
                          .
              }
'query_2' : {'has_result': True (or False),
             'direct_result': True (or False),
             'title': "title_2",
             'summary': "summary_2",
             'infobox': {'header_21': "data_21",
                         'header_22': "data_22",
                          .
                          .
                          .
              }
.
.
.
}

The problematic part is obviously 'infobox'. I have no idea how many key-value pair I will get for each 'infobox' (usually not more than 50). And the keys are expected to be different for each infobox.

Right now, I can only think of the following way to save the data as a csv.

+---------+------------+---------------+---------+-----------+----------------+--------------+
|  query  | has_result | direct_result |  title  |  summary  | infobox_header | infobox_data |
+---------+------------+---------------+---------+-----------+----------------+--------------+
| query_1 | TRUE       | TRUE          | title_1 | summary_1 | header_1       | data_1       |
| query_1 | TRUE       | TRUE          | title_1 | summary_1 | header_2       | data_2       |
| query_1 | TRUE       | TRUE          | title_1 | summary_1 | header_3       | data_3       |
| query_1 | TRUE       | TRUE          | title_1 | summary_1 | header_4       | data_4       |
| query_1 | TRUE       | TRUE          | title_1 | summary_1 | header_5       | data_5       |
| query_2 | TRUE       | FALSE         | title_2 | summary_2 | header_1       | data_1       |
| query_2 | TRUE       | FALSE         | title_2 | summary_2 | header_2       | data_2       |
| query_2 | TRUE       | FALSE         | title_2 | summary_2 | header_3       | data_3       |
| query_2 | TRUE       | FALSE         | title_2 | summary_2 | header_4       | data_4       |
+---------+------------+---------------+---------+-----------+----------------+--------------+

The problem with my solution is, 'title' and 'summary' is a string variable. For 10000 queries, this is not a big deal. I end up with roughly 200,000 rows. But I am just thinking whether theoretically, this is the best way to save this dictionary for data analysis purpose.

What if in the future I use 100,000 or 1,000,000 queries? How will you go about solving this problem? Will you use a different data structure from the beginning? and how will you make it ready for data analysis?

Imrul Huda
  • 181
  • 1
  • 6

1 Answers1

0

For data analysis with Python, your best option is likely to use a class. Thankfully, there are 3rd party libraries which provide this functionality, such as Pandas.

The below solution uses @MaxU's explode recipe.

import pandas as pd

# construct dataframe from dictionary of dictionaries, d
df = pd.DataFrame.from_dict(d, orient='index').rename_axis('query').reset_index()

# extract header & data, drop infobox
df['header'] = df['infobox'].map(list)
df['data'] = df['infobox'].map(lambda x: list(x.values()))
df = df.drop('infobox', 1)

# expand dataframe
res = explode(df, ['header', 'data'])

print(res)

     query  has_result  direct_result    title    summary     header     data
0  query_1        True          False  title_1  summary_1  header_11  data_11
1  query_1        True          False  title_1  summary_1  header_12  data_12
2  query_2       False           True  title_2  summary_2  header_21  data_21
3  query_2       False           True  title_2  summary_2  header_22  data_22

Choice of storage is a broad question and depends on your use cases, requirements, existing infrastructure, etc. In general, you may find Pickle and HDF5 adequate; with HDF5 providing portability benefits.

jpp
  • 159,742
  • 34
  • 281
  • 339