I am pretty new to bytes and pandas.

I have the data shown below, but I am not sure how to convert it into a DataFrame.

data=[b"{'metricValue': 5.0, 'appMetadata':{'index': 'cfs_planum_metrics_debug_86188', 'host': 'iaasn00041949', 'job': 'splunk_scraper'}, 'timestampEpochSecond': 1544651948897, 'metricName': 'splunk_logs_tstats_count_per_min', 'metricType': 'count', 'metricTags': {'source': '/opt/splunk/etc/apps/PlanumComputeMetrics/bin/logs/DECOInstance2.log', 'query_timestamp': '2018-12-12T16:43:40.000-05:00'}}", b"{'metricValue': 4.0, 'appMetadata': {'index': 'cfs_digital_88082', 'host': 'dgt01p01tx5l046', 'job': 'splunk_scraper'}, 'timestampEpochSecond': 1544651948462, 'metricName': 'splunk_logs_tstats_count_per_min', 'metricType': 'count', 'metricTags': {'source': '/logs/apache24inst0/httpds0_access.log', 'query_timestamp': '2018-12-12T16:43:50.000-05:00'}}"] 

Thank you for the help

alecxe
david

1 Answer

Given the structure of your data

{
    'metricValue': 5.0,
    'appMetadata': {
        'index': 'cfs_planum_metrics_debug_86188',
        'host': 'iaasn00041949',
        'job': 'splunk_scraper'
    },
    'timestampEpochSecond': 1544651948897,
    'metricName': 'splunk_logs_tstats_count_per_min',
    'metricType': 'count',
    'metricTags': {
        'source': '/opt/splunk/etc/apps/PlanumComputeMetrics/bin/logs/DECOInstance2.log',
        'query_timestamp': '2018-12-12T16:43:40.000-05:00'
    }
}, {
    'metricValue': 4.0,
    'appMetadata': {
        'index': 'cfs_digital_88082',
        'host': 'dgt01p01tx5l046',
        'job': 'splunk_scraper'
    },
    'timestampEpochSecond': 1544651948462,
    'metricName': 'splunk_logs_tstats_count_per_min',
    'metricType': 'count',
    'metricTags': {
        'source': '/logs/apache24inst0/httpds0_access.log',
        'query_timestamp': '2018-12-12T16:43:50.000-05:00'
    }
}

you can

from ast import literal_eval
import pandas as pd

# Convert your data from a list of bytes into a list of strings
list_of_string = [d.decode('utf-8') for d in data]

# Parse the list of strings into a list of dictionaries
# (literal_eval safely evaluates Python literals such as these single-quoted dicts)
list_of_dicts = [literal_eval(s) for s in list_of_string]

# Convert the list to a DataFrame
df = pd.DataFrame(list_of_dicts)

# Convert appMetadata to a DataFrame
app_metadata = pd.concat(df['appMetadata']
                         .apply(pd.DataFrame.from_dict, orient='index')
                         .apply(lambda x: x.T)
                         .to_dict()).reset_index(level=1, drop=True)

# Convert metricTags to a DataFrame
metric_tags = pd.concat(df['metricTags']
                        .apply(pd.DataFrame.from_dict, orient='index')
                        .apply(lambda x: x.T)
                        .to_dict()).reset_index(level=1, drop=True)

# Join everything back to the original DataFrame
df = df.join(app_metadata).drop('appMetadata', axis=1)
df = df.join(metric_tags).drop('metricTags', axis=1)

or, alternatively

# Flatten the dictionaries
def dict_flatten(d):
    for key in d:
        val = d[key]
        if isinstance(val, dict):
            for sub_key in val:
                yield sub_key, val[sub_key]
        else:
            yield key, val

flat_dicts = list(map(dict, map(dict_flatten, list_of_dicts)))

# Convert the list of flattened dictionaries to a DataFrame
df = pd.DataFrame(flat_dicts)

both resulting in (up to the order of the columns)

                         metricName metricType  metricValue  timestampEpochSecond       ...                      query_timestamp                           index             host             job
0  splunk_logs_tstats_count_per_min      count          5.0         1544651948897       ...        2018-12-12T16:43:40.000-05:00  cfs_planum_metrics_debug_86188    iaasn00041949  splunk_scraper
1  splunk_logs_tstats_count_per_min      count          4.0         1544651948462       ...        2018-12-12T16:43:50.000-05:00               cfs_digital_88082  dgt01p01tx5l046  splunk_scraper
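
A third option, separate from the two approaches above, is pandas' own json_normalize, assuming pandas 1.0 or later where it is exposed as pd.json_normalize. It flattens nested dictionaries in one call, but it prefixes each nested column with its parent key (for example appMetadata.host), so the column names differ from the output shown above. The sketch below uses two shortened, made-up records in the same shape as your data; literal_eval is still needed because the payload uses single quotes and so is not valid JSON.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical, shortened records in the same shape as the question's data
data = [
    b"{'metricValue': 5.0, 'appMetadata': {'index': 'a', 'host': 'h1'}, 'metricTags': {'source': '/x.log'}}",
    b"{'metricValue': 4.0, 'appMetadata': {'index': 'b', 'host': 'h2'}, 'metricTags': {'source': '/y.log'}}",
]

# Decode each bytes object and parse it as a Python literal
records = [literal_eval(raw.decode('utf-8')) for raw in data]

# Flatten nested dicts; nested keys become 'parent.child' column names
df_flat = pd.json_normalize(records)

print(df_flat.columns.tolist())
```

If you prefer the short column names, you could rename the columns afterwards by dropping the prefix, provided no two nested dictionaries share a key.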
ayorgo
  • Thank you Ayorgo, I really appreciate it. There is a "b" in front of my data; I am not sure whether it will affect the code you provided. – david Dec 13 '18 at 19:16
  • No worries, @david. Yes, I'm aware of the `b`, that shouldn't affect the solution. Just wanted to pretty print it so the structure is apparent. – ayorgo Dec 13 '18 at 19:17
  • Hi Ayorgo, thank you very much. My data has 500+ values, and when I type df.head() I get an error: object of type 'float' has no len() – david Dec 13 '18 at 20:05
  • It works, you are right, Ayorgo. I really appreciate your help. – david Dec 13 '18 at 20:58