
I want to extract JSON/dictionary from a log text.

The Sample log text:

2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations', 'CLOSESPIDER_TIMEOUT': '14400', 'FEED_FORMAT': 'geojson', 'LOG_FILE': '/geojson_dumps/21_Jun_2018_07_42_54/logs/coastalfarm.log', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'locations.spiders', 'SPIDER_MODULES': ['locations.spiders'], 'TELNETCONSOLE_ENABLED': '0', 'USER_AGENT': 'Mozilla/5.0'}

2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}

2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)

I have tried (\{.+$\}) as the regex, but it gives me the dict that is on a single line, {'BOT_NAME': 'locations', ..., 'USER_AGENT': 'Mozilla/5.0'}, which is not what I expect.

The JSON/dictionary I want to extract (note: the dictionary would not always have the same keys; they could differ):

{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}
vdkotian
  • If you can extract the correct string from the log, then just parse it using json module, see https://stackoverflow.com/questions/4917006/string-to-dictionary-in-python , you will get dictionary object. – dr_agon Jul 01 '18 at 16:05

2 Answers


Edit: The JSON spans multiple lines, so here's what will do it:

import re

# `log` is the full log text, e.g. log = open('coastalfarm.log').read()

re_str = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[scrapy\.statscollectors\] INFO: Dumping Scrapy stats:\s*(\{.+?\})'
stats_re = re.compile(re_str, re.MULTILINE | re.DOTALL)

for match in stats_re.findall(log):
    print(match)

If you are after only the lines from the statscollector, then this should get you there (assuming it's all on one line too):

^.*?\[scrapy.statscollectors] INFO: Dumping Scrapy stats: (\{.+$\}).*?$
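Once you have the matched string, note that the stats dump is a Python literal, not valid JSON: it uses single quotes and contains datetime.datetime(...) calls, so json.loads() will fail and ast.literal_eval() will choke on the datetime calls. A minimal sketch of one way to turn it into a dict is eval() with a namespace exposing only datetime; this is only safe for logs you trust, since eval runs arbitrary code (the inline sample log here is abbreviated from the question):

```python
import datetime
import re

# Abbreviated sample; in practice `log` would be read from your log file.
log = """2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4}"""

stats_re = re.compile(
    r"\[scrapy\.statscollectors\] INFO: Dumping Scrapy stats:\s*(\{.+?\})",
    re.DOTALL,  # let . match newlines so the multi-line dict is captured
)

raw = stats_re.search(log).group(1)
# eval() with a restricted namespace resolves the datetime.datetime(...)
# calls; ONLY do this with log files you trust.
stats = eval(raw, {"datetime": datetime})
print(stats["item_scraped_count"])  # 4
```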
JohnKlehm
  • Put several lines of the log in pastebin somewhere and I'll take a look. – JohnKlehm Jul 01 '18 at 15:10
  • Remove the `$` within the group, and `?` are not nieeded. The correct expression would be: `^.*\[scrapy\.statscollectors] INFO: Dumping Scrapy stats: (\{.+\}).*$` – dr_agon Jul 01 '18 at 15:58
  • mul_line_json = re.compile('^.*\[scrapy\.statscollectors] INFO: Dumping Scrapy stats: (\{.+\}).*$', re.MULTILINE) re.findall(mul_line_json, data) still no output – vdkotian Jul 01 '18 at 16:18
  • I've edited my answer with code that works with the pastbin. – JohnKlehm Jul 01 '18 at 16:21

Using a JSON tokenizer makes this a very simple and efficient task, as long as you have an anchor to search for in the original document that lets you identify at least the beginning of the JSON blob. This uses json-five to extract JSON from HTML:

import json5.tokenizer

with open('5f32d5b4e2c432f660e1df44.html') as f:
    document = f.read()

search_for = "window.__INITIAL_STATE__="
i = document.index(search_for)
j = i + len(search_for)
extract_from = document[j:]

tokens = json5.tokenizer.tokenize(extract_from)
stack = []      # currently-open braces/brackets
collected = []  # token values accumulated so far
for token in tokens:
    collected.append(token.value)

    if token.type in ('LBRACE', 'LBRACKET'):
        stack.append(token)
    elif token.type in ('RBRACE', 'RBRACKET'):
        stack.pop()

    # The stack is empty once the outermost value is closed (or the
    # first token was a scalar), so stop tokenizing the rest.
    if not stack:
        break

json_blob = ''.join(collected)

Note that this accounts for the JSON being either a complex type (object, list) or a scalar.
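If you'd rather not add a dependency, the same stack-based idea can be sketched with plain string scanning. This is my own helper, not part of any library; unlike the tokenizer version it only handles object/array blobs (not scalars), and it assumes the only braces outside string literals are structural:

```python
def extract_balanced(text):
    """Return the first balanced {...} or [...] blob in text, or None.

    Tracks brace/bracket depth with a stack and skips over quoted
    strings so braces inside them don't confuse the count.
    """
    openers = {'{': '}', '[': ']'}
    start = min((i for i, c in enumerate(text) if c in openers), default=-1)
    if start < 0:
        return None
    stack = []
    i = start
    while i < len(text):
        c = text[i]
        if c in ('"', "'"):  # skip over a string literal
            quote = c
            i += 1
            while i < len(text) and text[i] != quote:
                if text[i] == '\\':
                    i += 1   # skip the escaped character
                i += 1
        elif c in openers:
            stack.append(openers[c])
        elif stack and c == stack[-1]:
            stack.pop()
            if not stack:
                return text[start:i + 1]
        i += 1
    return None

blob = extract_balanced('window.__INITIAL_STATE__={"a": [1, 2], "b": "}"};rest')
print(blob)  # {"a": [1, 2], "b": "}"}
```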

Dustin Oprea