I want to extract the subject of a spam email from a JSON file, but the subject could be anywhere in the file: within 'content', 'header', or 'body'. Using a regex, I am unable to extract the subject with the code below. Could someone point out what is incorrect in the regex or the code?
import re
import json
with open("test.json", 'r') as fp:
    json_decode = json.loads(fp.read())
    p = re.compile('([\[\(] *)?.*(RE?S?|FWD?|re\[\d+\]?) *([-:;)\]][ :;\])-]*|$)|\]+ *$', re.IGNORECASE)
    for line in json_decode:
        print(p.sub('', line).strip())
Output (incorrect): body
My test.json file is this:
{'attachment': [{'content_header': {'content-disposition': ['attachment; '
'filename="image006.jpg"'],
'content-id': ['<image006.jpg@01D35D21.756FEE10>']
'body': [{'content': ' \n'
' \n'
'From: eCard Delivery [mailto:ecards@789greeting.com] \n'
'Sent: Monday, November 13, 2017 9:14 AM\n'
'To: Zhang, Jerry (352A-Affiliate) '
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'ecard delivery!\n'
' \n'
' \tDear Jerry,\n'
'header': {'date': '2017-11-14T08:20:42-08:00',
'header': {'accept-language': ['en-US'],
'content-language': ['en-US'],
'content-type': ['multipart/mixed; '
'boundary="--boundary-LibPST-iamunique-1500317751_-_-"'],
'date': ['Tue, 14 Nov 2017 08:20:42 -0800']
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
'ecard delivery!'}}
^ The above shows the format of the JSON file.
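
For reference, here is a minimal sketch of one possible approach, not a definitive fix. Iterating over a decoded JSON object with a for loop yields only its top-level keys, which is consistent with output like "body". The sketch instead walks the decoded structure recursively, picks up any 'subject' key and any "Subject:" text inside string values, and then strips leading RE:/FW: prefixes. It assumes the file is valid JSON that decodes to nested dicts and lists shaped like the dump above. The names find_subjects, clean_subject, SUBJECT_TEXT, and PREFIX are hypothetical, and the sample data is a shortened stand-in for the real file.

```python
import re

# Assumed pattern: captures the text after "Subject:" up to the end of that line.
SUBJECT_TEXT = re.compile(r'\bSubject:[ \t]*([^\r\n]+)', re.IGNORECASE)
# Assumed pattern: one or more leading reply/forward prefixes such as
# "RE:", "FW:", "Fwd:", or "Re[2]:".
PREFIX = re.compile(r'^\s*(?:(?:RE|FWD?)(?:\[\d+\])?\s*[:;]\s*)+', re.IGNORECASE)


def find_subjects(node):
    """Recursively yield raw subject strings from nested dicts, lists, and strings."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == 'subject' and isinstance(value, str):
                yield value  # a dedicated 'subject' field
            else:
                yield from find_subjects(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_subjects(item)
    elif isinstance(node, str):
        # Free text (e.g. a quoted message body) may embed "Subject: ...".
        for match in SUBJECT_TEXT.finditer(node):
            yield match.group(1)
    # Numbers, booleans, and None carry no subject and are ignored.


def clean_subject(subject):
    """Remove leading reply/forward prefixes and surrounding whitespace."""
    return PREFIX.sub('', subject).strip()


# Shortened sample that mimics the structure shown above (not the real file).
sample = {
    'body': [{'content': 'From: eCard Delivery\n'
                         'To: Zhang, Jerry Subject: Warmest Wishes! ecard delivery!\n'
                         ' \tDear Jerry,\n'}],
    'header': {'header': {'date': ['Tue, 14 Nov 2017 08:20:42 -0800']},
               'subject': 'FW: Warmest Wishes! ecard delivery!'},
}

subjects = [clean_subject(s) for s in find_subjects(sample)]
for s in subjects:
    print(s)
```

With a real file, the sample dict would be replaced by the result of json.load(fp). Separating the search (find_subjects) from the prefix cleanup (clean_subject) keeps each regex small and easier to test than a single combined pattern.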