
I want to extract the subject of a spam email from a JSON file, but the subject could be anywhere in the file: within the 'content', 'header', or 'body'. Using regex, I am unable to extract the subject with the code below. Could someone point out what is incorrect in the regex or the code?

import re
import json
with open("test.json", 'r') as fp:
    json_decode = json.loads(fp.read())


p = re.compile('([\[\(] *)?.*(RE?S?|FWD?|re\[\d+\]?) *([-:;)\]][ :;\])-]*|$)|\]+ *$', re.IGNORECASE)
for line in json_decode:
    print(p.sub('', line).strip())

Output (incorrect): body

My test.json file is this:

    {'attachment': [{'content_header': {'content-disposition': ['attachment; '
                                                        'filename="image006.jpg"'],
                                'content-id': ['<image006.jpg@01D35D21.756FEE10>']
     'body': [{'content': ' \n'
                  ' \n'
                  'From: eCard Delivery [mailto:ecards@789greeting.com] \n'
                  'Sent: Monday, November 13, 2017 9:14 AM\n'
                  'To: Zhang, Jerry (352A-Affiliate) '

                  'Subject: Warmest Wishes! You have a Happy Thanksgiving '
                  'ecard delivery!\n'
                  ' \n'
                  ' \tDear Jerry,\n'
     'header': {'date': '2017-11-14T08:20:42-08:00',

        'header': {'accept-language': ['en-US'],
                   'content-language': ['en-US'],
                   'content-type': ['multipart/mixed; '
                                    'boundary="--boundary-LibPST-iamunique-1500317751_-_-"'],
                   'date': ['Tue, 14 Nov 2017 08:20:42 -0800']
                   'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
                   'ecard delivery!'}}

^ Above is the actual format of the JSON file.

py_noob
  • The contents of the `test.json` shown in your question isn't in valid JSON syntax — so I doubt that's actually what's in it. – martineau Mar 14 '19 at 17:33
  • I removed some text as I wasn't supposed to share specific emails and names but the format is unchanged. – py_noob Mar 14 '19 at 17:39
  • @martineau added the valid json file the way it is. – py_noob Mar 14 '19 at 17:52
  • If that's the contents of your file, then the `json.loads()` would fail. – martineau Mar 14 '19 at 19:52
  • Sooo, tbh, you approached it going wayyyy left. If you're trying to find things related to `s/Subject` have `ubject` somewhere in the regex. You can also use something like `'([\'|\"][\S\s]+?[\'|\"])(?=\s|$)'` to capture things inside of the quotes. I give a more precise solution below – FailSafe Mar 14 '19 at 23:29
  • Something like this should work and is pretty short and to the point. `([\'|\"]*[\S]ubject[\S\s]+?[\'|\"]*)(?=\n|$)` – FailSafe Mar 14 '19 at 23:44

1 Answer


Alrighty - so now, given that your original JSON file may not contain newline characters, I'm hoping this works, and it may even be more accurate:

>>> string = '''{'attachment': [{'content_header': {'content-disposition': ['attachment; ''filename="image006.jpg"'],'content-id': ['<image006.jpg@01D35D21.756FEE10>'] 'body': [{'content': ' '' ''From: eCard Delivery [mailto:ecards@789greeting.com] ''Sent: Monday, November 13, 2017 9:14 AM''To: Zhang, Jerry (352A-Affiliate) ''Subject: Warmest Wishes! You have a Happy Thanksgiving ''ecard delivery!'' ''   Dear Jerry,' 'header': {'date': '2017-11-14T08:20:42-08:00','header': {'accept-language': ['en-US'], 'content-language': ['en-US'], 'content-type': ['multipart/mixed; ''boundary="--boundary-LibPST-iamunique-1500317751_-_-"'], 'date': ['Tue, 14 Nov 2017 08:20:42 -0800'] 'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving ' 'ecard delivery!'}}'''

>>> subjects_test = re.findall('([\'|\"]*[\S]ubject[\S\s]+?[\'|\"]+)(?=\n|$|\s|\})', string)


>>> for subject in subjects_test:
        print(subject)



#OUTPUT: #Kind of off I guess, but I don't know the full format of the file so this is the safest bet

''Subject: Warmest Wishes! You have a Happy Thanksgiving ''ecard delivery!''
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '

Edit - Given your comment below, here is the same approach applied to the string you supplied above. Hopefully I'm understanding your requirements. I use both regex samples I provided.

>>> string = '''{'attachment': [{'content_header': {'content-disposition': ['attachment; '
                                                    'filename="image006.jpg"'],
                            'content-id': ['<image006.jpg@01D35D21.756FEE10>']
 'body': [{'content': ' \n'
              ' \n'
              'From: eCard Delivery [mailto:ecards@789greeting.com] \n'
              'Sent: Monday, November 13, 2017 9:14 AM\n'
              'To: Zhang, Jerry (352A-Affiliate) '

              'Subject: Warmest Wishes! You have a Happy Thanksgiving '
              'ecard delivery!\n'
              ' \n'
              ' \tDear Jerry,\n'
 'header': {'date': '2017-11-14T08:20:42-08:00',

    'header': {'accept-language': ['en-US'],
               'content-language': ['en-US'],
               'content-type': ['multipart/mixed; '
                                'boundary="--boundary-LibPST-iamunique-1500317751_-_-"'],
               'date': ['Tue, 14 Nov 2017 08:20:42 -0800']
               'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
               'ecard delivery!'}}'''



>>> subjects_test_1 = re.findall('([\'\"]*[S|s]ubject[:\s]*?(?:[\'|\"]*[\S\s]*?(?=[\'|\"])*))(?=\n|$)', string)


>>> for subject in subjects_test_1:
        print(subject)

#OUTPUT:
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '


########################################################

>>> subjects_test_2 = re.findall('([\'|\"]*[\S]ubject[\S\s]+?[\'|\"]*)(?=\n|$)', string)


>>> for subject in subjects_test_2:
        print(subject)

#OUTPUT:
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '


Or try this function:

For the line where you call the function, replace 'PATH_TO_YOUR_FILE' with... you know, the path to your file ...

>>> import re
>>> def email_subject_parse(file_path):
        email_subjects = []
        try:
            with open(file_path) as file:
                string = file.read()
        except OSError:
            #The file is missing or unreadable; return an empty list so the loop below still works
            print('You have likely provided a bad file path')
            return email_subjects
        #Raw string (r'...') so the backslashes reach the regex engine unchanged
        email_subjects = re.findall(r'([\'\"]*[S|s]ubject[:\s]*?(?:[\'|\"]*[\S\s]*?(?=[\'|\"])*))(?=\n|$)', string)
        #Or less complicated
        #email_subjects = re.findall(r'([\'|\"]*[\S]ubject[\S\s]+?[\'|\"]*)(?=\n|$)', string)
        return email_subjects


>>> subjects = email_subject_parse('PATH_TO_YOUR_FILE')


>>> for subject in subjects:
        print(subject)



#OUTPUT:
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
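
If the file does parse with `json.loads()`, another option is to skip regex over the raw text and walk the parsed structure instead. Note that looping over the top-level dict, as in the question, only yields its keys, which would explain an output like `body`. Below is a minimal sketch, assuming a layout like the one in the question: a 'subject' key somewhere in the headers, and possibly a 'Subject:' line inside body text. The helper name `find_subjects` and the sample data are made up for illustration, and a subject that wraps onto a second line in the body would only be captured up to the first line break.

```python
import json
import re

# Matches a single "Subject: ..." line inside free text such as a forwarded body.
SUBJECT_LINE = re.compile(r"^[ \t]*Subject:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)

def find_subjects(node):
    """Recursively collect subjects from parsed JSON (hypothetical helper)."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == "subject":
                # Header values may be a plain string or a list of strings.
                values = value if isinstance(value, list) else [value]
                found.extend(v.strip() for v in values if isinstance(v, str))
            else:
                found.extend(find_subjects(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_subjects(item))
    elif isinstance(node, str):
        # Free text: look for "Subject:" lines, e.g. in a quoted email body.
        found.extend(m.strip() for m in SUBJECT_LINE.findall(node))
    return found

# Made-up sample shaped roughly like the file in the question.
sample = json.loads("""
{
  "body": [{"content": "From: someone\\nSubject: Warmest Wishes!\\n\\nDear Jerry,\\n"}],
  "header": {"header": {"subject": "FW: Warmest Wishes!"}}
}
""")

subjects = find_subjects(sample)
for subject in subjects:
    print(subject)
```

This returns clean strings rather than quoted fragments of the file, and it does not depend on where the line breaks fall in the raw text.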
FailSafe
  • This doesn't work for me. I used the regex and below is my code: `filename = "45.json" with open(filename) as file: string = file.read() email_subjects = re.findall('([\'\"]*[S|s]ubject[:\s]*?(?:[\'|\"]*[\S\s]*?(?=[\'|\"])*))(?=\n|$)', string) print(email_subjects[0])` and this prints out the entire json as one dict with the subject line 'Subject: Warmest Wishes! You have a Happy Thanksgiving ' at the beginning of the dictionary. – py_noob Mar 15 '19 at 15:33
  • Hmmm... That shouldn't be possible given the format of the file you posted. I edited in 2 other examples using what you provided as a string. let me know what your output is for those. – FailSafe Mar 15 '19 at 16:15
  • Yes your example works perfectly, I just tested it out. However it looks like when I use it with the complete json file ( posted below) it just outputs the whole file as a string :( – py_noob Mar 15 '19 at 17:40
  • I cannot post the json file here due to character limitation – py_noob Mar 15 '19 at 17:43
  • Hmmm... it might be because the original JSON File has no line breaks. If that's the case I'll make another edit. One second – FailSafe Mar 15 '19 at 18:00
  • I made the edit a while back. Let me know if the new version works – FailSafe Mar 16 '19 at 00:00
  • I missed this updated edit . I tried it out but it still prints out the entire json file as a string. I used this to print out the first line of subject it encounters: `re.search('([\'\"]*[S|s]ubject[:\s]*?(?:[\'|\"]*[\S\s]*?(?=[\'|\"])*))(?=\n|$)', string)` and it works fine except the string is always truncated and I am trying to fix that - this is how the output appears : `<_sre.SRE_Match object; span=(428, 2733), match='Subject: telepathy\\nTo: \\"Moustakas, Leonidas A>` – py_noob Mar 18 '19 at 18:09
  • Not sure how to get rid of the `<_sre.SRE_Match object; span=(428, 2733), match=` – py_noob Mar 18 '19 at 18:10
  • That's so odd. Hmmm, are you willing to send me the file, or upload it to sendspace or github? – FailSafe Mar 18 '19 at 22:15
  • do you have an email that you can share here? I can't share on github unfortunately. – py_noob Mar 20 '19 at 15:16
  • Do you feel comfortable uploading it to sendspace? – FailSafe Mar 20 '19 at 22:58
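
A note on the last few comments: `<_sre.SRE_Match object; span=..., match=...>` is only the printed representation of the match object, and that representation truncates the matched text. Calling `.group(0)` on the match returns the full string. The span `(428, 2733)` also suggests the match ran well past the subject, which fits the guess that the raw file has no real line breaks: in raw JSON text, a newline inside a string is stored as the two characters `\n`, so a lookahead for an actual newline only succeeds at the end of the file. A minimal sketch, assuming that is what is happening (the sample text and the pattern are made up for illustration):

```python
import re

# Made-up raw JSON text. The r'' prefix keeps each \n as two characters
# (backslash + n), which is how newlines appear inside a JSON file on disk.
raw = r'{"body": "From: a@example.com\nSubject: telepathy\nTo: someone\n"}'

# Stop at an escaped "\n", a real newline, a double quote, or end of string.
pattern = re.compile(r'Subject:\s*(.*?)(?=\\n|\n|"|$)', re.IGNORECASE)

match = pattern.search(raw)
if match:
    print(match.group(0))  # the whole matched text
    print(match.group(1))  # just the subject, without the "Subject:" prefix
```

Because the subject is captured lazily up to the first of those terminators, the match stays on one logical line even when the file is a single physical line.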