
I've been working on this for about three days and am as lost as I can be. I've created a script that downloads a single message from Gmail using the Google API. I need to extract the To, CC, and BCC addresses from that message and store them in a list. I'll eventually need to process a large number of messages but I cannot even extract those values from one message. The Gmail JSON object is a mix of dict and list objects:

msg (dict-8)  
--historyId (str-1)  
--id (str-1)  
--internalDate (str-1)  
--labelIds (list-1)  
--payload (dict-2)  
-- --headers (list-1)  
-- -- --unnamed index 0 (dict-2)  
-- -- -- --name:To (str-1)  
-- -- -- --value:gself@gmail.com (str-1)  
-- --mimeType (str-1)  
--sizeEstimate (int-1)  
--snippet (str-1)  
--threadId (str-1)  

For my project, I need the value of the 'To' address (I will also eventually want the CC and BCC data, but can apply whatever works for 'To' to find those values). An early effort was to simply extract that value by navigating to it using something like "msg['payload']['headers'][0]['value']". That works fine for this one message, but the JSON structure doesn't seem to be consistent and the index number for 'To' in the headers list is unpredictable. So I need to find a way to search for all 'name:To' keys and extract the value item from that list element. I'm in way over my head on this.
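Since the position of 'To' in the headers list is unpredictable, one option is to ignore the index and match on the 'name' field instead. The following is a minimal sketch, assuming the message has the shape shown in the tree above; the function name `get_header_values` and the sample Subject and Cc entries are made up for illustration.

```python
def get_header_values(msg, names=("To", "Cc", "Bcc")):
    """Return {lowercased header name: [values]} for the requested headers.

    Names are compared case-insensitively, so 'CC' and 'Cc' both match.
    """
    wanted = {n.lower() for n in names}
    found = {}
    # .get() with defaults keeps this from raising if a message has no headers
    for header in msg.get("payload", {}).get("headers", []):
        name = header.get("name", "").lower()
        if name in wanted:
            found.setdefault(name, []).append(header.get("value", ""))
    return found


# Hypothetical message; only the To entry comes from the question.
sample_msg = {
    "id": "abc123",
    "payload": {
        "headers": [
            {"name": "Subject", "value": "Hello"},
            {"name": "To", "value": "gself@gmail.com"},
            {"name": "Cc", "value": "someone@example.com"},
        ],
        "mimeType": "text/plain",
    },
}

print(get_header_values(sample_msg))
```

Because the lookup is by name, it does not matter where in the list the 'To' entry lands or how many other headers surround it.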

I've tried several different JSON functions in various packages without luck. I looked at pandas and think that there may be some hope there, but found nothing that I could figure out. I tried a simple regex search, but I can't search a dict object. I tried flattening the dict, but that didn't help much (even when flat, the 'To' lines contain the index number, so the key is somewhat unpredictable). I tried various for loops but found it difficult to iterate down through the levels. I've tried several different iterators I found online, but they did not seem to work for me, though I suspect that I simply did not know what I was doing.

The only potential solution I came up with was dumping the dict to a variable using json.dumps and then doing a regex search for email addresses in that variable. While I think that should work, it strikes me that there must be a more direct solution than creating a variable and searching in that variable.
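For reference, the dump-and-regex idea described above could look roughly like this. It is only a sketch: the pattern `EMAIL_RE` is a deliberately simple assumption, not a full address grammar, and the function name `find_emails_in_json` and the sample message are hypothetical.

```python
import json
import re

# Simplified pattern (an assumption): good enough for typical addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails_in_json(obj):
    """Serialize obj to a JSON string and return the unique addresses found in it."""
    text = json.dumps(obj)
    unique = []
    for addr in EMAIL_RE.findall(text):
        if addr not in unique:  # keep first-seen order, drop repeats
            unique.append(addr)
    return unique


# Hypothetical message: note the extra address inside the snippet text.
sample_msg = {
    "id": "abc123",
    "payload": {"headers": [{"name": "To", "value": "gself@gmail.com"}]},
    "snippet": "Reply to bob@example.com",
}

print(find_emails_in_json(sample_msg))
```

The drawback shows up in the sample: the search cannot tell a recipient from an address that merely appears in the snippet or body, so it returns both.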

Is there a package that would help me extract a buried element (email addresses) from a list in a Gmail JSON object? Maybe I could search for email addresses that appear in any of the values, but I'm not sure how to search down three levels in the structure. Maybe someone has developed a function that can search through a JSON object. Maybe there is another solution and I just don't have enough experience to craft on my own. I'm deeply grateful for any help I can get on this.
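No extra package is strictly needed to search several levels down; a short recursive generator can walk any mix of dicts and lists. The sketch below assumes headers are stored as {'name': ..., 'value': ...} dicts, as in the tree above. The function name `iter_header_values` and the nested 'parts' sample are illustrative assumptions.

```python
def iter_header_values(node, target_name):
    """Yield the 'value' of every {'name': ..., 'value': ...} dict whose
    name matches target_name (case-insensitive), at any nesting depth."""
    if isinstance(node, dict):
        if "value" in node and str(node.get("name", "")).lower() == target_name.lower():
            yield node["value"]
        # Keep descending: nested dicts or lists may hold more headers
        for child in node.values():
            yield from iter_header_values(child, target_name)
    elif isinstance(node, list):
        for child in node:
            yield from iter_header_values(child, target_name)


# Hypothetical message with a second header list nested one level deeper.
sample_msg = {
    "id": "abc123",
    "payload": {
        "headers": [{"name": "To", "value": "gself@gmail.com"}],
        "parts": [
            {"headers": [{"name": "To", "value": "other@example.com"}]},
        ],
    },
}

print(list(iter_header_values(sample_msg, "To")))
```

This is more machinery than a single flat headers list requires, but it keeps working if the structure turns out to be nested differently from one message to the next.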

Solution

I think I found a solution and owe the community an apology for making a mountain out of a molehill. It turns out that the only part of the JSON object that changes is the length of the 'headers' list, and I can easily walk through that list and store all of the email addresses with this:

msgAddr = []  # start with an empty list; += with a string would add it character by character
for getAddr in msg['payload']['headers']:
    msgAddr.append(getAddr['value'])
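One caveat with collecting raw header values: a single To or Cc value can hold several recipients, often with display names. A possible refinement, sketched here with the standard library's `email.utils.getaddresses`, splits those values into bare addresses. The function name `extract_addresses` and the sample message are assumptions for illustration.

```python
from email.utils import getaddresses


def extract_addresses(msg, names=("to", "cc", "bcc")):
    """Return the bare email addresses from the named headers of one message."""
    values = [
        header["value"]
        for header in msg["payload"]["headers"]
        if header["name"].lower() in names
    ]
    # getaddresses() parses each value into (display name, address) pairs
    return [addr for _display, addr in getaddresses(values) if addr]


# Hypothetical message with two recipients in one To header.
sample_msg = {
    "payload": {
        "headers": [
            {"name": "Subject", "value": "Hello"},
            {"name": "To", "value": "George Self <gself@gmail.com>, bob@example.com"},
            {"name": "Cc", "value": "carol@example.com"},
        ]
    }
}

print(extract_addresses(sample_msg))
```

Filtering on the header name also keeps non-address headers such as Subject out of the list, which matters if the message was fetched with all of its headers.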

A two-line solution for a three-day problem. Now I'll slink back to my cave...

George Self
  • It might help to post the object in actual JSON format, as well as multiple examples, since you say it's inconsistent. – Cody Dec 01 '18 at 06:05
  • The technique in my [answer](https://stackoverflow.com/a/14059645/355230) to a related JSON question would probably work. – martineau Dec 01 '18 at 09:31
  • I second @Cody's motion. If you [edit] your question to do so, I could show you how to use my previously linked answer to do it. – martineau Dec 01 '18 at 09:39
