
I am working on converting a dictionary to a pandas DataFrame, but the DataFrame does not show the columnar format I would like. It should look like Image1 (another example that I did), but the data shows up as in Image2.

In the 1st example (Image1) I was using a single URL for a news source. In the 2nd example (Image2) I have a for loop to parse multiple URLs for news sources.

I also see that my dictionary in the 2nd example is wrapped in two sets of brackets ("[[ ]]"), unlike the first one, which has a single "[ ]".

I can provide more details if needed. Please help me if you can.

Thank you all in advance.

Image1 - dictionary to pandas dataframe output shows up fine

Image2 - dictionary to pandas dataframe output DOES NOT show up fine


extractEntities function code here:

import requests
import json

def extractEntities(url):
    endpoint_watson = "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze"
    params = {
        'version': '2020-09-12',
    }
    headers = { 
        'Content-Type': 'application/json',
    }
    watson_options = {
      "url": url,
      "features": {
        "entities": {
          "sentiment": True,
          "emotion": True,
          "limit": 100
        }
      }
    }
    username = "apikey"
    password = "<<myAPIKeyinfo>>"

    resp = requests.post(endpoint_watson, 
                         data=json.dumps(watson_options), 
                         headers=headers, 
                         params=params, 
                         auth=(username, password) 
                        )
    results = resp.json()
    article_dict = []
    if "entities" in results:
      for i in results.get('entities'):
        initial_dict = {}
        initial_dict['entity'] = i['text']
        initial_dict['url'] = url
        initial_dict['source'] = url.split('.')[1]
        initial_dict['relevance'] = i['relevance']
        initial_dict['sentiment'] = i['sentiment']['score']
        article_dict.append(initial_dict)

      return article_dict

Then I extract some news entities

s3 = 'the-wall-street-journal'
allurls3 = []
allurls3 = getNews(s3)
allurls3

And below is the code that calls the extractEntities function. It also contains another for loop:

dict1 = []
for u in range(len(allurls3)):
  data3 = []
  url3 = allurls3[u]
  data3 = extractEntities(url3)
  dict1.append(data3)
dict1
1 Answer


Thanks for posting the code. In the future, please do not upload images of code/errors when asking a question, and try to make it a Minimal, Reproducible Example. I don't have a Watson API key, so I couldn't reproduce your example completely, but what it does is basically the following:

In extractEntities(url) you make an API call to Watson NLP service and for each entity found in the response you create a dictionary with the relevance, sentiment and so on. In the end you return a list of all those dictionaries. Let's make a dummy function to simulate this, based on the code you provided, so that I can try to reproduce the problem you are having.

import random
import pandas as pd

def extractEntities(url):
  article_dict = [] # actually a list, not a dict!!
  for entity in ('Senate', 'CNN', 'Hillary Clinton', 'Bill Clinton'):
      initial_dict = {}
      initial_dict['entity'] = entity
      initial_dict['url'] = url
      initial_dict['source'] = url.split('.')[1]
      initial_dict['relevance'] = random.random()
      initial_dict['sentiment'] = random.random()
      article_dict.append(initial_dict)
  return article_dict # returns a list of dictionaries

Sample output is a list of dictionaries:

>>> extractEntities('https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html')
[{'entity': 'Senate',
  'relevance': 0.4000160139190754,
  'sentiment': 0.012884391182820587,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
 {'entity': 'CNN',
  'relevance': 0.44921272670354884,
  'sentiment': 0.40996636370319894,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
 {'entity': 'Hillary Clinton',
  'relevance': 0.4892046288027784,
  'sentiment': 0.5424038672663258,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
 {'entity': 'Bill Clinton',
  'relevance': 0.7237361288162582,
  'sentiment': 0.8269245953553733,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'}]
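As a sanity check, feeding this flat list of dictionaries straight to `pd.DataFrame` already gives the columnar layout from Image1. Here is a self-contained sketch using the dummy `extractEntities` from above (random scores stand in for the real Watson output):

```python
import random
import pandas as pd

# Self-contained copy of the dummy extractEntities defined above.
def extractEntities(url):
    article_dict = []  # a list of dicts, despite the name
    for entity in ('Senate', 'CNN', 'Hillary Clinton', 'Bill Clinton'):
        article_dict.append({
            'entity': entity,
            'url': url,
            'source': url.split('.')[1],
            'relevance': random.random(),
            'sentiment': random.random(),
        })
    return article_dict

# A flat list of dicts becomes one row per dict, one column per key.
df = pd.DataFrame(extractEntities('https://us.cnn.com/2020/09/15/example.html'))
print(df.shape)  # (4, 5)
```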

Now you have a list of URLs in allurls3 and do the following:

  • You create an empty list, confusingly named dict1
  • You loop over the URLs in allurls3
  • You call extractEntities on each URL; data3 now holds a list of dictionaries (see above)
  • You append that list of dictionaries to the list dict1. The end result, dict1, is a list of lists of dictionaries:
    >>> allurls3 = ['https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html', 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305']
    >>> dict1 = []
    >>> for u in range(len(allurls3)):
    ...     data3 = []
    ...     url3 = allurls3[u]
    ...     data3 = extractEntities(url3)
    ...     dict1.append(data3)
    ...
    >>> dict1
    [[{'entity': 'Senate',
       'relevance': 0.19115763152061027,
       'sentiment': 0.557935869111337,
       'source': 'cnn',
       'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
      {'entity': 'CNN',
       'relevance': 0.9259134250004917,
       'sentiment': 0.8605677705216526,
       'source': 'cnn',
       'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
      {'entity': 'Hillary Clinton',
       'relevance': 0.6071084891165042,
       'sentiment': 0.04296592154310419,
       'source': 'cnn',
       'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
      {'entity': 'Bill Clinton',
       'relevance': 0.9558183603396242,
       'sentiment': 0.42813857092335783,
       'source': 'cnn',
       'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'}],
     [{'entity': 'Senate',
       'relevance': 0.5060582500660554,
       'sentiment': 0.9240451580369043,
       'source': 'wsj',
       'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'},
      {'entity': 'CNN',
       'relevance': 0.03956002547473547,
       'sentiment': 0.5337343576461046,
       'source': 'wsj',
       'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'},
      {'entity': 'Hillary Clinton',
       'relevance': 0.6706912125534789,
       'sentiment': 0.7721987482202004,
       'source': 'wsj',
       'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'},
      {'entity': 'Bill Clinton',
       'relevance': 0.37377943134631464,
       'sentiment': 0.7114485187747178,
       'source': 'wsj',
       'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'}]]

And finally you wrap this list of lists of dictionaries, dict1, in yet another list to turn it into a pandas DataFrame.

>>> pd.set_option('display.max_colwidth', 800)
>>> articles_df1 = pd.DataFrame([dict1])
>>> articles_df1

(image: the resulting DataFrame, a single row whose cells contain whole lists, reproducing the problem from Image2)

OK, now that I have been able to reproduce your error, I can tell you how to fix it. You know from the first image you posted that you need to provide pd.DataFrame with a flat list of dictionaries, not with a list of lists of dictionaries wrapped in yet another list, as you are doing now.
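The shape mismatch is easy to see with a tiny sketch (dummy dicts standing in for the real entities):

```python
import pandas as pd

# dict1 as built in the question: one inner list per URL.
row = {'entity': 'Senate', 'relevance': 0.5}
dict1 = [[row, row], [row, row]]

# Wrapping once more gives a single row whose cells hold whole lists.
bad = pd.DataFrame([dict1])
# Flattening first gives one row per dict, with proper columns.
flat = [d for per_url in dict1 for d in per_url]
good = pd.DataFrame(flat)

print(bad.shape)   # (1, 2)
print(good.shape)  # (4, 2)
```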

Also, naming a list dict1 is very confusing. So instead do the following. The key difference is to use extend instead of append.

>>> entities = []
>>> for url3 in allurls3:
>>>     data3 = extractEntities(url3)
>>>     entities.extend(data3)
>>> entities
[{'entity': 'Senate',
  'relevance': 0.11594421982738612,
  'sentiment': 0.2917557430217993,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
 {'entity': 'CNN',
  'relevance': 0.5741596155387597,
  'sentiment': 0.7743716765722405,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
 {'entity': 'Hillary Clinton',
  'relevance': 0.2535272395046557,
  'sentiment': 0.2570270764910251,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
 {'entity': 'Bill Clinton',
  'relevance': 0.2275111369786037,
  'sentiment': 0.03312536097047081,
  'source': 'cnn',
  'url': 'https://us.cnn.com/2020/09/15/politics/donald-trump-biden-retweet/index.html'},
 {'entity': 'Senate',
  'relevance': 0.8197309413723833,
  'sentiment': 0.9492436947284604,
  'source': 'wsj',
  'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'},
 {'entity': 'CNN',
  'relevance': 0.7317312596198684,
  'sentiment': 0.5052344447199512,
  'source': 'wsj',
  'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'},
 {'entity': 'Hillary Clinton',
  'relevance': 0.3572239446181651,
  'sentiment': 0.056131606725058014,
  'source': 'wsj',
  'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'},
 {'entity': 'Bill Clinton',
  'relevance': 0.761777835912902,
  'sentiment': 0.28138007550393573,
  'source': 'wsj',
  'url': 'https://www.wsj.com/articles/hurricane-sally-barrels-into-alabama-11600252305'}]

Now you have a list of dictionaries that you can use to create a DataFrame:

>>> pd.set_option('display.max_colwidth', 800)
>>> articles_df1 = pd.DataFrame(entities)
>>> articles_df1
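If you already have the nested dict1 from the original loop, you can also flatten it after the fact, for example with itertools.chain.from_iterable (a sketch with dummy dicts in place of the real entities):

```python
import itertools
import pandas as pd

# Nested structure as produced by dict1.append(data3), one inner list per URL.
dict1 = [
    [{'entity': 'Senate', 'source': 'cnn'}, {'entity': 'CNN', 'source': 'cnn'}],
    [{'entity': 'Senate', 'source': 'wsj'}, {'entity': 'CNN', 'source': 'wsj'}],
]

# chain.from_iterable concatenates the inner lists into one flat list.
flat = list(itertools.chain.from_iterable(dict1))
articles_df1 = pd.DataFrame(flat)
print(articles_df1.shape)  # (4, 2)
```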

(image: the resulting DataFrame in the expected columnar format, as in Image1)

BioGeek
  • Thank you very much @biogeek. My for loop is below.. would you please guide me on what modification is possible for the loop to avoid the list within a list `dict1 = [] for u in range(len(allurls3)): data3 = [] url3 = allurls3[u] data3 = extractEntities(url3) dict1.append(data3) dict1` – espeva Sep 16 '20 at 14:10
  • 1) Please update your question with the code of your for loop. Posting it in a comment removes essential formatting. 2) This is a public forum, we don't do private reviews. Post your code here so that others in the future can learn from your problem and the answer to it as well. – BioGeek Sep 16 '20 at 18:38
  • 3) it seems you have a basic misunderstanding between what a `list` and a `dict` is. Or you have a strange sense of humor by using the variable `dict1` for an empty list `[]`. – BioGeek Sep 16 '20 at 18:39
  • 1
    Thanks for info @biogeek. Appreciate. I just posted my code – espeva Sep 16 '20 at 19:54
  • Thank you @biogeek. very helpful suggestions. and all guidance noted with thanks. I get a TypeError: 'NoneType' object is not iterable message on the line entities.extend(data3) – espeva Sep 17 '20 at 13:46
  • Carefully check all your types. You are trying to iterate over an object, which should probably be something like a list, but it is actually `None`. – BioGeek Sep 17 '20 at 14:03
  • is it possible to paste my full file (e.g. the .ipynb file) here, or is that frowned upon? – espeva Sep 17 '20 at 15:06
  • 1
    thank you for your help. I had an indentation error. that was causing some issues.. fixed now. thanks a million – espeva Sep 17 '20 at 16:52