
I'm new to parsing JSON strings for information. I used json.loads to parse a block of text, but I'm having trouble figuring out how to get just the titles.

Here's the code:

from alchemyapi import AlchemyAPI
import json

alchemyapi = AlchemyAPI()

def run_alchemy_api(articleurl):
    response = alchemyapi.entities('url',articleurl, { 'showSourceText':1, 'sourceText':'xpath', 'xpath':'//*[contains(@class,"title may-blank")][1]' })
    if response['status'] == 'OK':
        print('## Response Object ##')
        print(json.dumps(response, indent=4))
        json_string = json.dumps(response, indent=4)
        titles = json.loads(json_string)
        print('This is the decode test,')
        print(titles) # <---- this is what I want to organize into a list
    else:
        print('Error in entity extraction call: ', response['statusInfo'])

run_alchemy_api('http://www.reddit.com/r/worldnews/')

I just want the u'text' field. Here is part of the output:

{u'status': u'OK', u'language': u'english', u'text': u'Lego is now the world\u2019s largest toymaker, as kids choose bricks over Barbie\n\nAfter convincing China to give up shark fin soup, Yao Ming sets out to save Africa\'s elephants from the ivory trade\n\nThree top ISIS lieutenants killed in US bombing raid\n\nAnonymous Really Wants a Cyberwar with the Islamic State\n\nBP found \'grossly negligent\' in 2010 Gulf oil spill\n\nA group of indigenous people in Brazil\'s Amazon region have detained and expelled loggers working illegally in their ancestral lands.\n\nAnti-ISIS flag-burning campaign launched by a trio of fearless Lebanese teens have ignited an Internet anti-terror sensation\n\nNova Scotia to ban fracking\n\nWHO and others criticised by numerous experts for misleading the public by publishing an ignorant and alarmist report into E-Cigarettes.\n\nRussia warns NATO not to offer membership to Ukraine\n\nKorean 20 year old dies in military service after a month of systematic beating, military is accused of covering up bullying\n\nNATO Chief to Russia: Pull Troops From Ukraine\n\nLarge asteroid to pass "very close" to Earth on Sunday\n\nNew dinosaur discovered! Ancient behemoth: Meet Dreadnoughtus, a supermassive dino\n\nThe U.N. nuclear watchdog said it has seen releases of steam and water indicating that North Korea may be operating a reactor, in the latest update on a plant that experts say could make plutonium for atomic bombs.\n\nWorld-first experiment achieves direct brain-to-brain communication in human subjects\n\nNATO allies to supply Ukraine with lethal military equipment\n\nUS doctor infected with Ebola heading to Nebraska\n\nNorth Korea\'s suicide rate among worst in world, says WHO report\n\nIslamic State Using Leaked Snowden Info To Evade Intelligence - U.S. 
Military Official Said Most Mid-Level And High-Ranking Islamic State Operators Have Virtually Disappeared, Giving No Hint As To Their Whereabouts Or Actions.\n\nEbola epidemic in West Africa is outpacing current responses.\u201cThe window of opportunity to stop Ebola from spreading widely throughout Africa and becoming a global threat for years to come is closing, but it is not yet closed,\u201d\n\nGrim Ebola Prediction: Outbreak Is Unstoppable for Now, MD Says\n\nFor the first time, scientists glimpse inside the cosmic nursery to see baby planets form\n\nCanadian beekeepers sue Bayer, Syngenta over neonicotinoid pesticides for over $400 million\n\nUkraine army on alert to repel possible rebel attack near Mariupol - military source', u'entities': [{u'relevance': u'0.803767', u'count': u'4', u'type': u'Country', u'text': u'Ukraine'}, {u'relevance': u'0.671762', u'count': u'3', u'type': u'Organization', u'disambiguated': {u'website': u'http://www.natoonline.org/', u'yago': u'http://yago-knowledge.org/resource/National_Association_of_Theatre_Owners', u'name': u'National Association of Theatre Owners', u'freebase': u'http://rdf.freebase.com/ns/m.031hx_', u'subType': [], u'dbpedia': u'http://dbpedia.org/resource/National_Association_of_Theatre_Owners'}, u'text': u'NATO'}, {u'relevance': u'0.564646', u'count': u'3', u'type': u'HealthCondition', u'text': u'Ebola'}, {u'relevance': u'0.543892', u'count': u'3', u'type': u'Region', u'text': u'West Africa'}, {u'relevance': u'0.521051', u'count': u'2', u'type': u'FieldTerminology', u'text': u'military equipment'}, {u'relevance': u'0.491148', u'count': u'2', u'type': u'Country', u'disambiguated': {u'website': u'http... and so on

How do I go about just extracting the u'text' titles into something like this?

articles = [Lego is now the world\u2019s largest toymaker, as kids choose bricks over Barbie, After convincing China to give up shark fin soup, Yao Ming sets out to save Africa\'s elephants from the ivory trade ... etc.]

3 Answers


It looks like the titles in text are separated by two newlines (Unix style). So extract the text key from your response dict (don't convert it to JSON and back to Python) and split it into titles.

text = response['text']
titles = text.split('\n\n')
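
As a minimal sketch of the split, assuming the response has the shape shown in the question (the `response` dict below is a hand-made stand-in with three of the titles, not real API output). It also strips whitespace and drops empty pieces, which may or may not be needed for real responses:

```python
# Hypothetical stand-in for the dict returned by alchemyapi.entities()
response = {
    'status': 'OK',
    'text': 'Nova Scotia to ban fracking\n\n'
            'NATO Chief to Russia: Pull Troops From Ukraine\n\n'
            'US doctor infected with Ebola heading to Nebraska',
}

# Split on blank lines; strip stray whitespace and skip empty pieces
titles = [t.strip() for t in response['text'].split('\n\n') if t.strip()]

# Printing each element shows the plain title text
for title in titles:
    print(title)
```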
semptic
  • So close! Now I get `[u'Lego is now the world\u2019s largest toymaker, as kids choose bricks over Barbie', u"After convincing China to give up shark fin soup, Yao Ming sets out to save Africa's elephants from the ivory trade", u'Three top ISIS lieutenants killed in US bombing raid', u'Anonymous Really Wants a Cyberwar with the Islamic State', ` – Phillipe Dongwoo Han Sep 05 '14 at 08:57
  • What exactly do you want? This? `[u'Lego is now the world\u2019s largest toymaker", u"as kids choose bricks over Barbie', u"After convincing China to give up shark fin soup", u"Yao Ming sets out to save Africa's elephants from the ivory trade", u'Three top ISIS lieutenants killed in US bombing raid', u'Anonymous Really Wants a Cyberwar with the Islamic State', ...]` – semptic Sep 05 '14 at 09:01
  • I want to get rid of the u'. Not sure why that random u" is there either. So: [Lego is now the world\u2019s largest toymaker, as kids choose bricks over Barbie', After convincing China to give up shark fin soup, Yao Ming sets out to save Africa's elephants from the ivory trade", etc..] – Phillipe Dongwoo Han Sep 05 '14 at 09:02
  • The u indicates that this is not a "normal" string; it's a Unicode string. For a fuller explanation, see http://stackoverflow.com/a/11279428/3764701 or https://docs.python.org/3.4/howto/unicode.html – semptic Sep 05 '14 at 09:04
  • Thanks for the explanation. Is there a way I can convert that list into a list of normal strings? – Phillipe Dongwoo Han Sep 05 '14 at 09:25
  • For most purposes, Unicode is the better way to store strings. To convert to ASCII you can use `u"foo bar".encode('ascii')`. For a better understanding, I recommend reading the Python docs (https://docs.python.org/2/howto/unicode.html) – semptic Sep 05 '14 at 11:15
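
A note on the conversion discussed in these comments: the u prefix is only part of how Python 2 displays a list of Unicode strings, and printing each element on its own shows the plain text. If byte strings are really needed, one caveat is that the first title contains a curly apostrophe (U+2019), which plain ASCII cannot represent. The sketch below, with a hand-copied two-title sample, shows what may happen and two possible workarounds:

```python
titles = [
    u'Lego is now the world\u2019s largest toymaker, as kids choose bricks over Barbie',
    u'Nova Scotia to ban fracking',
]

# Plain .encode('ascii') raises UnicodeEncodeError on the non-ASCII apostrophe
try:
    titles[0].encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every character, so this encoding always succeeds
utf8_titles = [t.encode('utf-8') for t in titles]

# Alternatively, silently drop characters that ASCII cannot represent
ascii_titles = [t.encode('ascii', 'ignore') for t in titles]
```

Which option fits depends on whether losing the non-ASCII characters is acceptable; keeping the strings as Unicode avoids the choice entirely.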

After parsing the JSON, you need to extract the text manually, like this:

json.loads(json_string).get('text')

If you are working with huge JSON files, try an iterative JSON parser such as ijson.

Alex Lisovoy
  • It's just a dict, and standard access is `json.loads(json_string)['text']` – tdelaney Sep 05 '14 at 08:38
  • Thanks but it went back into original format like this '[u'Lego is now the world\u2019s largest toymaker, as kids choose bricks over Barbie\n\nAfter convincing China to give up shark fin soup, Yao Ming sets out to save Africa\'s elephants from the ivory trade\n\n...]' I'm trying to get [Title 1, Title 2, etc.] – Phillipe Dongwoo Han Sep 05 '14 at 08:42

The response is a Python dict and 'text' is one of its keys. Just use it. There are many ways to make a list. One is to pass a list in and add the title on success.

def run_alchemy_api(articleurl, article_list):
    response = alchemyapi.entities('url',articleurl, { 'showSourceText':1, 'sourceText':'xpath', 'xpath':'//*[contains(@class,"title may-blank")][1]' })
    if response['status'] == 'OK':
        print(response['text'])
        article_list.append(response['text'])
    else:
        print('Error in entity extraction call: ', response['statusInfo'])


urls = [ 'url1', ...]
titles = []
for url in urls:
    run_alchemy_api(url, titles)
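
An alternative to passing a list in is to return the titles and let the caller combine them. This is only a sketch: `extract_titles`, `collect_titles`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call so the example runs without network access. It also assumes the titles are separated by blank lines, as in the output shown in the question.

```python
def extract_titles(response):
    """Return the titles from a response dict, or an empty list on failure."""
    if response.get('status') != 'OK':
        return []
    text = response.get('text', '')
    return [t.strip() for t in text.split('\n\n') if t.strip()]


def collect_titles(urls, fetch):
    """Call fetch(url) for each URL and gather all titles into one flat list."""
    titles = []
    for url in urls:
        titles.extend(extract_titles(fetch(url)))
    return titles


# Hypothetical stub standing in for the real API call
def fake_fetch(url):
    return {'status': 'OK',
            'text': 'Title A from %s\n\nTitle B from %s' % (url, url)}


all_titles = collect_titles(['url1', 'url2'], fake_fetch)
print(all_titles)
```

With the real client, `fetch` could be a small wrapper around the `alchemyapi.entities(...)` call from the question.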
tdelaney