I'm trying to pull just the keywords from XML output like the one shown at:

http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a

I have put together the code below, but I don't get any errors or any output. Any ideas?

import urllib2 as ur
import re

f = ur.urlopen(u'http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a')
res = f.readlines()
for d in res:
  data = re.findall('<CompleteSuggestion><\/CompleteSuggestion>',d)
  for i in data:
    print i
    file = open("keywords.txt", "a")
    file.write(i + '\n')
    file.close()

I am trying to:

  1. Fetch the XML from the given URL
  2. Store the list of keywords from the XML, parsed using a regex

Thanks,

Gurupad Hegde
BubblewrapBeast
  • Did you check that the regex in findall works correctly (by setting some constant content into 'd')? Also, try adding r before the regex string, e.g. r'<\/CompleteSuggestion>'. – Baruch Oxman Jun 03 '15 at 14:17
  • Hey Baruch, I'm not that great at regex. I'm guessing I did something wrong within the regex itself. – BubblewrapBeast Jun 03 '15 at 14:22
  • You should use one of the numerous XML libraries included in the Python standard library. – Iguananaut Jun 03 '15 at 14:30
  • possible duplicate of [How do I parse XML in Python?](http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python) – Iguananaut Jun 03 '15 at 14:31
  • (As an aside, you don't need to open the file and close it on every loop. Just open it once before your loops, write to it in the loop, and close it after all writing is finished) – Iguananaut Jun 03 '15 at 14:33
  • It is not clear what you are trying to extract here. Can you post the expected output for a sample input XML? – Gurupad Hegde Jun 03 '15 at 14:36
  • So I am looking to have Python go to http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a and give me the text, such as: test anxiety, test america – BubblewrapBeast Jun 03 '15 at 14:44
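
The suggestion in the comments to use a standard-library XML parser could look roughly like the sketch below. It is not taken from the thread: the `SAMPLE_XML` string is an assumed stand-in for the response (shaped so that each keyword sits in a `data` attribute of a `suggestion` element), and `extract_keywords` is a hypothetical helper name. The real response may be laid out differently, and fetching it from the URL is left out so the example runs offline.

```python
import xml.etree.ElementTree as ET

# Assumed sample, not the real response: each keyword is stored in the
# "data" attribute of a <suggestion> element.
SAMPLE_XML = (
    '<?xml version="1.0"?>'
    '<toplevel>'
    '<CompleteSuggestion><suggestion data="test anxiety"/></CompleteSuggestion>'
    '<CompleteSuggestion><suggestion data="test america"/></CompleteSuggestion>'
    '</toplevel>'
)


def extract_keywords(xml_text):
    """Return the data attribute of every <suggestion> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        node.get("data")
        for node in root.iter("suggestion")
        if node.get("data") is not None
    ]


keywords = extract_keywords(SAMPLE_XML)
print(keywords)
```

With a parser there is no pattern to get wrong, and attribute values containing escaped characters are decoded for you.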

1 Answer

from urllib2 import urlopen
import re

xml_url = 'http://clients1.google.com/complete/search?hl=en&output=toolbar&q=test+a'
xml_lines = urlopen(xml_url).readlines()

# Open the output file once, rather than once per keyword
with open("keywords.txt", "a") as keywords_file:
    for line in xml_lines:
        # Capture the value of every data="..." attribute on this line
        for keyword in re.findall(r'data="([^"]*)"', line):
            print keyword
            keywords_file.write(keyword + '\n')

output:

test anxiety
test america
test adobe flash
test automation
test act
test alternator
test and set
test adblock
test adobe shockwave
test automation tools

Let me know if anything is unclear.

Gurupad Hegde