I just got into coding, and into Python in particular. I'm currently working on a web crawler, and I need to save my data to a JSON file so I can import it into MongoDB.

import requests
import json
from bs4 import BeautifulSoup 

url= ["http://www.alternate.nl/html/product/listing.html?filter_5=&filter_4=&filter_3=&filter_2=&filter_1=&size=500&lk=9435&tk=7&navId=11626#listingResult"] 

amd = requests.get(url[0])
soupamd = BeautifulSoup(amd.content) 

prodname = [] 
adinfo = [] 
formfactor = []
socket = [] 
grafisch = []
prijs = []

a_data = soupamd.find_all("div", {"class": "listRow"}) 
for item in a_data: 
    try:
        prodname.insert(len(prodname),item.find_all("span", {"class": "name"})[0].text)
        adinfo.insert(len(adinfo), item.find_all("span", {"class": "additional"})[0].text)
        formfactor.insert(len(formfactor), item.find_all("span", {"class": "info"})[0].text)
        grafisch.insert(len(grafisch), item.find_all("span", {"class": "info"})[1].text)
        socket.insert(len(socket), item.find_all("span", {"class": "info"})[2].text)
        prijs.insert(len(prijs), item.find_all("span", {"class": "price right right10"})[0].text)
    except: 
        pass

I'm stuck at this part. I want to export the data that I saved in the arrays to a JSON file. This is what I have now:

file = open("mobos.json", "w")

for  i = 0:  
    try: 
        output = {"productnaam": [prodname[i]],
        "info" : [adinfo[i]], 
        "formfactor" : [formfactor[i]],
        "grafisch" : [grafisch[i]],
        "socket" : [socket[i]], 
        "prijs" : [prijs[i]]} 
        i + 1
        json.dump(output, file)
        if i == 500: 
            break
    except: 
        pass 

file.close()

So I want to create a dictionary format like this:

{"productname" : [prodname[0]], "info" : [adinfo[0]], "formfactor" : [formfactor[0]] .......}
{"productname" : [prodname[1]], "info" : [adinfo[1]], "formfactor" : [formfactor[1]] .......}
{"productname" : [prodname[2]], "info" : [adinfo[2]], "formfactor" : [formfactor[2]] .......} etc.
Martijn Pieters
henktenk
  • You may want to read the Python tutorial on looping again, and on lists. Don't use `listobject.insert(len(listobject), ...)`, use `listobject.append(..)` for example, and why not add all information to **one** list (as dictionaries, for example), then just loop over that one list? You can use `for item in listobject:` and not need to index. – Martijn Pieters Nov 24 '14 at 10:31
  • And you *really* don't want to use `try...except` without specific exceptions; don't mask your errors like that. – Martijn Pieters Nov 24 '14 at 10:32

2 Answers

Create dictionaries to begin with, in one list, then save that one list to a JSON file so you have one valid JSON object:

soupamd = BeautifulSoup(amd.content) 
products = []

for item in soupamd.select("div.listRow"):
    prodname = item.find("span", class_="name")
    adinfo = item.find("span", class_="additional")
    formfactor, grafisch, socket = item.find_all("span", class_="info")[:3]
    prijs = item.find("span", class_="price")
    products.append({
        'prodname': prodname.text.strip(),
        'adinfo': adinfo.text.strip(),
        'formfactor': formfactor.text.strip(),
        'grafisch': grafisch.text.strip(),
        'socket': socket.text.strip(),
        'prijs': prijs.text.strip(),
    })

with open("mobos.json", "w") as outfile:
    json.dump(products, outfile)

If you really want to produce separate JSON objects, one on each line, write newlines in between so you can at least find these objects back again (parsing is going to be a beast otherwise):

with open("mobos.json", "w") as outfile:
    for product in products:
        json.dump(product, outfile)
        outfile.write('\n')

Because we now have one list of objects, looping over that list with for is far simpler.
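A file written with one object per line can be read back line by line. The following is only a minimal sketch: the sample records and the helper names `write_json_lines` and `read_json_lines` are made up for illustration and are not part of the original code.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for the scraped products.
products = [
    {"prodname": "Board A", "prijs": "49,90"},
    {"prodname": "Board B", "prijs": "59,90"},
]


def write_json_lines(path, records):
    """Write one JSON object per line."""
    with open(path, "w") as outfile:
        for record in records:
            json.dump(record, outfile)
            outfile.write("\n")


def read_json_lines(path):
    """Parse each non-empty line as its own JSON object."""
    with open(path) as infile:
        return [json.loads(line) for line in infile if line.strip()]


# Round trip through a temporary file so the sketch leaves nothing behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "mobos.json")
write_json_lines(path, products)
loaded = read_json_lines(path)
```

Each line is parsed independently here, so one malformed line can be located by its line number instead of breaking the whole file.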

Some other differences from your code:

  • Use list.append() rather than list.insert(); there is no need for such verbose code when there is a standard method for the task.
  • If you are looking for just one match, use element.find() rather than element.find_all().
  • You really want to avoid using blanket exception handling; you'll mask far more than you want to. Catch specific exceptions only.
  • I used str.strip() to remove the extra whitespace that usually is added in HTML documents; you could also add an extra ' '.join(textvalue.split()) to remove internal newlines and collapse whitespace, but this specific webpage doesn't seem to require that measure.
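Two of the points above, catching only specific exceptions and collapsing whitespace, could be sketched as follows. The helper names `clean_text` and `third_info` are hypothetical, and plain strings stand in for the text of the scraped elements.

```python
def clean_text(value):
    """Strip both ends and collapse internal runs of whitespace to one space."""
    return " ".join(value.split())


def third_info(values):
    """Return the cleaned third entry, or None if there are fewer than three."""
    try:
        return clean_text(values[2])
    except IndexError:
        # Only the error we expect is handled; anything else still surfaces.
        return None
```

With a bare `except:` a typo such as a misspelled variable name would be swallowed as well, which is why the handler names `IndexError` explicitly.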
Martijn Pieters
  • Thanks for the help! My output has some unicode charsets in it. Like this one: \u20ac. Is there a way to remove/replace that? – henktenk Nov 24 '14 at 11:37
  • @henkownz: all your output is Unicode; you mean you have non-ASCII characters. :-) You have the [U+20AC EURO SIGN](http://codepoints.net/U+20ac), properly escaped as JSON data, are you sure you want to get rid of those? You can always use explicit replacement (`str.replace()`) to remove those, or use `str.translate()` to remove multiple characters. Or you can use [`unidecode`](https://pypi.python.org/pypi/Unidecode) to replace anything non-ASCII with their closest ASCII equivalents. – Martijn Pieters Nov 24 '14 at 11:57
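The options mentioned in these comments can be compared side by side. This is a small sketch using only the standard library; the price string is invented for illustration.

```python
import json

# Hypothetical price value containing the euro sign (U+20AC).
product = {"prijs": "\u20ac 49,90"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(product)

# ensure_ascii=False keeps the character itself in the output.
literal = json.dumps(product, ensure_ascii=False)

# Explicit replacement removes the character from the value altogether.
stripped = product["prijs"].replace("\u20ac", "").strip()
```

Both `escaped` and `literal` decode to the same data, so the escape is harmless for any JSON parser. If `ensure_ascii=False` is used when writing to a file, it is safest to open the file with an explicit encoding such as `encoding="utf-8"`.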

Since the OP wanted a JSON with dictionary-like objects and did not specify that they should be in a list within the JSON, this code might work better:

with open("mobos.json", mode='wt') as outFile:
    for item in soupamd.select("div.listRow"):
        prodname = item.find("span", class_="name")
        adinfo = item.find("span", class_="additional")
        formfactor, grafisch, socket = item.find_all("span", class_="info")[:3]
        prijs = item.find("span", class_="price")
        tempDict = {
            'prodname': prodname.text.strip(),
            'adinfo': adinfo.text.strip(),
            'formfactor': formfactor.text.strip(),
            'grafisch': grafisch.text.strip(),
            'socket': socket.text.strip(),
            'prijs': prijs.text.strip(),
        }
        json.dump(tempDict, outFile)
        # json.dump does not add a newline itself, so write one per object
        outFile.write('\n')

Note that json.dump does not write a trailing newline, so each object is followed by an explicit newline. Without it the objects would run together on a single line and the file would be hard to parse back.

brethvoice