
I have code that pulls links from some news sites. I want to pull only the links that contain the name of the city Gdańsk. However, the correct spelling is not always used in the URLs, so I had to include gdańsk, gdansk, etc. I also want to pull links from different sites. I was able to add more words and sites, but that meant writing more for loops. Could you please direct me on how to make the code more efficient and shorter?

Second question: I'm exporting the links into a CSV file, where I want to gather them so I can analyse them later. I found out that if I replace "w" with "a" in csv = open(plik,"a"), it should append to the file. Instead, nothing happens. With just "w" it overwrites the file, but that's not what I need.

import requests
from bs4 import BeautifulSoup as bs

from datetime import datetime
def data(timedateformat='complete'):

    formatdaty = timedateformat.lower()

    if timedateformat == 'rokmscdz':
        return (str(datetime.now())).split(' ')[0]
    elif timedateformat == 'dzmscrok':
        return ((str(datetime.now())).split(' ')[0]).split('-')[2] + '-' + ((str(datetime.now())).split(' ')[0]).split('-')[1] + '-' + ((str(datetime.now())).split(' ')[0]).split('-')[0]


a = requests.get('http://www.dziennikbaltycki.pl')
b = requests.get('http://www.trojmiasto.pl')

zupa = bs(a.content, 'lxml')
zupka = bs(b.content, 'lxml')


rezultaty1 = [item['href'] for item in zupa.select(" [href*='Gdansk']")]
rezultaty2 = [item['href'] for item in zupa.select("[href*='gdansk']")]
rezultaty3 = [item['href'] for item in zupa.select("[href*='Gdańsk']")]
rezultaty4 = [item['href'] for item in zupa.select("[href*='gdańsk']")]

rezultaty5 = [item['href'] for item in zupka.select("[href*='Gdansk']")]
rezultaty6 = [item['href'] for item in zupka.select("[href*='gdansk']")]
rezultaty7 = [item['href'] for item in zupka.select("[href*='Gdańsk']")]
rezultaty8 = [item['href'] for item in zupka.select("[href*='gdańsk']")]

s = set()

plik = "dupa.csv"
csv = open(plik,"a")


for item in rezultaty1:
    s.add(item)
for item in rezultaty2:
    s.add(item)
for item in rezultaty3:
    s.add(item)
for item in rezultaty4:
    s.add(item)
for item in rezultaty5:
    s.add(item)
for item in rezultaty6:
    s.add(item)
for item in rezultaty7:
    s.add(item)
for item in rezultaty8:
    s.add(item)



for item in s:
    print('Data wpisu: ' + data('dzmscrok'))
    print('Link: ' + item)
    print('\n')
    csv.write('Data wpisu: ' + data('dzmscrok') + '\n')
    csv.write(item + '\n'+'\n')
aadams
    What do you mean by _a more efficient and shorter code_? Faster than what baseline performance? Shorter in terms of lines of code? – sentence Apr 02 '19 at 17:03

1 Answer


Ideally, to improve performance and trim the code even further, you could normalise the hrefs you parse out of the web pages by replacing all special characters with their ASCII equivalents (Replacing special characters with ASCII equivalent), so that a single search term covers every spelling.
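
For example, here is a minimal sketch of that idea using only the standard library's unicodedata module (the helper name normalize_href and the sample value are just for illustration):

import unicodedata

def normalize_href(text):
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the marks, so
    # 'Gdańsk' becomes 'Gdansk'. Lower-casing also covers the case variants.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii').lower()

print(normalize_href('/wiadomosci/Gdańsk-nowa-hala'))  # -> '/wiadomosci/gdansk-nowa-hala'

With hrefs normalised like this, a single lowercase search term is enough instead of four selector variants.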

You can avoid the repetition by looping over the Gdansk variations and merging the results into a single set. I've modified your code below and split it into several functions.

import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

def extract_links(content):
    # Return a list of hrefs that mention any variation of the city Gdansk
    variations = ['Gdansk', 'gdansk', 'Gdańsk', 'gdańsk']
    result = []
    for x in variations:
        result.extend(item['href'] for item in content.select(f"[href*='{x}']"))
    return result

def data(timedateformat='complete'):
    formatdaty = timedateformat.lower()

    if timedateformat == 'rokmscdz':
        return (str(datetime.now())).split(' ')[0]
    elif timedateformat == 'dzmscrok':
        return ((str(datetime.now())).split(' ')[0]).split('-')[2] + '-' + ((str(datetime.now())).split(' ')[0]).split('-')[1] + '-' + ((str(datetime.now())).split(' ')[0]).split('-')[0]

def get_links_from_urls(*urls):
    # Request webpages then loop over the results to
    # create a set of links that we will write to our file.
    result = []
    for rv in [requests.get(url) for url in urls]:
        zupa = bs(rv.content, 'lxml')
        result.extend(extract_links(zupa))
    return set(result)

def main():
    # Use Python's context manager to open the CSV file and write out the rows.
    plik = "dupa.csv"

    with open(plik, 'a') as f:
        for item in get_links_from_urls('http://www.dziennikbaltycki.pl', 'http://www.trojmiasto.pl'):
            print('Data wpisu: ' + data('dzmscrok'))
            print('Link: ' + item)
            print('\n')
            f.write(f'Data wpisu: {data("dzmscrok")},{item}\n')

main()
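
Regarding your second question: mode "a" does append, but the buffered writes only reach the file once it is flushed or closed, which is one likely reason nothing seemed to happen. The with block above closes (and flushes) the file automatically; without a context manager you would have to do it yourself, roughly like this:

f = open(plik, "a")  # "a" appends to the end of the file, "w" truncates it first
f.write('Data wpisu: ' + data('dzmscrok') + '\n')  # goes to an in-memory buffer first
f.close()  # the buffered writes are flushed to disk here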

Hope this helps, let me know if you have any issues in the comments.

Hevlastka