
As you can see in the code below, I have two variables defined: href, which holds the scraped links as a string, and text, which holds the links that I have already visited/downloaded from. I want Python to print the links that are present in href but not in text.

So I imagine this is done using a for loop?

When I run the code, single letters get returned, each on a separate line.

import requests
from bs4 import BeautifulSoup

url = 'amazon.com'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

for link in soup.findAll('a', {'class': 'gridItem-trackInfo-title-anchor'}):
    href = link.get('href')

    file = open('file.txt', 'r')
    text = file.read()
    file.close()

    for i in href:
        if i not in text:
            print(i)
George R
  • What have you tried, and can you post a more complete example? SO is not a code writing service, we can help you troubleshoot what you have done, but not write the code for you. – Alex Huszagh Dec 20 '15 at 18:17
  • Can you provide an example of `href`? – Iron Fist Dec 20 '15 at 18:22
  • I used Beautifulsoup to gather all the links of a certain HTML class and I stored those links in href. – George R Dec 20 '15 at 18:30
  • Can you post an example of your code? Ideally, can you create an example that is **minimal, complete, verifiable**, as designated by the guidelines here: https://stackoverflow.com/help/mcve – Alex Huszagh Dec 20 '15 at 18:32
  • Thanks for the complete example: now we can get somewhere! One quick thing I will mention: your `soup.findAll` returns no items. You probably also want `href.text` to do the `i not in text` comparison. – Alex Huszagh Dec 20 '15 at 18:43

2 Answers


If you just want the output on a single line, use print(i, end='') and you should be OK.


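If you want links, you should do:

for i in links(href):
    if i not in links(text):
        print(i)

where links is a function that extracts the links, such as the ones shown in "retrieve links from web page using python and BeautifulSoup" (http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup).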


If you want links and not letters, use:

    if link not in text:
        print(link)

instead of:

    for i in href:
        if i not in text:
            print(i)

Before, you were looping over the letters of each link.
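
To see why single letters come out: iterating over a Python string yields its characters one at a time, while applying `in` to the whole string checks for the complete substring. A minimal sketch, using made-up values for href and text (both are assumptions for illustration, not data from the question):

```python
# Hypothetical values standing in for one scraped link and the file contents.
href = "/track/abc"
text = "/track/xyz\n/track/def\n"

# Looping over a string visits each character, so this tests single letters.
missing_chars = [ch for ch in href if ch not in text]

# Testing the whole string asks whether the complete link occurs in text.
link_is_new = href not in text

print(missing_chars)  # the individual letters that never occur in text
print(link_is_new)    # whether the full link is absent from text
```

The first check is what the loop in the question does; the second is the comparison the question is actually after.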
Caridorc
  • Python thinks I'm asking which letters are not present in text, but what I'm looking for is complete links. – George R Dec 20 '15 at 18:52
  • @GeorgeR So you need to write a function to extract the links from a webpage. See: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup – Caridorc Dec 20 '15 at 18:53
  • I have the links stored in href; what I want now is to print out which links are not present in text. All that is returned are single letters. – George R Dec 20 '15 at 18:56
  • @GeorgeR updated, try the last suggestion and see if it works for you – Caridorc Dec 20 '15 at 18:59

It seems that href is a string and you are iterating over its characters. Is this code any better?

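import requests
from bs4 import BeautifulSoup

url = 'https://amazon.com'  # requests needs the scheme (http:// or https://)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

# Read the already-visited links once, before the loop
with open('file.txt', 'r') as file:
    text = file.read()

for link in soup.findAll('a', {'class': 'gridItem-trackInfo-title-anchor'}):
    href = link.get('href')

    # Compare the whole link, not its individual characters;
    # skip anchors that have no href attribute
    if href and href not in text:
        print(href)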
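
One caveat: `href not in text` is a substring test, so a link such as /track/1 would be treated as visited whenever /track/12 is in the file. A minimal sketch of an exact comparison, assuming file.txt stores one link per line (the question does not say so); the helper name `unvisited` and the sample data are hypothetical:

```python
def unvisited(links, visited_text):
    """Return links that do not appear as a full line in visited_text."""
    # Assumption: one visited link per line; blank lines are ignored.
    visited = {line.strip() for line in visited_text.splitlines() if line.strip()}
    seen = set()
    result = []
    for link in links:
        # Skip missing hrefs (None), already-visited links, and repeats.
        if link and link not in visited and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Hypothetical data standing in for the scraped hrefs and the file contents.
scraped = ["/track/1", "/track/12", None, "/track/12"]
visited_text = "/track/12\n/track/3\n"

print(unvisited(scraped, visited_text))
```

Here /track/1 is reported as new even though it is a substring of the visited /track/12, which the plain `not in text` check would miss.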
jmaz