Finding text not present in a string

Question

As you can see, I have two variables: href, which holds multiple links as one string, and text, which holds the links I have already visited/downloaded from. I want Python to print the links that are present in href but not in text.

So I imagine it's using a for loop?

When I run the code, single letters are returned, each on a separate line.

import requests
from bs4 import BeautifulSoup

url = 'amazon.com'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

for link in soup.findAll('a', {'class': 'gridItem-trackInfo-title-anchor'}):
    href = link.get('href')

    file = open('file.txt', 'r')
    text = file.read()
    file.close

    for i in href:
        if i not in text:
            print(i)

What have you tried, and can you post a more complete example? SO is not a code-writing service; we can help you troubleshoot what you have done, but not write the code for you. – Alex Huszagh, Dec 20 '15 at 18:17
I used BeautifulSoup to gather all the links of a certain HTML class and I stored those links in href. – George R, Dec 20 '15 at 18:30
Can you post an example of your code? Ideally, can you create an example that is **minimal, complete, verifiable**, as designated by the guidelines here: https://stackoverflow.com/help/mcve – Alex Huszagh, Dec 20 '15 at 18:32
Thanks for the complete example: now we can get somewhere! One quick thing I will mention: your `soup.findAll` returns no items. You probably also want `href.text` to do the `i not in text` comparison. – Alex Huszagh, Dec 20 '15 at 18:43

score 1 · Answer 1 · edited May 23 '17 at 12:23


If you just want the output on a single line, use print(i, end='') and you should be OK.

If you want links, you should do:

for i in links(href):
    if i not in links(text):
        print(i)

where links is a function that extracts the links from its input; one such function can be found in the question "Retrieve links from web page using Python and BeautifulSoup".

If you want links and not letters, test the whole href string:

    if href not in text:
        print(href)

instead of:

for i in href:
    if i not in text:
        print(i)

Before, you were looping over the letters of each link.
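To illustrate why single letters were printed, here is a minimal sketch. The values of href and text are made-up stand-ins for one scraped link and the contents of file.txt, not data from the question.

```python
# Hypothetical sample values: one scraped link and the file contents.
href = "/track/new"
text = "/track/abc\n/track/xyz\n"

# Iterating over a string yields one character at a time.
chars = [ch for ch in href]

# Checking each character separately reports only the characters
# that never occur anywhere in text.
missing_chars = [ch for ch in href if ch not in text]

# Testing the whole string asks whether the complete link occurs in text.
link_is_new = href not in text

print(chars[:3])      # ['/', 't', 'r']
print(missing_chars)  # ['n', 'e', 'w']
print(link_is_new)    # True
```

The character loop never compares the link as a whole, which is why the whole-string test is the one to use here.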

answered Dec 20 '15 at 18:45 by Caridorc

Python thinks I'm asking which letters are not present in text, but what I'm looking for is complete links – George R Dec 20 '15 at 18:52
@GeorgeR If so, you need to write a function to extract the links from a webpage. See: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup – Caridorc Dec 20 '15 at 18:53
I have the links stored in href. What I want is a printout of which links are not present in text, but all that is returned are single letters – George R Dec 20 '15 at 18:56
@GeorgeR updated, try the last suggestion and see if it works for you – Caridorc Dec 20 '15 at 18:59

score 0 · Accepted Answer · answered Dec 20 '15 at 18:57

It seems that href is a string and you are iterating over it, which yields its individual characters. Is this code any better?

import requests
from bs4 import BeautifulSoup

url = 'https://amazon.com'  # requests.get needs a full URL, including the scheme
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

# Read the already-visited links once, before the loop
with open('file.txt', 'r') as file:
    text = file.read()

for link in soup.findAll('a', {'class': 'gridItem-trackInfo-title-anchor'}):
    href = link.get('href')

    if href not in text:
        print(href)
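One thing to watch with `href not in text` is that it is a substring test, so a link that happens to be a prefix of a visited link counts as visited. The sketch below shows a possible exact comparison instead. It assumes file.txt holds one link per line, which the question does not state, and the names hrefs and visited_text are made-up stand-ins for the scraped links and the file contents.

```python
# Hypothetical stand-ins: in the real script, hrefs would be collected from
# link.get('href') in the loop and visited_text would come from file.read().
hrefs = ["/track/1", "/track/10", "/track/2"]
visited_text = "/track/10\n/track/2\n"

# Substring test, as in the code above: "/track/1" is found inside "/track/10".
new_by_substring = [h for h in hrefs if h not in visited_text]

# Exact test (assumes one link per line): compare whole links against a set.
visited = set(visited_text.split())
new_by_exact = [h for h in hrefs if h not in visited]

print(new_by_substring)  # []
print(new_by_exact)      # ['/track/1']
```

If the file uses a different layout, the way visited is built would need to change accordingly.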
