
I'm trying to strip a web page down to only the content I want, discarding everything else.

  <li tabindex="0">
    Facebook.

  </li>
  <li tabindex="0">
    Twitter.

  </li>
  <li tabindex="0">
    Pinterest.

  </li>
  <li tabindex="0">
    Instagram.

  </li>
  <li tabindex="0">
    Enter to Win.

  </li>

That's part of what I'm trying to strip out. The page is a store ad, and I want to remove the parts I don't need so that only the actual ad items remain.

Some very strange things are happening. I've worked around a couple of them, but I still can't get rid of the '\n' characters no matter what I try.

a = re.findall('<li tabindex(.*?)</li>', html, re.DOTALL)
for x in range(0, len(a)):
    a[x] = a[x].replace('="0">', '')
    a[x] = a[x].replace('Enter to Win.', 'REMOVE')
    a[x] = a[x].replace('Pinterest.\n    \n', 'REMOVE')
    a[x] = a[x].replace('Twitter.\n    \n', 'REMOVE')
    a[x] = a[x].replace('Instagram.\n    \n', 'REMOVE')
    a[x] = a[x].replace('Facebook.\n    \n', 'REMOVE')

When I have the full downloaded webpage, notice that I have to pull off the 'li tabindex' in rather strange fashion or it won't split the separate entries apart like it normally would; `print(a)` comes up completely empty otherwise. It's just a quick workaround I figured out for splitting the separate entries apart.

Right now I'm trying to remove the '\n' and I can't get them to remove no matter what I try.

a[x] = a[x].replace('\n', '') # doesn't work
a[x] = a[x].replace('\n\n', '') # doesn't work
a[x] = a[x].replace('\r\n', '') # doesn't work
a[x] = a[x].replace('%s\n', '') # doesn't work
a[x] = a[x].replace('%s\r\n', '') # doesn't work
a[x] = a[x].rstrip('\r\n') # doesn't work
a[x] = a[x].strip('\r\n') #doesn't work

I've tried everything I've found online so far and nothing lets me remove the \n. I can remove the ' ' between the \n's, but I can't remove the \n's themselves.

What do I have to do to remove the '\n', and, maybe just as importantly, why am I having trouble doing the standard line separation on 'li tabindex'? Something gives me the feeling the two problems may have one and the same cause. I've never had this kind of problem before.

Update, original code I've started with:

import os
import re
from urllib.request import urlopen
from urllib.error import HTTPError
import urllib.request 

plot = 'https://circulars.save-a-lot.com/flyers/accessibility/savealot?locale=en-US&store_code=24607&type=2'
htm = urlopen(plot).read()
html = str(htm)

a = re.findall("<li tabindex(.*?)</li>", html, re.DOTALL)
for x in range(0, len(a)):
    a[x] = a[x].replace('="0">', '')
    a[x] = a[x].replace('  ', '')

    b = ''
    for c in range(2,int(len(a[x])-2)):
        if a[x][c] == '\n':
            continue
        else:
            b = b + a[x][c]
    a[x] = b
    a[x] = a[x].replace('Flipp.', 'REMOVE')
    a[x] = a[x].replace('Instagram.', 'REMOVE')
    a[x] = a[x].replace('Facebook.', 'REMOVE')
    #etc removing what I don't want to keep
    if a[x] == 'REMOVE':
        continue
    else:
        pass  # write file to disk
confused
  • Have you tried `rstrip()`? Like, `a[x].rstrip()` – RPT May 22 '17 at 19:27
  • Both rstrip() and combinations of rstrip('\n'), etc. They don't remove it at all. Before anyone asks... https://circulars.save-a-lot.com/flyers/accessibility/savealot?locale=en-US&store_code=24607&type=2 is one of the webpages I'm working with, trying to pull off the weekly sales. – confused May 22 '17 at 19:30

2 Answers

import bs4, requests
sales_list = []
sales_list_stripped = []
url = ('https://circulars.save-a-lot.com/flyers/accessibility/savealot'
       '?locale=en-US&store_code=24607&type=2')
# split the url across two lines with implicit string concatenation like this
# (a trailing '\' also works, but any spaces before it end up inside the url),
# or just put it all on one line

html = requests.get(url)
html_soup = bs4.BeautifulSoup(html.text, 'lxml')
filtered_html = html_soup.select('li')

for x in filtered_html:  #pulls text from within 'li' tags
    sales_list.append(x.getText())

for x in sales_list:   #removes \n character
    sales_list_stripped.append(x.replace('\n', ''))

print(sales_list_stripped[:8]) #test code

This code got me a list with an output like this [' Weekly Ad ', ' Other 70 items ', ' Banquet Pot Pies. $0.69 ea. 7 oz, Assorted Varieties ', ' Save-A-Lot® Soda 12 Pack. 2/ $5.00 . 12 oz cans, Assorted Varieties, ', ' J.Higgs Snacks. $3.99 ea. 16 ct, Classic or Flavor Mix ', ' Mondo Fruit Squeezers. $0.99 ea. 40.5 oz, Assorted Varieties ', ' Kiggins Frosty Flakes, Fruity Ringers or CrocO Crunch Cereal. $2.79 ea. 28 oz ', ' Kiggins Toaster Tarts. $1.99 ea. 22 oz, Assorted Varieties ', ' Nature Trails Granola Bars. $1.79 ea. 8.4 oz, Assorted Varieties ', ' True Fruit Cups. 10/ $10.00 . 7 oz, Assorted Varieties ']

I'm not a big fan of `.findall()`, as `select()` is the method preferred by the bs4 documentation. Hope this helps.
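As a possible follow-up, since the question wants the social-media entries gone: once you have the stripped list, a plain list comprehension can filter them out. This is just a sketch; the label set is taken from the question's HTML snippet and may differ on the live page (the sample list below is made up to mimic the output shown above):

```python
# Labels to drop, taken from the question's snippet (adjust for the live page)
unwanted = {'Facebook.', 'Twitter.', 'Pinterest.', 'Instagram.', 'Enter to Win.'}

# Sample data shaped like the scraped output (entries keep surrounding spaces)
sales_list_stripped = [' Facebook. ', ' Banquet Pot Pies. $0.69 ea. ', ' Enter to Win. ']

# strip() each item before comparing so surrounding whitespace doesn't matter
kept = [item for item in sales_list_stripped if item.strip() not in unwanted]
print(kept)  # [' Banquet Pot Pies. $0.69 ea. ']
```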

Matthew Barlowe
  • The trouble is that even when I add in the extra `)` at the end of the code (you missed the `)`), I still end up getting a "'NoneType' object is not callable" error. – confused May 23 '17 at 19:39
  • Hmmm, it worked fine for me. Do you have the rest of your code to look at? – Matthew Barlowe May 23 '17 at 19:47
  • I was just trying your code to see if I could get it to work for me, and other than the NoneType error it seemed like you might be onto something. From some experimenting, it almost seems like when I download the code it brings the \n across as '\' and 'n' rather than '\n'. When I have it count the characters in a given line, it shows '\n' as 2 characters, not 1. I'm still trying to research how to get rid of the darn special-case characters; I haven't had that one to deal with before. – confused May 23 '17 at 19:55
  • I just updated the code to show what I have been trying. I also just tried re.sub(r'[^a-zA-Z0-9]', '', s) and it removed the '\' only and not the whole '\n', so I think what I said in the previous comment is what is actually occurring. Still haven't figured out how to remove the '\' though, other than the re.sub, and it's leaving the 'n' behind. – confused May 23 '17 at 20:41
  • What modules are you importing? – Matthew Barlowe May 23 '17 at 20:43
  • Ok, I've just updated the code I wrote for you. I left out an important step where you pull the text from between the tags with the `.getText()` instance, which is what threw the NoneType exception. Also, I see you are trying to use regexes to parse HTML; I would advise against that, and here's a [link](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) explaining why that isn't good. As with my code, I'd suggest getting to know the Beautiful Soup module for parsing HTML; it's 10x easier. I like this [site](https://automatetheboringstuff.com/chapter11/) – Matthew Barlowe May 23 '17 at 22:47
  • Thanks, that does work now. I think this is the first time I've run into 'oddball' HTML where regex wouldn't work, hence why I've always used it. One of those cases where you start doing something one way and keep doing it that way until it no longer works. – confused May 23 '17 at 23:06
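For anyone hitting the same wall: the two-character '\n' behaviour described in the comments above comes from calling `str()` on the bytes returned by `urlopen(...).read()`. `str()` on a bytes object produces its repr, so every newline becomes a literal backslash followed by 'n', and `replace('\n', '')` finds nothing to replace. A minimal sketch (no network access needed):

```python
# str() on bytes gives the repr, with escape sequences as literal characters
raw = b"Facebook.\n"
as_str = str(raw)           # "b'Facebook.\\n'" -- note the b'...' wrapper too
print('\n' in as_str)       # False: no real newline in here
print('\\n' in as_str)      # True: a literal backslash followed by 'n'

# decode() gives real text, so the usual replacements work as expected
decoded = raw.decode('utf-8')
print('\n' in decoded)      # True
print(decoded.replace('\n', ''))  # Facebook.
```

So in the question's code, `html = htm.decode('utf-8')` instead of `html = str(htm)` should make both the newline removal and the normal line separation behave again.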

Why are you going through all this trouble to get rid of individual characters? Just let regex do all the dirty work for you in one fell swoop:

data = re.findall(r"<li tabindex.*?>\s+(.*?)\.?\s+.*?</li>", content)
# ['Facebook', 'Twitter', 'Pinterest', 'Instagram', 'Enter to Win']

This even gives a little bit of flexibility with surrounding spaces and the trailing dot after the content.
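To see it in action, here is the pattern run against the snippet from the question (note that `content` is assumed to be properly decoded text, not the repr of a bytes object, so `\s` can match real newlines):

```python
import re

# Sample taken from the question's HTML snippet
content = """
  <li tabindex="0">
    Facebook.

  </li>
  <li tabindex="0">
    Enter to Win.

  </li>
"""

# \s+ eats the newlines and indentation around the text; \.? drops the trailing dot
data = re.findall(r"<li tabindex.*?>\s+(.*?)\.?\s+.*?</li>", content)
print(data)  # ['Facebook', 'Enter to Win']
```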

zwer