I'm trying to strip a web page down to only the content I want and discard everything else.
<li tabindex="0">
Facebook.
</li>
<li tabindex="0">
Twitter.
</li>
<li tabindex="0">
Pinterest.
</li>
<li tabindex="0">
Instagram.
</li>
<li tabindex="0">
Enter to Win.
</li>
That's part of what I'm trying to strip out. Basically, it's a store ad, and I want to strip off the stuff I don't want and be left with whatever remains of the ad.
I'm seeing some very strange behavior. I've worked around a couple of the problems, but I still can't get rid of the '\n' characters no matter what I try.
a = re.findall('<li tabindex(.*?)</li>', html, re.DOTALL)
for x in range(0, len(a)):
    a[x] = a[x].replace('="0">', '')
    a[x] = a[x].replace('Enter to Win.', 'REMOVE')
    a[x] = a[x].replace('Pinterest.\n \n', 'REMOVE')
    a[x] = a[x].replace('Twitter.\n \n', 'REMOVE')
    a[x] = a[x].replace('Instagram.\n \n', 'REMOVE')
    a[x] = a[x].replace('Facebook.\n \n', 'REMOVE')
When I have the full downloaded webpage in 'a', you'll notice I have to pull off the 'li tabindex' in a rather strange fashion, or it won't split apart the separate entries like it normally would; print(a) comes up completely empty. It's just a quick workaround I figured out for splitting the separate entries apart.
Right now I'm trying to remove the '\n' characters, and nothing I try removes them.
a[x] = a[x].replace('\n', '') # doesn't work
a[x] = a[x].replace('\n\n', '') # doesn't work
a[x] = a[x].replace('\r\n', '') # doesn't work
a[x] = a[x].replace('%s\n', '') # doesn't work
a[x] = a[x].replace('%s\r\n', '') # doesn't work
a[x] = a[x].rstrip('\r\n') # doesn't work
a[x] = a[x].strip('\r\n') # doesn't work
I've tried everything I've found online so far, and nothing lets me remove the \n. I can remove the ' ' between the \n's, but I can't remove the \n's themselves.
What do I have to do to remove the '\n'? And, maybe just as importantly, why am I having trouble doing the standard split on 'li tabindex'? I have a feeling both problems share the same cause. I've never had this kind of problem before.
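A quick way to check what the strings really contain is to compare a real newline character with the two-character sequence backslash + n. The sketch below uses made-up sample strings (not taken from the actual page) and a hypothetical helper named describe; it also shows that calling str() on a bytes object yields the literal form, which replace('\n', '') will not match.

```python
# Hypothetical sample values, not taken from the real page.
real_newline = "Facebook.\n"    # ends with one newline character
literal_seq = "Facebook.\\n"    # ends with a backslash followed by the letter n


def describe(text):
    """Report which kind of 'newline' a string contains."""
    return {
        "real_newline": "\n" in text,
        "literal_backslash_n": "\\n" in text,
    }


print(repr(real_newline), describe(real_newline))
print(repr(literal_seq), describe(literal_seq))

# str() on a bytes object returns its repr, so the newline becomes
# the two characters backslash and n.
as_text = str(b"Facebook.\n")
print(as_text)
print(as_text.replace("\n", "") == as_text)  # nothing is removed
```

If print(repr(a[x])) shows a doubled backslash ('\\n'), the string holds the literal sequence rather than a newline.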
Update: here is the original code I started with:
import os
import re
from urllib.request import urlopen
from urllib.error import HTTPError
import urllib.request

plot = 'https://circulars.save-a-lot.com/flyers/accessibility/savealot?locale=en-US&store_code=24607&type=2'
htm = urlopen(plot).read()
html = str(htm)
a = re.findall("<li tabindex(.*?)</li>", html, re.DOTALL)
for x in range(0, len(a)):
    a[x] = a[x].replace('="0">', '')
    a[x] = a[x].replace(' ', '')
    # rebuild the entry character by character, skipping the first and last two characters
    b = ''
    for c in range(2, int(len(a[x]) - 2)):
        if a[x][c] == '\n':
            continue
        else:
            b = b + a[x][c]
    a[x] = b
    a[x] = a[x].replace('Flipp.', 'REMOVE')
    a[x] = a[x].replace('Instagram.', 'REMOVE')
    a[x] = a[x].replace('Facebook.', 'REMOVE')
    # etc. removing what I don't want to keep
    if a[x] == 'REMOVE':
        continue
    else:
        pass  # write file to disk
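For comparison, here is a minimal sketch of the same extraction with the downloaded bytes decoded rather than passed through str(). urlopen(...).read() returns bytes, and str() on bytes produces text like b'...\n...' in which each newline is a backslash followed by n. Assumptions in this sketch: the page is UTF-8, the bytes sample stands in for the real download, and "Weekly Ad item." is an invented entry used only to show something surviving the filter.

```python
import re

# Stand-in for urlopen(plot).read(); a small bytes sample shaped like the page.
# "Weekly Ad item." is a made-up entry for illustration.
htm = (
    b'<li tabindex="0">\n  Facebook.\n</li>\n'
    b'<li tabindex="0">\n  Weekly Ad item.\n</li>\n'
)

# Decode instead of calling str(); UTF-8 is an assumption about the page.
html = htm.decode("utf-8")

# Entries to discard, taken from the list items shown in the question.
unwanted = {"Facebook.", "Twitter.", "Pinterest.", "Instagram.", "Enter to Win."}

# Capture the text of each list item and trim surrounding whitespace,
# including real newline characters.
items = [
    match.strip()
    for match in re.findall(r'<li tabindex="0">(.*?)</li>', html, re.DOTALL)
]
kept = [item for item in items if item not in unwanted]

print(kept)  # ['Weekly Ad item.']
```

With decoded text, strip() and replace('\n', '') behave as expected, and the pattern can match the full opening tag instead of leaving '="0">' to be removed afterwards.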