Python lists and web scraping

Question

from bs4 import BeautifulSoup
import urllib2

# Open the NYTimes homepage and read the page

response = urllib2.urlopen('http://www.nytimes.com').read()
soup = BeautifulSoup(response, 'html.parser')

data = []

# Collect every heading on the homepage into a list

for story_heading in soup.find_all(class_="story-heading"):
        story_title = story_heading.text.replace("\n", "").strip()
        new_story_title = story_title.encode('utf-8')

        # Split each title into a list of words

        words = new_story_title.split()
        data.append(words)
        print data

Now I want to remove the numbers from this text. How can I do it?

This may help: http://stackoverflow.com/questions/12851791/removing-numbers-from-string – alpert, Apr 26 '16 at 04:55
Could you add some examples as well? What is this text and which numbers do you want to remove? – AKS, Apr 26 '16 at 04:57
Be careful: the original text contains real numbers as well as numbers that come from the Unicode encoding. Which ones do you want to remove? – tfv, Apr 26 '16 at 04:59
I just want to append "alphabetical elements" to the list @tfv – Kishan Jangam, Apr 26 '16 at 05:02
Here are a few of the titles. I need to take out the numbers below, including "$2" if I can: [De Blasio to Propose $2 Billion for New York City’s Hospital System] [Trump Agrees to Interview With Megyn Kelly, Fox News Says 7:32 PM] [8 Years of Lessons Temper Obama’s Foreign Policy Goals] [Goodell Remains Firmly in Control 9:00 PM ET] [Students’ National Anthem Is Stopped at 9/11 Memorial 9:22 PM ET] – Kishan Jangam, Apr 26 '16 at 05:11

score 0 · Accepted Answer · edited May 23 '17 at 10:28


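Try this code. It filters the digit characters out of a string:

clean_text = ''.join([i for i in new_story_title if not i.isdigit()])

Source: HERE

Apply it to the title string before splitting it. Calling it on data, which is a list of lists, raises an AttributeError because lists have no isdigit method:

words = ''.join([i for i in new_story_title if not i.isdigit()]).split()
data.append(words)
print data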
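
If the goal is to drop whole tokens that contain a digit (so that "$2", "7:32" and "9/11" disappear entirely rather than leaving "$", ":" and "/" behind), one option is to filter word by word. This is only a minimal sketch, assuming Python 3 and a few of the sample titles from the comments (typed here with plain apostrophes); drop_numeric_tokens is a hypothetical helper name, not something from the original post.

```python
def drop_numeric_tokens(title):
    """Return the words of `title` that contain no digit characters."""
    return [w for w in title.split() if not any(ch.isdigit() for ch in w)]


# Sample titles taken from the question's comments (assumed input)
titles = [
    "De Blasio to Propose $2 Billion for New York City's Hospital System",
    "Trump Agrees to Interview With Megyn Kelly, Fox News Says 7:32 PM",
    "Students' National Anthem Is Stopped at 9/11 Memorial 9:22 PM ET",
]

# One list of words per title, with digit-bearing tokens removed
data = [drop_numeric_tokens(t) for t in titles]
for words in data:
    print(words)
```

Note that this also drops any word that merely contains a digit, which may or may not be what you want.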

answered Apr 26 '16 at 04:57 by Nguyễn Việt Hưng

I tried it, but it says: AttributeError: 'list' object has no attribute 'isdigit' – Kishan Jangam Apr 26 '16 at 05:00
Is there any way I can append everything into a single list rather than a nested list? @tfv – Kishan Jangam Apr 26 '16 at 05:08
@Nguyễn Việt Hưng It worked. Instead of a list, your code processes the plain text. Thank you for the idea. – Kishan Jangam Apr 26 '16 at 05:23
@KishanJangam you're welcome – Nguyễn Việt Hưng Apr 26 '16 at 05:29
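
On the single-list question from the comments: append adds each title's word list as one nested element, while extend adds the words themselves. A rough sketch of two ways to get a flat list, assuming Python 3; the nested sample data here is made up for illustration.

```python
from itertools import chain

# Hypothetical nested result, one inner list per title
nested = [["Goodell", "Remains"], ["Firmly", "in", "Control"]]

# Option 1: flatten an existing nested list
flat = list(chain.from_iterable(nested))

# Option 2: build the flat list directly by using extend instead of append
flat_via_extend = []
for words in nested:
    flat_via_extend.extend(words)

print(flat)
print(flat_via_extend)
```

In the scraping loop, that would mean calling data.extend(words) where the question calls data.append(words).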

score 0 · Answer 2 · answered Apr 26 '16 at 05:17

[EDIT] Updated for taking out digits in words:

Try this:

from bs4 import BeautifulSoup
import urllib2

# Open the NYTimes homepage and read the page

response = urllib2.urlopen('http://www.nytimes.com').read()
soup = BeautifulSoup(response, 'html.parser')

data = []

# Collect every heading on the homepage into a list

for story_heading in soup.find_all(class_="story-heading"):
    story_title = story_heading.text.replace("\n", "").strip()
    new_story_title = story_title.encode('utf-8')

    # Split each title into a list of words

    words = new_story_title.split()
    data.append(words)
print data

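# Build one flat list of words with the digit characters removed
clean_data = []
for title_words in data:
    for word in title_words:
        # Keep only the non-digit characters of each word
        stripped = ''.join(ch for ch in word if not ch.isdigit())
        # Skip words that consisted only of digits (they would be empty strings)
        if stripped:
            clean_data.append(stripped)
print clean_data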
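
Stripping digits character by character still leaves the punctuation that was attached to them ("$" from "$2", ":" from "7:32"). If only the "alphabetical elements" are wanted, a regular expression that matches runs of letters might be closer. This is just a sketch, assuming Python 3 and Unicode text (no encode('utf-8') step); letters_only and the pattern are my own illustration, not part of the answer above.

```python
import re

# Runs of letters, optionally joined by a straight or curly apostrophe.
# [^\W\d_] means "a word character that is not a digit or underscore".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def letters_only(title):
    """Return the alphabetic words found in `title`, in order."""
    return WORD_RE.findall(title)


print(letters_only("8 Years of Lessons Temper Obama’s Foreign Policy Goals"))
print(letters_only("Goodell Remains Firmly in Control 9:00 PM ET"))
```

Trailing punctuation such as the comma in "Kelly," is dropped as a side effect, so check that this suits your data before relying on it.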
