from bs4 import BeautifulSoup
import urllib2

# Open the NYTimes homepage and read the page

response = urllib2.urlopen('http://www.nytimes.com').read()
soup = BeautifulSoup(response, 'html.parser')

data = []

# Collect all the headings on the homepage into a list

for story_heading in soup.find_all(class_="story-heading"):
        story_title = story_heading.text.replace("\n", "").strip()
        new_story_title = story_title.encode('utf-8')

        # Convert the words of each title into a list

        words = new_story_title.split()
        data.append(words)
        print data

Now I want to remove the numbers from this text. How can I do it?

Has QUIT--Anony-Mousse
Kishan Jangam
  • That may help: http://stackoverflow.com/questions/12851791/removing-numbers-from-string – alpert Apr 26 '16 at 04:55
  • Could you add some examples as well? What is this text and which numbers do you want to remove? – AKS Apr 26 '16 at 04:57
  • Be careful: there are real numbers in the original text as well as numbers coming from the Unicode encoding. Which ones do you want to remove? – tfv Apr 26 '16 at 04:59
  • I just want to append "alphabetical elements" into the list @tfv – Kishan Jangam Apr 26 '16 at 05:02
  • Here are a few of the titles. I need to take out the numbers in them, including "$2" if I can: [De Blasio to Propose $2 Billion for New York City’s Hospital System] [Trump Agrees to Interview With Megyn Kelly, Fox News Says 7:32 PM] [8 Years of Lessons Temper Obama’s Foreign Policy Goals] [Goodell Remains Firmly in Control 9:00 PM ET] [Students’ National Anthem Is Stopped at 9/11 Memorial 9:22 PM ET] – Kishan Jangam Apr 26 '16 at 05:11

2 Answers


Try this code:

clean_text = ''.join([i for i in new_story_title if not i.isdigit()])

Source: HERE

words = ''.join([i for i in new_story_title if not i.isdigit()]).split()
data.append(words)
print data

In your loop, use this in place of the line words = new_story_title.split().
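
As a rough sketch of how this character-level filter behaves on one of the sample titles from the comments (written for Python 3, which is an assumption; the helper name strip_digits and the sample string are illustrative and not part of the original code):

```python
def strip_digits(title):
    """Remove every digit character from the title, then split into words."""
    return ''.join(ch for ch in title if not ch.isdigit()).split()


sample = "Goodell Remains Firmly in Control 9:00 PM ET"
print(strip_digits(sample))
```

Only the digit characters are dropped, so punctuation that was attached to a number (the ":" in "9:00", or the "$" in "$2") stays behind as its own word.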

Community

[EDIT] Updated to take the digits out of the words:

Try this:

from bs4 import BeautifulSoup
import urllib2

# Open the NYTimes homepage and read the page

response = urllib2.urlopen('http://www.nytimes.com').read()
soup = BeautifulSoup(response, 'html.parser')

data = []

# Collect all the headings on the homepage into a list

for story_heading in soup.find_all(class_="story-heading"):
    story_title = story_heading.text.replace("\n", "").strip()
    new_story_title = story_title.encode('utf-8')

    # Convert the words of each title into a list

    words = new_story_title.split()
    data.append(words)
print data

# Strip the digit characters from every word; skip words that end up empty
clean_data = []
for title in data:
    for word in title:
        letters = [ch for ch in word if not ch.isdigit()]
        if letters:
            clean_data.append(''.join(letters))
print clean_data
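
The loop above strips digit characters but leaves fragments such as "$" or ":" behind. If the goal is to keep only the purely non-numeric words, as the comments suggest, one possible alternative is to drop every word that contains a digit. This is a minimal sketch under that assumption, written for Python 3; the name drop_numeric_words and the sample title are illustrative only:

```python
def drop_numeric_words(words):
    """Keep only the words that contain no digit characters at all."""
    return [w for w in words if not any(ch.isdigit() for ch in w)]


title = "Students National Anthem Is Stopped at 9/11 Memorial 9:22 PM ET"
print(drop_numeric_words(title.split()))
```

Note that this removes "9/11" and "9:22" as whole words but keeps trailing tokens like "PM" and "ET"; filtering those out would need an extra rule.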
tfv