Text Cleaning python

Question

I wrote some code that pulls the text from a page and then searches for sentences containing given keywords. I am getting the output below:

['& ldquo ; it & rsquo ; s been cited by a number of market watcher where the real value of cloud is , and it & rsquo ; s moving up the stack .', '& ldquo ; we & rsquo ; re not letting go of our system space , but i think we & rsquo ; re being more specific about which bit fit with which part of where the growth is going , and each element within ibm need to justify it position a we go forward & ndash ; and i think that wa the background behind the lenovo announcement .& rdquo ; this resonates sonorously with what rometty wrote in her annual letter , telling shareholder that the big challenge for this year would be & ldquo ; shifting the ibm hardware business for new reality and opportunity .& rdquo]

I don't know what these rsquo and ldquo codes are, but they are breaking the text. Below is my code:

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent','Chrome')]
html = br.open(url).read()
titles = br.title()
readable_article= Document(html).summary()
readable_title = Document(html).short_title()
soup = bs4.BeautifulSoup(readable_article)
Final_Article = soup.text
final.append(titles)
final.append(url)
final.append(Final_Article)
raw = nltk.clean_html(html)
tokens = nltk.wordpunct_tokenize(raw)
lmtzr = WordNetLemmatizer()
t = [lmtzr.lemmatize(t) for t in tokens]
text = nltk.Text(t)
word = words(n)
find = ' '.join(str(e) for e in word)
search_words = set(find.split(' '))
sents = ' '.join([s.lower() for s in text])
blob = TextBlob(sents.decode('ascii','ignore'))
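# For every sentence that contains a search word, keep it together with
# the previous and next sentence. max() stops i-1 from going negative,
# which would otherwise produce an empty slice for the first sentence.
matches = [map(str, blob.sentences[max(i-1, 0):i+2])
           for i, s in enumerate(blob.sentences)  # i is index, s is element
           if search_words & set(s.words)]
print matches, word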

Your code is invalid; the `:` colons are missing and the indentation was all over the place. – Martijn Pieters, Jul 21 '14 at 17:09
You need to unescape your HTML; this [answer](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) covers it – sirlark, Jul 21 '14 at 17:12

Adam Yost · Accepted Answer · 2014-07-21T17:33:25.170

3

`&ldquo;` and `&rdquo;` are the HTML entity codes for opening and closing double quotes. `&rsquo;` and `&lsquo;` are single quotes (used in this text as apostrophes), and `&ndash;` is a dash. If those patterns are present in your source text, use the following to replace them.

import re
cleaned = re.sub(r'& ?(ld|rd)quo ?[;\]]', '\"', raw)
cleaned = re.sub(r'& ?(ls|rs)quo ?;', '\'', cleaned)
cleaned = re.sub(r'& ?ndash ?;', '-', cleaned)

This replaces each of these codes (with or without spaces) in your raw text (which I called `raw`) with a plain quotation mark, apostrophe, or hyphen, and saves the result to a new variable called `cleaned`. Passing `cleaned` through the rest of your code should work.
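As a more general alternative to the case-by-case substitutions above, the entities could be decoded with the standard library. The following is a minimal sketch, assuming Python 3 (where `html.unescape` and `html.entities.html5` are available; the code in the question is Python 2) and assuming the entities arrive split into pieces such as `& ldquo ;`. The helper name `unescape_spaced_entities` is made up for illustration.

```python
import re
from html import unescape
from html.entities import html5  # maps names such as 'ldquo;' to characters

# '&', optional spaces, a named or numeric reference, optional spaces, ';'
_SPACED_ENTITY = re.compile(
    r'&\s*(#\s*(?:[xX][0-9A-Fa-f]+|[0-9]+)|[A-Za-z][A-Za-z0-9]*)\s*;'
)


def unescape_spaced_entities(text):
    """Decode HTML entities even when spaces were inserted inside them.

    Hypothetical helper: anything that is not a known entity is left
    exactly as it was, so ordinary text like 'AT & T ;' is not altered.
    """
    def _replace(match):
        # Drop any spaces inside the reference, e.g. '# 8220' -> '#8220'.
        name = re.sub(r'\s+', '', match.group(1))
        if name.startswith('#') or name + ';' in html5:
            return unescape('&' + name + ';')
        return match.group(0)

    return _SPACED_ENTITY.sub(_replace, text)


sample = "& ldquo ; it & rsquo ; s moving up the stack . & rdquo ; a & ndash ; b"
cleaned = unescape_spaced_entities(sample)
```

Note that this produces the typographic characters themselves (curly quotes and an en dash) rather than the ASCII `"`, `'` and `-` used in the regex version, so a later `decode('ascii', 'ignore')` style step would discard them. Decoding the entities before tokenizing, if that is possible in your pipeline, would avoid the spaced-out form altogether.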

edited Jul 21 '14 at 17:33

answered Jul 21 '14 at 17:08


Thanks a mil. It worked like magic :) but I still have rdquo. When I add that to the regex, it takes away part of the text :( – Raghav Shaligram Jul 21 '14 at 17:15
Edited to include ls, ld, rs, rd quotes – Adam Yost Jul 21 '14 at 17:23
It drops the last line of the text. Without the rd in the regex, the output differs – Raghav Shaligram Jul 21 '14 at 17:27
Looking at it closer, you have several html escapes in your text. You should unescape the html, rather than a case by case replace as I have suggested. – Adam Yost Jul 21 '14 at 17:30
I used `str(matches).replace('& rdquo','').replace('& rsquo','')` on the final output and will iterate based on the results. Thanks again – Raghav Shaligram Jul 21 '14 at 17:39
