All,
I have some text that I need to clean up and I have a little algorithm that "mostly" works.
    def removeStopwords(self, data):
        with open(r'stopwords.txt') as stopwords:
            wordList = []
            for i in stopwords:
                wordList.append(i.strip())
        charList = list(data)
        cat = ''.join(char for char in charList if not char in wordList).split()
        return ' '.join(cat)
Take the first paragraph of this page, http://en.wikipedia.org/wiki/Paragraph, and remove all the characters that we are not interested in, which in this case are all the non-alphanumeric characters.
A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.
The output looks pretty good except that some of the words are recombined incorrectly and I am unsure how to correct it.
A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit
Note that the word "self-contained" has become "selfcontained".
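One way to keep "self-contained" from collapsing into a single word is to replace each unwanted character with a space instead of deleting it, and then normalize the whitespace. The following is only a minimal sketch in Python 3, not the code from the question: the function name `strip_punctuation` and the `keep_word_breaks` parameter are made up for illustration, and it assumes the characters to remove are those in `string.punctuation` rather than the contents of stopwords.txt.

```python
import string


def strip_punctuation(text, keep_word_breaks=True):
    """Remove ASCII punctuation from text.

    With keep_word_breaks=True each punctuation character becomes a space,
    so the words on either side of it stay separate. With False the
    character is simply deleted, which joins its neighbours together.
    """
    replacement = ' ' if keep_word_breaks else ''
    cleaned = ''.join(
        replacement if ch in string.punctuation else ch for ch in text
    )
    # split() followed by join() collapses runs of whitespace to one space.
    return ' '.join(cleaned.split())


sample = 'is a self-contained unit of a discourse'
print(strip_punctuation(sample))
print(strip_punctuation(sample, keep_word_breaks=False))
```

The first call keeps "self" and "contained" as two words, while the second reproduces the merged "selfcontained" seen in the output above.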
EDIT: Contents of the stopwords file, which is just a bunch of characters:
! $ % ^ , & * ( ) { } [ ] <
, . / | \ ? ~ ` : ; "
Turns out I don't need a list of words at all, because I was only really trying to remove characters, which in this case were punctuation marks.
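    import string  # needed for string.punctuation

    # Python 2: str.translate(None, chars) deletes every character in chars
    cat = data.translate(None, string.punctuation).split()
    print ' '.join(cat).lower()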
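The `translate(None, ...)` form above only works on Python 2 byte strings. For anyone on Python 3, a rough equivalent is sketched below, assuming the goal is the same as in the snippet above (delete ASCII punctuation, collapse whitespace, lowercase). The names `DELETE_PUNCTUATION` and `clean` are made up for this example.

```python
import string

# Translation table that maps every character in string.punctuation to None,
# which tells str.translate to delete it.
DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)


def clean(data):
    """Delete ASCII punctuation, collapse whitespace, and lowercase."""
    words = data.translate(DELETE_PUNCTUATION).split()
    return ' '.join(words).lower()


print(clean('A paragraph (from the Greek paragraphos, "to write beside")'))
# prints: a paragraph from the greek paragraphos to write beside
```

Note that `string.punctuation` covers ASCII punctuation only, so a character such as the pilcrow (¶) is left in place, and because the hyphen is deleted rather than replaced, "self-contained" still comes out as "selfcontained".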