
Hi y'all. I've been trying to remove stopwords from a list that a PDF has been read into, but whenever I use NLTK to remove those stopwords from the list (or from a new list), the TXT file I write out still contains the original list. I made a separate program just to test whether the stopwords function even works, and it works fine there, but for some reason not in this case.

Is there also a better method to do this? Any help would be much appreciated.

import PyPDF2 as pdf

import nltk
from nltk.corpus import stopwords

stopping_words = set(stopwords.words('english'))

stop_words = list(stopping_words)

# open the PDF file for reading in binary mode
file = open("C:\\Users\\Name\\Documents\\Data Analytics Club\\SampleBook-English2-Reading.pdf", "rb")

# creating a pdf reader object
fileReader = pdf.PdfFileReader(file)

# collect the extracted text of each page
textData = []

for pages in fileReader.pages:
    theText = pages.extractText()

    # for char in theText:
    #   theText.replace(char, "\n")

    textData.append(theText)

final_list = []

for i in textData:
    if i in stopwords.words('english'):
        textData.remove(i)
    final_list.append(i.strip('\n'))

# filtered_word_list = final_list[:] #make a copy of the word_list

# for word in final_list: # iterate over word_list
#   if word in stopwords.words('english'):
#       final_list.remove(word) # remove word from filtered_word_list if it is a stopword

# filtered_words = [word for word in final_list if word not in stop_words]

# [s.strip('\n') for s in theText]
# [s.replace('\n', '') for s in theText]


# text_data = []

# for elem in textData:
#         text_data.extend(elem.strip().split('n'))  

# for line in textData:
#     textData.append(line.strip().split('\n'))
#--------------------------------------------------------------------

import os.path

save_path = "C:\\Users\\Name\\Documents\\Data Analytics Club"

name_of_file = input("What is the name of the file: ")

completeName = os.path.join(save_path, name_of_file + ".txt")   

file1 = open(completeName, "w")

# file1.write(str(final_list))

for line in final_list:
    file1.write(line)

file1.close()

1 Answer


The problem is in these lines:

if i in stopwords.words('english'):
    textData.remove(i)

You are only removing a single occurrence of that word. As the Python documentation for list.remove() explains, it deletes only the first occurrence of the value.
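For example:

words = ["the", "cat", "the", "dog"]
words.remove("the")
print(words)   # ['cat', 'the', 'dog'] -- only the first "the" is gone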

What you probably want to do instead is:

Python 2

textData = filter(lambda x: x != i, textData)

Python 3

textData = list(filter(lambda x: x != i, textData))
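A list comprehension does the same thing and works unchanged in both versions; note that the filtered result has to be assigned back, otherwise the original list is left untouched:

textData = [x for x in textData if x != i]   # keep everything except the current stopword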

EDIT

So I realized quite a bit late that you are actually iterating over the list that you are removing elements from, which you do not want to do: removing an element shifts the remaining items one position to the left, so the loop silently skips the element that moves into the freed slot.
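A quick illustration of the skipping behaviour:

data = ["a", "a", "b", "a"]
for item in data:
    if item == "a":
        data.remove(item)   # shifts the rest of the list to the left
print(data)   # ['b', 'a'] -- one "a" survives because the loop skipped over it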

Instead, what you would want to do is:

for i in set(textData):
    if i in stopwords.words('english'):
        pass
    else:
        final_list.append(i.strip('\n'))
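For completeness, here is a minimal sketch of the whole filtering step, assuming textData holds one string per page as in your extraction loop: split each page into words before checking them against the stopword set, because a whole page string will never equal a single stopword.

stop_words = set(stopwords.words('english'))   # set membership tests are much faster than a list

final_list = []
for page_text in textData:
    for word in page_text.split():             # break the page into whitespace-separated words
        if word.lower() not in stop_words:     # the NLTK stopwords are lowercase
            final_list.append(word)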

EDIT 2

So apparently the issue comes from the page-extraction loop in your code, which needs to be changed to:

for pages in fileReader.pages:
    theText = pages.extractText()
    words = theText.splitlines()   # split the extracted page text into lines
    textData.extend(words)         # store the lines rather than the whole page string

However, for the file I tested this against, it still gave issues with spacing and merged words in the same sentence. It gave me words such as 'sameuserwithinacertaintimeinterval(typicallysettoa' and 'bedirectionaltocapturethefactthatonestorywasclicked'

That being said, the issue lies within PyPDF2's text extraction. You may wish to switch to another PDF reader. Comment if it still doesn't help.
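For example, a sketch using pdfminer.six (choosing that library is my suggestion, not something from your current setup, and I have not run it against your exact file):

# pip install pdfminer.six
from pdfminer.high_level import extract_text

# extract_text returns the whole document as a single string
text = extract_text("C:\\Users\\Name\\Documents\\Data Analytics Club\\SampleBook-English2-Reading.pdf")
words = text.split()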

Haris Nadeem
  • Thanks for the input, but the txt file still looks the same as before, without any of the stopwords() removed. I have tried numerous methods, and yours looked very promising, but alas Python doesn't like me very much I guess. Could the problem be linked to the txt file writing portion? It worked with a regular print() but not in the txt. Thanks. – user8095302 Apr 09 '18 at 02:51
  • Give me a minute to test something. I believe that `pages.extractText()` is providing one long string instead of words. If that is the case, then you will need to use `split(" ")` to form it into words. – Haris Nadeem Apr 09 '18 at 03:10
  • And don't lose hope, we have all been down this road where we feel that the programming language doesn't treat us well. And you did a pretty good job so far, so kudos :) – Haris Nadeem Apr 09 '18 at 03:11
  • I edited my solution. Sadly it seems the issue is with the pypdf2 library – Haris Nadeem Apr 09 '18 at 03:24
  • Aww that's a shame. I currently don't know of any other libraries or methods to read words from a pdf. Anyhow, thanks for all of the help and for the kind words of encouragement! – user8095302 Apr 10 '18 at 00:43