
I just scraped text data from a website, and the data contains numbers, special characters, and punctuation. After splitting the data I tried to keep only plain text, but I'm still getting spaces, numbers, and special characters. How can I remove all of those and keep only the text?

url = 'www.example.com'
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
extracted_data = text.split()
refined_data = []
SYMBOLS = '{}()[].,:;+-*/&|<>=~0123456789'
for i in extracted_data:
    if i not in SYMBOLS:
       refined_data.append(i)
print("\n", "$" * 50, "HEYAAA we got arround: ", len(refined_data), " of keywords! Here are they: ","$" * 50, "\n")
print(type(refined_data)) 


output:

1.My
2.system
3.showing
4.error
5.404
6.I
7.don't
8.understand
9.why
10. it
11. showing ,
12.like
13.this?
14.53251
15.$45
Jainmiah
  • As there are many cases for what you are asking, it's better to show sample text and the desired output – ashishmishra Apr 20 '20 at 05:56
  • @ashishmishra I just added an example output. The extracted text contains a lot of punctuation, whitespace, numbers, and special characters, so I want to remove all of those and keep only plain text. – Jainmiah Apr 20 '20 at 06:18

1 Answer


extracted_data is the result of text.split().

Called without arguments, str.split() splits the text on any run of whitespace, so each item in extracted_data is a whole token such as 'system' or '53251'.

The not in operator tests i (the entire token) against SYMBOLS. Because SYMBOLS is a single string, this is a substring test: it asks whether the whole token occurs somewhere inside SYMBOLS, not whether the token contains any of those characters.

So is 'system' in SYMBOLS? In other words, does the string 'system' appear anywhere inside SYMBOLS? No, it does not. Therefore the body of your if statement runs and the token is appended to refined_data.

Is '53251' in SYMBOLS? No, it is not. Therefore it is appended as well.

And so on.
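
To see concretely what the test does, here is a small sketch that runs `in` against a few tokens like the ones in the question. The `tokens` list is made up for illustration. Note that when both operands are strings, `in` checks whether the left one occurs as a substring of the right one, so only tokens that happen to sit contiguously inside SYMBOLS are filtered out.

```python
SYMBOLS = '{}()[].,:;+-*/&|<>=~0123456789'

# Illustrative tokens (not the asker's real data).
tokens = ['system', '53251', '404', ',', '123', 'this?']

for token in tokens:
    # True only if the whole token appears contiguously inside SYMBOLS.
    print(repr(token), token in SYMBOLS)

# Same filter as in the question: only ',' and '123' are dropped,
# because '53251' and '404' are not substrings of SYMBOLS.
kept = [t for t in tokens if t not in SYMBOLS]
print(kept)  # ['system', '53251', '404', 'this?']
```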


Such a membership test is not necessary. You should be using str.strip(), which removes the given characters from the start and end of each token.
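
A minimal sketch of that idea, under a few assumptions: `clean_tokens` is a hypothetical helper name, the sample sentence is rebuilt from the output in the question, and `'?!$'` is added to the strip set because those characters are not in the original SYMBOLS. Tokens that become empty after stripping (pure numbers or punctuation) are skipped.

```python
SYMBOLS = '{}()[].,:;+-*/&|<>=~0123456789'

# Assumption: also strip '?', '!' and '$', which SYMBOLS does not cover.
STRIP_CHARS = SYMBOLS + '?!$'


def clean_tokens(text, strip_chars=STRIP_CHARS):
    """Split text on whitespace and strip unwanted characters from each token's ends."""
    cleaned = []
    for token in text.split():
        word = token.strip(strip_chars)  # only leading/trailing characters are removed
        if word:                         # drop tokens that were nothing but symbols/digits
            cleaned.append(word)
    return cleaned


sample = ("My system showing error 404 I don't understand why "
          "it showing , like this? 53251 $45")
print(clean_tokens(sample))
# ['My', 'system', 'showing', 'error', 'I', "don't", 'understand',
#  'why', 'it', 'showing', 'like', 'this']
```

Because str.strip() only touches the ends of a string, a token such as 'abc123def' would keep its inner digits; removing characters from the middle of tokens would need a different tool, for example str.translate() or a regular expression.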

650aa6a2