24

I have a huge corpus of text (processed line by line) and I want to remove special characters while preserving the spaces and structure of each string.

hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.

should be

hello there A Z R T world welcome to python
this should be the next line followed by another million like this
pythonlearn
  • Just create a list of the characters you want to keep (A-Z, a-z, 0-9, etc.) and use a `for` loop to iterate over each character in the string, replacing the characters that are not in the list with a space. – Wright Apr 12 '17 at 01:47
  • Is that efficient for a huge corpus of millions of lines of text? – pythonlearn Apr 12 '17 at 01:49

5 Answers

39

You can use this pattern, too, with regex:

import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''

for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)

Output:

hello there A Z R T world welcome to python 
this should the next line followed by an other million like this 

Edit:

Otherwise, you can store the final lines into a list:

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)

Output:

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
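For a corpus of millions of lines, the same substitution can be applied lazily instead of building one big string first. This is only a minimal sketch: the helper name `clean_lines` is made up for illustration, `io.StringIO` stands in for a real open file, and the `.strip()` call is an addition that drops the trailing space visible in the output above.

```python
import io
import re

# Compile once up front; this avoids the per-call pattern-cache lookup.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")

def clean_lines(lines):
    """Yield each line with runs of non-alphanumeric characters replaced by one space."""
    for line in lines:
        yield NON_ALNUM.sub(" ", line).strip()

# io.StringIO stands in for an open file; a real corpus would use open(path).
corpus = io.StringIO(
    "hello? there A-Z-R_T(,**), world, welcome to python.\n"
    "this **should? the next line#followed- by@ an#other %million^ %%like $this.\n"
)
cleaned = list(clean_lines(corpus))
print(cleaned)
```

Because `clean_lines` is a generator, only one line is held in memory at a time when it is fed a file object and the results are written out as they are produced.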
Phillip Kigenyi
Chiheb Nexus
9

I think nfn neil's answer is great, but I would just add a simple regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word:

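import re
string = 'hello? there A-Z-R_T(,**), world, welcome to python.'
print(re.sub(r'\W+', ' ', string))
# Output (followed by a trailing space): hello there A Z R_T world welcome to python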
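One more thing worth knowing about `\W`: for `str` patterns in Python 3 it is Unicode-aware, so accented letters count as word characters unless the `re.ASCII` flag is passed. A small sketch with a made-up sample string, in case the corpus is not pure ASCII:

```python
import re

# Made-up sample containing non-ASCII letters and an underscore.
text = "café, naïve_test!"

# Default (Unicode) matching keeps accented letters as word characters.
unicode_aware = re.sub(r"\W+", " ", text)

# With re.ASCII, only [a-zA-Z0-9_] count as word characters.
ascii_only = re.sub(r"\W+", " ", text, flags=re.ASCII)

print(repr(unicode_aware))
print(repr(ascii_only))
```

Which behaviour is wanted depends on whether letters such as é should survive the cleanup.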
Eliethesaiyan
6

You can try this:

import re
sentence = '''hello? there A-Z-R_T(,**), world, welcome to python. this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
res = re.sub(r'[!,*)@#%(&$_?.^]', '', sentence)
print(res)

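Inside the character class [...] passed to re.sub you can add whichever symbols you want to remove. Note that this pattern deletes the characters instead of replacing them with a space, and it does not include the hyphen, so A-Z-R_T becomes A-Z-RT.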
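If the list of symbols is kept in a variable, the character class can be built from it. The following is a rough sketch, not the answer's own code: the `symbols` string is a hypothetical list, a space is added to the class so that a symbol next to a space collapses into a single space, and `re.escape` is used so that characters like `^` and `-` are taken literally.

```python
import re

# Hypothetical list of symbols to strip; extend as needed.
symbols = "!,*)@#%(&$_?.^-"

# re.escape makes characters such as ^ and - safe inside the class.
# The extra space lets "symbol + space" runs collapse into one space.
pattern = re.compile("[" + re.escape(symbols + " ") + "]+")

text = "hello? there A-Z-R_T(,**), world, welcome to python."
result = pattern.sub(" ", text).strip()
print(result)
```

This replaces each run of listed symbols with one space, which is closer to the output asked for in the question than deleting them.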

riya
4

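A more elegant solution would be:

import re
string = "hello? there A-Z-R_T(,**), world, welcome to python. this **should? the next line#followed- by@ an#other %million^ %%like $this."
print(re.sub(r"\W+|_", " ", string))

# Output (followed by a trailing space): hello there A Z R T world welcome to python this should the next line followed by an other million like this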

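Here, re is Python's regular expression module.

re.sub replaces every match of the pattern with a space, i.e. " ".

r"" is a raw string literal: backslashes are not treated as escape characters, so \W reaches the regex engine unchanged.

\W matches any non-word character, i.e. special characters such as *&^%$. It does not match the underscore _, which counts as a word character.

+ matches one or more repetitions of the preceding element (* would match zero or more).

| is alternation (logical OR).

_ matches a literal underscore.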
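To see what the quantifier changes in practice, here is a small sketch on a fragment of the sample text. The variable names are just for illustration, and the `[\W_]+` variant is an alternative I am suggesting, not part of the answer above.

```python
import re

text = "A-Z-R_T(,**), world"

# Without "+", every special character becomes its own space.
per_char = re.sub(r"\W|_", " ", text)

# With "+", a run of non-word characters collapses into one space.
collapsed = re.sub(r"\W+|_", " ", text)

# "_" next to another special character still yields two spaces here...
mixed_old = re.sub(r"\W+|_", " ", "a_-b")

# ...while putting "_" inside the class collapses the mixed run too.
mixed_new = re.sub(r"[\W_]+", " ", "a_-b")

print(repr(per_char), repr(collapsed), repr(mixed_old), repr(mixed_new))
```

So `[\W_]+` may be the safer choice if underscores can appear right next to other punctuation.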

ssp4all
0

Create a dictionary mapping special characters to None

d = {c:None for c in special_characters}

Make a translation table from the dictionary with str.maketrans. Read the entire text into a variable and call str.translate on it. Characters mapped to None are deleted.
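A minimal sketch of this approach, with two assumptions of mine: `special_characters` is a hypothetical set chosen to cover the sample text, and each character is mapped to a space rather than None so that words stay separated as the question asks.

```python
# Hypothetical set of characters to treat as "special".
special_characters = "?-_(),*.#@%^$"

# Map each special character to a space (mapping to None would delete it).
table = str.maketrans({c: " " for c in special_characters})

text = "hello? there A-Z-R_T(,**), world, welcome to python."
translated = text.translate(table)

# translate works character by character, so collapse repeated spaces afterwards.
result = " ".join(translated.split())
print(result)
```

Keep in mind that `" ".join(translated.split())` also swallows newlines, so on a multi-line corpus that last step would have to be applied per line to keep the line structure.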

wwii