How to remove special characters from txt files using Python

Question

from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are" ,sum(map(countwords, filelist)), "words in the files. " "From directory",pattern
import os
uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
print "There are" ,len(uniquewords), "unique words in the files." "From directory", pattern

This is my code so far. It counts the total number of words and the number of unique words in D:\report\shakeall\*.txt.

The problem is that this code treats, for example, code, code. and code! as different words, so it doesn't give an exact count of unique words.

I'd like to either remove the special characters from the 42 text files using a Windows text editor, or add an exception rule to the code that solves this problem.

If I use the latter, how should I write my code?

Should it modify the text files directly, or should it ignore special characters when counting?

[How to format code on SO](http://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks) — Levon, Aug 10 '12 at 12:51

score 9 · Answer 1 · edited Aug 10 '12 at 18:13

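import re

# Read the input, replace the unwanted characters, and write the result.
with open('a.txt') as src:
    text = src.read()
new_str = re.sub(r'[^a-zA-Z0-9\n.]', ' ', text)
with open('b.txt', 'w') as dst:
    dst.write(new_str)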

This replaces every character other than ASCII letters, digits, newlines and periods with a space.
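The same substitution can also be applied in memory while counting, without rewriting the files. The following is a minimal sketch, not the answerer's code: the function name `unique_words` is hypothetical, it assumes Python 3, it also removes periods, and lower-casing the words is an extra assumption so that "Code" and "code" count as one word.

```python
import re

def unique_words(text):
    """Return the set of lower-cased alphanumeric words found in text."""
    # Replace everything except ASCII letters, digits and whitespace with a
    # space, so "code." and "code!" both become "code".
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # Lower-casing is an assumption; drop .lower() to keep case distinctions.
    return set(word.lower() for word in cleaned.split())

sample = "code code. and code! Code?"
print(sorted(unique_words(sample)))
```

In the question's loop, the result of `unique_words(...)` for each file could be merged into `uniquewords` with `uniquewords.update(...)`.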

answered Aug 10 '12 at 12:57 by NIlesh Sharma · edited Aug 10 '12 at 18:13 by Lanaru

You shouldn't use `str` for the name of a variable since it's a built-in class. – Lanaru Aug 10 '12 at 18:12

score 2 · Answer 2 · edited May 23 '17 at 11:51

I'm pretty new and I doubt this is very elegant, but one option is to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which I think you're using).

As far as the actual code, it might be something like this (but maybe someone better than me can confirm/improve on it):

import string
fileString.translate(None, string.punctuation)  # Python 2 only: deletes all punctuation

where "fileString" is the string that your open(fp) read in. "None" is provided in place of a translation table (which would normally be used to actually change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols) is a set of characters that will be deleted from your string.
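The two-argument form with `None` only works on Python 2 byte strings. As a hedged sketch of the Python 3 equivalent (the helper name `strip_punctuation` is hypothetical), `str.maketrans` accepts a third argument listing the characters to delete:

```python
import string

def strip_punctuation(text):
    """Delete every ASCII punctuation character from text (Python 3)."""
    # With three arguments, str.maketrans maps each character in the third
    # argument to None, which str.translate treats as "delete".
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

print(strip_punctuation("code. code! don't"))
```

Note that this deletes apostrophes and hyphens as well, so "don't" becomes "dont".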

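In the event that the above doesn't work, you could build a translation table instead. maketrans needs two strings of equal length, so this version replaces each punctuation character with a space rather than deleting it:

from string import maketrans, punctuation  # Python 2; in Python 3 use str.maketrans
inChars = punctuation
outChars = ' ' * len(inChars)  # one replacement character per input character
translateTable = maketrans(inChars, outChars)
fileString.translate(translateTable)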

There are a couple of other answers to similar questions that I found via a quick search. I'll link them here, too, in case you can get more from them.

Removing Punctuation From Python List Items

Remove all special characters, punctuation and spaces from string

Strip Specific Punctuation in Python 2.x

Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try it and become frustrated.

score 0 · Answer 3 · answered Aug 10 '12 at 13:03

import re

Then replace

[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]

with

[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root,name)).read().split()]

This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
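Words can also carry leading punctuation, such as an opening quote or parenthesis. As an alternative sketch, not part of this answer, `str.strip` with `string.punctuation` removes punctuation from both ends and leaves inner characters alone; the helper name `normalize` and the sample text are hypothetical.

```python
import string

def normalize(word):
    """Strip ASCII punctuation from both ends of a word; keep inner characters."""
    return word.strip(string.punctuation)

words = '"code" code. (code!) don\'t ...'.split()
# Skip tokens that consist only of punctuation and normalize to an empty string.
uniquewords = {normalize(w) for w in words if normalize(w)}
print(sorted(uniquewords))
```

Unlike deleting all punctuation, this keeps the apostrophe in "don't".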

score 0 · Answer 4 · answered Nov 14 '22 at 12:24

On Linux, some system files under /proc contain characters with ASCII value 0 (NUL). The following replaces them with spaces:

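full_file_path = 'test.txt'
result = []
with open(full_file_path, encoding='utf-8') as f:
    # Only the first line of the file is read here.
    line = f.readline()
    for c in line:
        # Replace NUL characters (code point 0) with a space; keep everything else.
        if ord(c) == 0:
            result.append(' ')
        else:
            result.append(c)
print(''.join(result))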
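The character-by-character loop can be condensed with `str.replace`. This is a minimal sketch under the same assumption that only NUL characters need replacing; the function name `replace_nuls` and the sample string are hypothetical.

```python
def replace_nuls(text, replacement=' '):
    """Return text with every NUL character (code point 0) replaced."""
    return text.replace('\x00', replacement)

# A NUL-separated sample string, similar in shape to a command line.
sample = 'python\x00-m\x00http.server\x00'
print(replace_nuls(sample).split())
```

To process a whole file rather than its first line, the result of `f.read()` could be passed to `replace_nuls` in place of the sample string.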
