
I'm trying to write a script to automate a simple task: removing characters from txt files and saving each file under the same name, but without those characters. I have multiple txt files (e.g. 1.txt, 2.txt ... 200.txt) stored in a directory (Documents), and a txt file with the characters I want to remove. At first I thought of comparing my chars_to_remove.txt against each of the files (1.txt, 2.txt...), but I couldn't find a way to do so. Instead, I created a string with all the characters I want to remove.

Let's say I have the following string in 1.txt file:

Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).

I want to remove the α, β, and Δ characters from the string. This is my code so far.

import glob 
import os 

chars_to_remove = '‘’“”|n.d.…•∈αβδΔεθϑφΣμτσχ€$∞http:www.←→≥≤<>▷×°±*⁃'

file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)

for f in file_names:
    outfile = open(f,'r',encoding='latin-1')
    data = outfile.read()
    if chars_to_remove in data:
        data.replace(chars_to_remove, '')
    outfile.close()

In each iteration, the variable data stores the full content of one txt file. I want to check whether any of chars_to_remove occur in the string and remove them with the replace() function. I tried different approaches suggested here and here without success.

Also, I tried to compare it as a list:

chars_to_remove = ['‘','’','“','”','|','n.d.','…','•','∈','α','β','δ','Δ','ε','θ','ϑ','φ','Σ','μ','τ','σ','χ','€','$','∞','http:','www.','←','→','≥','≤','<','>','▷','×','°','±','*','⁃']

but got datatype errors when comparing.

Any further idea will be appreciated!

M Z

2 Answers


It may not be as fast, but why not use Regex to remove the characters/phrases?

import re

# Regex metacharacters (., $, |, *) are escaped so they match literally
pattern = re.compile(r"(‘|’|“|”|\||n\.d\.|…|•|∈|α|β|δ|Δ|ε|θ|ϑ|φ|Σ|μ|τ|σ|χ|€|\$|∞|http:|www\.|←|→|≥|≤|<|>|▷|×|°|±|\*|⁃)")
result = pattern.sub("", 'Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).')
print(result)

Output

Mean concentrations , maximum value ratio  and reductions in NO2 due to the lockdown , March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).
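
The alternation can also be generated from the list version in the question, so that metacharacters such as ., $, | and * are escaped automatically instead of by hand. This is only a minimal sketch: the names tokens_to_remove and strip_tokens are hypothetical and do not come from the original posts.

```python
import re

# Hypothetical name; the contents are the list version from the question
tokens_to_remove = ['‘', '’', '“', '”', '|', 'n.d.', '…', '•', '∈', 'α', 'β',
                    'δ', 'Δ', 'ε', 'θ', 'ϑ', 'φ', 'Σ', 'μ', 'τ', 'σ', 'χ', '€',
                    '$', '∞', 'http:', 'www.', '←', '→', '≥', '≤', '<', '>',
                    '▷', '×', '°', '±', '*', '⁃']

# re.escape makes every token match literally; longer tokens are placed
# first so they are tried before any shorter alternative
pattern = re.compile('|'.join(
    re.escape(t) for t in sorted(tokens_to_remove, key=len, reverse=True)))


def strip_tokens(text):
    """Return text with every listed token removed (hypothetical helper)."""
    return pattern.sub('', text)


print(strip_tokens('ratio β costs $5 (n.d.), see www.example.org'))
```

Because 'n.d.' is matched as a whole token, ordinary words containing n or d (such as "and") are left untouched.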
0

The most efficient way is str.translate, which avoids looping over each invalid character. The output file must be defined in some manner.

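import glob
import os

# translate() works on single characters only. The multi-character tokens are
# removed separately; otherwise every n, d, h, t, p, w, '.' and ':' would go too.
chars_to_remove = '‘’“”|…•∈αβδΔεθϑφΣμτσχ€$∞←→≥≤<>▷×°±*⁃'
tokens_to_remove = ['n.d.', 'http:', 'www.']
translator = str.maketrans('', '', chars_to_remove)

file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)

for f in file_names:
    # Note: text decoded as latin-1 can never contain Greek letters;
    # use the files' real encoding (e.g. utf-8) if they do
    with open(f, 'r', encoding='latin-1') as infile:
        data = infile.read()
    # str methods return a new string, so the result must be assigned
    data = data.translate(translator)
    for token in tokens_to_remove:
        data = data.replace(token, '')

    # Now data is translated
    # You must write it in a new file
    with open('...', 'wt', encoding='latin-1') as outfile:
        outfile.write(data)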

Hint

This code works, but it is inefficient because each file is fully loaded into memory. A better way is to iterate over infile line by line and write each processed line to outfile as you go.
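
The line-by-line idea could look roughly like the sketch below. It rests on several assumptions that are not in the original posts: the helper name clean_file is hypothetical, the files are taken to be UTF-8 (the posts use latin-1), the multi-character tokens are handled separately from the single characters, and the cleaned text replaces the original file via a temporary file so the name stays the same.

```python
import os
import tempfile

# Assumption: single characters and multi-character tokens are kept apart,
# because translate() operates on individual characters only
CHARS_TO_REMOVE = '‘’“”|…•∈αβδΔεθϑφΣμτσχ€$∞←→≥≤<>▷×°±*⁃'
TOKENS_TO_REMOVE = ('n.d.', 'http:', 'www.')
TRANSLATOR = str.maketrans('', '', CHARS_TO_REMOVE)


def clean_file(path, encoding='utf-8'):
    """Rewrite path in place, holding one line in memory at a time."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in,
    # so the cleaned file ends up under the original name
    with open(path, 'r', encoding=encoding) as infile, \
         tempfile.NamedTemporaryFile('w', encoding=encoding, dir=directory,
                                     delete=False) as outfile:
        for line in infile:
            line = line.translate(TRANSLATOR)
            for token in TOKENS_TO_REMOVE:
                line = line.replace(token, '')
            outfile.write(line)
    os.replace(outfile.name, path)
```

It would be called once per path returned by glob.glob, for example clean_file(f) inside the existing for loop.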

Glauco