
I have a set of regular expressions for substitution in a file (sed.clean) as follows:

#!/bin/sed -f
s/https\?:\/\/[^ ]*//g
s/\.//g
s/\"//g
s/\,//g
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/

and some more lines like those. I want to use this file to clean a set of text files. To do this in bash I'd do something like this:

for file in $(ls rootDirectory)
do
    sed -f sed.clean $file > OUTPUT_FILE
done

How could I do something similar in Python?

What I mean is: is it possible to leverage the N REs which I have in the sed.clean file (or rewrite them in the proper Python format) in order to avoid building a nested loop comparing each file with each RE, and instead apply the whole sed.clean file to each text file, as I do in bash? Something like this:

files = [ f for f in listdir(dirPath) if isfile(join(dirPath,f)) ]
for file in files:
    newTextFile = re.sub(sed.clean, file)
    saveTextFile(newTextFile, outputPath)

instead of this:

REs = ['s/https\?:\/\/[^ ]*//g', 's/\.//g',...,'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/']
files = [ f for f in listdir(dirPath) if isfile(join(dirPath,f)) ]
for file in files:
    for pattern in REs:
        newTextFile = re.sub(pattern, '', file)
        saveTextFile(newTextFile, outputPath)

Thanks!

Mad Max

3 Answers


These sed expressions strip the matching text from each line of a file. In Python, readlines(), filter() and re.sub() would be your best pick.
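A rough sketch of that approach (my own illustration; the two patterns mirror a subset of the question's sed script, not the full sed.clean):

```python
import re


def clean_lines(lines):
    # strip URLs and punctuation, then lowercase, mirroring the sed script
    url = re.compile(r'https?://[^ ]*')
    punct = re.compile(r'[.",]')
    cleaned = (punct.sub('', url.sub('', line)).lower() for line in lines)
    # drop lines that became empty after cleaning (the filter() step)
    return [line for line in cleaned if line.strip()]
```

You would feed it the result of `file_handle.readlines()` for each file.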

Freek Wiekmeijer
  • Hi Freek, yes, I know that, but what I'm looking for is a way to avoid a nested loop over each text file and regular expression. I've rewritten my question to clarify it. – Mad Max Apr 03 '14 at 12:33
  • So you want to avoid the N*M complexity (N lines * M regexp patterns). I agree with J0HN's comment on the post above. sed probably implements a nested loop as well (for r in expressions: for l in lines). The only way out of the N*M complexity seems to be something really complex where you sort the lines (which comes at an N*log(N) cost as well). – Freek Wiekmeijer Apr 03 '14 at 14:38

Try re.sub() like this:

>>> import re
>>> re.compile(r'\.')
<_sre.SRE_Pattern object at 0x9d48c80>
>>> MY_RE = re.compile(r'\.')
>>> MY_RE.sub('', 'www.google.com')
'wwwgooglecom'

You can compile any regex with re.compile().
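Extending that idea, you can compile each pattern once and reuse the compiled objects across every file (the list below is an illustrative subset of the question's sed.clean, not the full file):

```python
import re

# compile each pattern once, then reuse across all files
rules = [
    (re.compile(r'https?://[^ ]*'), ''),  # strip URLs
    (re.compile(r'[.",]'), ''),           # strip punctuation
]


def clean(text):
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.lower()  # the y/// line from the question
```

That way the per-file work is just a sequence of sub() calls on already-compiled patterns.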

WisZhou
  • Thanks, WisZhou, but what I'm looking for is to iterate over every text file and apply the regular expressions from a single file, instead of iterating over every single regular expression. – Mad Max Apr 03 '14 at 12:27

You'll have to convert your sed script replacements to Python equivalents.


s/<pattern>/<replacement>/<flags>
# is equivalent to
re.sub("<pattern>", "<replacement>", <input>, flags=<python-flags>)

Note that re.sub replaces all non-overlapping occurrences by default, so there's no need for the /g at the end of the pattern. Moreover, you should not include flags in the pattern, as they are passed as a separate parameter. For example:

re.sub("\.", "", "a.b.c.d", flags=re.MULTILINE)

y/<pattern>/<replacement>/
# is equivalent to
trans = str.maketrans("<pattern>", "<replacement>")
<input>.translate(trans)

But in the case of y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/ it's just as simple as <input>.lower().
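For example, the y/// line from the question translates to:

```python
# sed: y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
trans = str.maketrans('ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                      'abcdefghijklmnopqrstuvwxyz')
print('Hello World'.translate(trans))  # hello world
print('Hello World'.lower())           # same result for this particular y///
```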


for file in $(ls rootDirectory) is roughly equivalent to

files = [f for f in os.listdir('<rootDirectory>') if os.path.isfile(os.path.join('<rootDirectory>', f))]
for f in files:
    # do something

All together:

import os # don't forget to import required modules
import re

output_file = open('C:\\temp\\output.txt', 'w')

def process(line):
    result = line
    result = re.sub(r'"', '', result)
    result = re.sub(r'\.', '', result)
    # do all the stuff your sed script does and then
    return result

files = [f for f in os.listdir('.') if os.path.isfile(f)]
for file in files:
    file_handle = open(file, 'r')
    lines = file_handle.readlines()
    file_handle.close()
    processed = map(process, lines)
    for line in processed:
        output_file.write(line)

output_file.close()

Refer to the Python documentation for regex and file operations for details.

You might want to try converting your sed script to Python automatically, but if it's a one-time requirement it's simpler to do it by hand.
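If you do want a small automatic converter, here is a rough sketch (my own illustration, not a complete sed implementation): it parses only plain s/// and y/// lines, and sed-specific regex syntax such as BRE's `\?` would still need manual translation to Python's `?`.

```python
import re


def parse_sed_line(line):
    """Parse a simple 's/pat/repl/flags' or 'y/src/dst/' sed line.

    Only '/' is supported as the delimiter; backslash-escaped slashes
    are unescaped for Python. Returns None for comments/blank lines.
    """
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    parts = re.split(r'(?<!\\)/', line)  # split on unescaped '/'
    if line.startswith('s') and len(parts) >= 4:
        pattern = parts[1].replace(r'\/', '/')
        replacement = parts[2].replace(r'\/', '/')
        return ('s', re.compile(pattern), replacement)
    if line.startswith('y') and len(parts) >= 4:
        return ('y', str.maketrans(parts[1], parts[2]), None)
    raise ValueError('unsupported sed command: %r' % line)


def apply_rules(rules, text):
    # one pass over the rule list per file, like sed does internally
    for kind, compiled, replacement in rules:
        if kind == 's':
            text = compiled.sub(replacement, text)
        else:
            text = text.translate(compiled)
    return text
```

You would read sed.clean once, build the rule list, and then call apply_rules() once per file, which is the single-call-per-file workflow the question asks for.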

J0HN
  • Thank you, J0HN! But what I'm trying to avoid is a nested loop: I don't want to check every regular expression against every text file; I want to apply a single regular-expressions file to every text file. – Mad Max Apr 03 '14 at 12:19
  • @MadMax I'm not familiar with sed, so could you please explain a bit more? Does it mean you want to apply the regex on line N *only* to the N-th file in the directory? – J0HN Apr 03 '14 at 12:38
  • The ***sed*** command in bash lets you do something equivalent to ***re.sub()***, but in addition it also lets you use a file of regexes instead of calling ***sed*** for each regex you want to use. What I want is to call the Python ***re.sub()*** command (or its equivalent) just once for each text file, as I'd do in bash. – Mad Max Apr 03 '14 at 13:21
  • Well, you can still use your beloved `sed.clean`, provided you create some kind of converter from its contents to Python regexes. And there will still be some loop. Python is not sed; it can't consume sed scripts unless you tell it how to do that. There are some [approaches](http://stackoverflow.com/questions/4427542/how-to-do-sed-like-text-replace-in-python), but if I understand right there is no tool that can do that for you automagically. So you've got an opportunity to enrich the community by solving your own problem and publishing the code. – J0HN Apr 03 '14 at 13:46
  • @MadMax actually, there's a loop in the `sed` solution as well, it's just hidden deep inside `sed`, so it's kind of transparent to you. – J0HN Apr 03 '14 at 13:48