How to approach find+replace with RST files using Python or Grep

Question

I'm trying to automate a find+replace for a series of broken image links in .rst files. I have a csv file where column A is the "old" link (which is seen in the .rst files) and column B is the new replacement link for each row.

I can't use pandoc first to convert to HTML because it "breaks" the rst file. I did this once for a set of HTML files using BeautifulSoup and regex, but that parser won't work for my rst files.

A coworker suggested trying Grep, but I can't seem to figure out how to call in the csv file to make the "match" and switch.

For the HTML files, the script looped through each file, searched for img tags, and replaced the links using the csv file as a dict:

import csv

graph_main_nodes = []
graph_child_nodes = []
with open(image_csv, newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # Ignore the header row
    for row in reader:
        graph_main_nodes.append(row[0])
        graph_child_nodes.append(row[1:])
# Dict with keys in correct location, vals in old locations
graph = dict(zip(graph_main_nodes, graph_child_nodes))

# Invert it so that each old location maps to its correct location
graph = dict((v, k) for k in graph for v in graph[k])

for fixfile in html:
    try:
        with open(fixfile, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
        for tag in soup.find_all('img'):
            print(tag['src'])
            if tag['src'] in graph:
                tag['src'] = graph[tag['src']]
                replaced_links += 1
                print("Match found!")
            else:
                orphan_links.append(tag['src'])
                print("Ignore")
    except OSError as err:
        # Skip files that cannot be read
        print(f"Could not read {fixfile}: {err}")

I would love some suggestions on how to approach this. I'd love to repurpose my BeautifulSoup code but I'm not sure if that's realistic.

Do you have to replace all occurrences of `old link` with `new link` or just a particular one? — thehole, May 29 '19 at 14:11
All occurrences. There's about 10k different old links, each are used 1-5 times in the entire set of files. — ksprk, May 29 '19 at 15:05

thehole · Answer 1 · 2019-05-29T15:43:09.230

This question has information on parsing an RST file, but I don't think it's necessary. Your question boils down to replacing textA with textB. You already have your graph with the csv loaded, so you should be OK with just this (credit to this answer):

# Read in the file (fixfile is the path to one .rst file)
with open(fixfile, 'r', encoding='utf-8') as file:
    filedata = file.read()

# Replace the target strings; str.replace returns a new string, so reassign it
for old, new in graph.items():
    filedata = filedata.replace(old, new)

# Write the file out again
with open(fixfile, 'w', encoding='utf-8') as file:
    file.write(filedata)
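
Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes the csv has a header row, with the old link in the first column and the new link in the second, as described in the question. The names `load_mapping`, `make_replacer`, and `fix_files` are made up for illustration. Calling `str.replace` in a loop rescans the text once per link, and a replacement could in principle be matched again by a later pair, so this version swaps in a single compiled regex alternation that scans each file once.

```python
import csv
import re
from pathlib import Path


def load_mapping(csv_path):
    """Return {old_link: new_link} from a two-column CSV with a header row."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return {row[0]: row[1] for row in reader if len(row) >= 2 and row[0]}


def make_replacer(mapping):
    """Build a function that replaces every old link in one pass over a string."""
    if not mapping:
        return lambda text: (text, 0)
    # Longest keys first, so a link that is a prefix of another cannot shadow it.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes each link match literally (dots, slashes, and so on).
    pattern = re.compile('|'.join(re.escape(k) for k in keys))

    def replace(text):
        # subn returns (new_text, number_of_replacements)
        return pattern.subn(lambda m: mapping[m.group(0)], text)

    return replace


def fix_files(root, csv_path):
    """Rewrite every .rst file under root; return the total replacement count."""
    replace = make_replacer(load_mapping(csv_path))
    total = 0
    for path in Path(root).rglob('*.rst'):
        updated, count = replace(path.read_text(encoding='utf-8'))
        if count:  # only touch files that actually changed
            path.write_text(updated, encoding='utf-8')
            total += count
    return total
```

A call such as `fix_files('docs', 'links.csv')` (both paths hypothetical) would rewrite the files in place, so it is worth running it on a copy or under version control first.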

This is also a good candidate for sed or perl, using something like this answer. I also used this answer for help on specifying a rare delimiter for sed. After testing, change the -n to -i and the p to g so that it actually saves the file:

DELIM=$(printf '\001')  # A rare character, used as the sed delimiter
# Read each csv line into two fields separated by ","
while IFS=, read -r PATTERN REPLACEMENT
do
   # sed interprets PATTERN as a regular expression
   sed -n "s${DELIM}${PATTERN}${DELIM}${REPLACEMENT}${DELIM}p" fixfile.rst
done < csvFile
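
The same dry-run idea can be sketched in Python: report which lines would change before writing anything. The function name `preview_replacements` and the sample data are hypothetical. Matching here is plain substring matching, so unlike the sed version the old links are not treated as regular expressions.

```python
def preview_replacements(lines, mapping):
    """Yield (line_number, old_line, new_line) for each line that would change.

    Nothing is written; this only reports what a real run would do.
    """
    for number, line in enumerate(lines, start=1):
        new_line = line
        for old, new in mapping.items():
            if old in new_line:
                new_line = new_line.replace(old, new)
        if new_line != line:
            yield number, line, new_line


if __name__ == '__main__':
    # Hypothetical sample data standing in for the csv and one .rst file.
    mapping = {'img/old.png': 'static/new.png'}
    sample = ['Title', '=====', '', '.. image:: img/old.png', 'Plain text']
    for number, before, after in preview_replacements(sample, mapping):
        print(f'{number}: {before} -> {after}')
```

Checking every link against every line is slow with around 10k links, which is acceptable for a one-off preview but not for the final run.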
