I have a set of files in which I've tagged the beginning of paragraphs and sentences, but I need to iterate over each file so that each paragraph and each sentence in a file has a unique numerical ID. I believe that this can be done with str.replace or with the regular expression (re) module.
In the external files, paragraph and sentence opening tags are marked with placeholders as follows:
<p id="####"> # four hash marks for paragraphs
<s id="#####"> # five hash marks for sentences
Here I open the external files and call the paragraph and sentence numbering functions (which live in a separate module, NumberRoutines). This doesn't work as intended.
import re, fileinput, NumberRoutines

ListFiles = ['j2vch34.txt', '79HOch16.txt']
with fileinput.input(files=(ListFiles), inplace=True, backup='.bak') as f:
    for filename in ListFiles:
        with open(filename) as file:
            text = file.read() # read file into memory
        text = NumberRoutines.NumberParas(text)
        text = NumberRoutines.NumberSentences(text)
        with open(filename, 'w') as file:
            file.write(text)
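For reference, the read-modify-write loop can be sketched without fileinput at all. As far as I can tell, the fileinput.input(...) wrapper above never has its lines iterated, so it likely never creates the .bak backups; the sketch below makes the backup explicitly with shutil.copy2 instead. The helper name rewrite_files and its transforms parameter are hypothetical, not part of the original code.

```python
import shutil


def rewrite_files(filenames, transforms):
    """Apply each text -> text function in `transforms` to every file in place.

    `rewrite_files` and `transforms` are illustrative names (assumptions),
    standing in for the loop that calls NumberParas and NumberSentences.
    """
    for filename in filenames:
        # Keep a copy of the original next to it, e.g. 'j2vch34.txt.bak'.
        shutil.copy2(filename, filename + '.bak')
        with open(filename) as infile:
            text = infile.read()  # read the whole file into memory
        for transform in transforms:
            text = transform(text)
        with open(filename, 'w') as outfile:
            outfile.write(text)
```

Under these assumptions it would be called as rewrite_files(ListFiles, [NumberRoutines.NumberParas, NumberRoutines.NumberSentences]).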
In NumberRoutines, I've tried to apply the numbering. This is the version for paragraphs:
def NumberParas(text):
    sub = "p id="
    str = text
    totalparas = str.count(sub, 0, len(str))
    counter = 0
    for paranumber in range(totalparas):
        return str.replace('p id="####"', 'p id="{paranumber}"'.format(**locals()))
        counter += 1
Following R Nar's response below, I have repaired the earlier version, so I no longer get an error. It rewrites the file, but the paragraph number is always 0.
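One likely reason the number is always 0: the return sits inside the for loop, so the function exits on the first pass (when the loop variable is 0), and str.replace without a count argument replaces every placeholder at once. A minimal sketch of a fix, assuming the placeholder is exactly p id="####" and that numbering should start at 1 (the starting value is an assumption), limits each replacement to one occurrence and returns only after the loop:

```python
def number_paras(text, placeholder='p id="####"'):
    """Replace each paragraph placeholder with a running number.

    `number_paras` is a hypothetical rewrite of NumberParas; numbering
    from 1 is an assumption, not something the original specifies.
    """
    total = text.count(placeholder)
    for number in range(1, total + 1):
        # The third argument limits the replacement to the first
        # remaining placeholder, so each one gets its own number.
        text = text.replace(placeholder, 'p id="{}"'.format(number), 1)
    return text  # return only after every placeholder is numbered
```

The same shape would apply to sentences with the five-hash placeholder.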
This is the second way that I've tried to apply numbering, this time for sentences:
def NumberSentences(text):
    sub = "s id="
    str = text
    totalsentences = str.count(sub, 0, len(str))
    counter = 0
    for sentencenumber in range(totalsentences):
        return str.replace('s id="#####"', 's id="{counter}"'.format(**locals()))
        counter += 1
The former TypeError (Can't convert 'int' object to str implicitly) is resolved.
The code now reads and rewrites the files, but all sentences are numbered 0.
Two other questions:
1. Do I need the **locals() for local scoping of variables inside the for statement?
2. Can this be done with a regex? Despite many tries, I could not get the {} placeholder for substituting a variable's value to work with a regex.
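On both questions, a hedged sketch: **locals() should not be needed, since format can be given the value directly, and re.sub accepts a function as the replacement, which is called once per match. The helper name number_tags, its parameters, and numbering from 1 are all assumptions made for illustration.

```python
import itertools
import re


def number_tags(text, tag, width):
    """Number every <tag id="###..."> placeholder with `width` hash marks.

    `number_tags` is a hypothetical helper; numbering from 1 is an assumption.
    """
    counter = itertools.count(1)
    # Doubled braces survive format() as literal braces, so for
    # tag='p', width=4 the pattern becomes: <p id="#{4}">
    pattern = re.compile(r'<{} id="#{{{}}}">'.format(re.escape(tag), width))

    def replacement(match):
        # Called once per match, so each tag receives the next number.
        return '<{} id="{}">'.format(tag, next(counter))

    return pattern.sub(replacement, text)
```

Under these assumptions, number_tags(number_tags(text, 'p', 4), 's', 5) would number paragraphs and sentences in two passes, each with its own counter.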
I have read https://docs.python.org/3.4/library/stdtypes.html#textseq and chapter 13 of Mark Summerfield's Programming in Python 3, and was influenced by Dan McDougall's answer on Putting a variable inside a string (python).
Several years ago I struggled with the same thing in Perl (2009, Query to PERL beginners), so sigh.