I have a set of files in which I've tagged the beginning of paragraphs and sentences, but I need to iterate over each file so that each paragraph and each sentence in a file has a unique numerical ID. I believe that this can be done with str.replace or with the Regular Expression module.

In the external files, paragraph and sentence opening tags are marked as follows:

<p id="####"> # 4 for paragraphs
<s id="#####"> # 5 for sentences

Here is where I open the external files and call the paragraph and sentence numbering functions (kept in a separate module). This does not work:

import re, fileinput, NumberRoutines
ListFiles = ['j2vch34.txt', '79HOch16.txt']

with fileinput.input(files=(ListFiles), inplace=True, backup='.bak') as f:
    for filename in ListFiles:
        with open(filename) as file: 
            text = file.read() # read file into memory
        text = NumberRoutines.NumberParas(text)
        text = NumberRoutines.NumberSentences(text)

    with open(filename, 'w') as file: 
        file.write(text) 

In NumberRoutines, I've tried to apply the numbering. Here is the paragraph version:

def NumberParas(text):
    sub = "p id="
    str = text
    totalparas = str.count(sub, 0, len(str))
    counter = 0

    for paranumber in range(totalparas):
        return str.replace('p id="####"', 'p id="{paranumber}"'.format(**locals()))
        counter += 1

Following R Nar's response below, I have repaired that from earlier, so that I no longer get an error. It re-writes the file, but the paranumber is always 0.

The second way that I've tried to apply numbering, this time with sentences:

def NumberSentences(text):
    sub = "s id="
    str = text
    totalsentences = str.count(sub, 0, len(str))
    counter = 0

    for sentencenumber in range(totalsentences):
        return str.replace('s id="#####"', 's id="{counter}"'.format(**locals()))
        counter += 1

Former type error (Can't convert 'int' object to str implicitly) resolved.

It's reading and rewriting the files, but all sentences are being numbered 0.

Two other questions: 1. Do I need the **locals for local scoping of variables inside the for statement? 2. Can this be done with RegEx? Despite many tries, I could not get the {} for replacing with variable value to work with regex.

I have read https://docs.python.org/3.4/library/stdtypes.html#textseq and chapter 13 of Mark Summerfield's Programming in Python 3, and was influenced by Dan McDougall's answer on Putting a variable inside a string (python).

Several years ago I struggled with the same thing in PERL, 2009 Query to PERL beginners, so sigh.

  • you need to `return text` for your methods. if you just `return` then when you have lines like `text = whatevermethod` then text will equal `None` since the function is not going to return a value. – R Nar Oct 30 '15 at 16:00
  • Thanks, I'll test and update in next 2 hours. – English Prof WRaabe Oct 30 '15 at 17:10

2 Answers

I don't know why you have the fileinput line if you are already going to iterate through each file inside the with block, so I just took it out:

for filename in ListFiles:
    with open(filename) as file: 
        text = file.read()
    text = NumberRoutines.NumberParas(text)
    text = NumberRoutines.NumberSentences(text)
    with open(filename, 'w') as file: 
        file.write(text)

This uses the same logic. However, in your code the writing block was outside the for loop, so it would only write to the last file in the list.
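
If backups are still wanted without fileinput, one option is to copy each file before rewriting it. The following is only a sketch: the helper name `renumber_files`, the `.bak` suffix, and the UTF-8 encoding are assumptions, not part of the original code.

```python
import shutil
from pathlib import Path


def renumber_files(filenames, transforms):
    """Back up each file to <name>.bak, then rewrite it in place.

    `transforms` is a list of functions that take and return a string,
    e.g. [NumberRoutines.NumberParas, NumberRoutines.NumberSentences].
    """
    for filename in filenames:
        path = Path(filename)
        # Keep an untouched copy so each step can be re-tested later.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        # Assumption: the files are UTF-8 encoded.
        text = path.read_text(encoding="utf-8")
        for transform in transforms:
            text = transform(text)
        # The write happens inside the loop, so every file is updated.
        path.write_text(text, encoding="utf-8")
```

Because the write sits inside the for loop, each file is saved before the next one is read.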

now with the functions:

def NumberParas(text):
    # Split on the placeholder; the first piece is whatever precedes the first tag.
    returnstring = ''
    for i, para in enumerate(text.split('p id="####"')):
        if i:
            # Put the tag back, now with a zero-based number, before each later piece.
            returnstring = returnstring + 'p id="%d"%s' % (i-1, para)
        else:
            # Keep the text before the first tag unchanged.
            returnstring = para
    return returnstring

and similarly:

def NumberSentences(text):
    # Same approach as NumberParas, with the five-character sentence placeholder.
    returnstring = ''
    for i, sent in enumerate(text.split('s id="#####"')):
        if i:
            # Put the tag back with a zero-based sentence number.
            returnstring = returnstring + 's id="%d"%s' % (i-1, sent)
        else:
            # Keep the text before the first tag unchanged.
            returnstring = sent
    return returnstring

The reason I changed the logic is that str.replace replaces all instances of whatever you want to replace, not just the first. That means that the first time you call it, every tag in the text is replaced and the rest of the for loop is useless. Also, you need to actually return the string rather than just changing it in the function: strings are immutable, so the string you have inside the function is NOT the real string you want to change.
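
To see the str.replace behaviour in isolation, here is a small sketch on a made-up two-paragraph string (the sample text is an assumption for illustration). It also shows the optional third argument of str.replace, which limits the number of replacements:

```python
text = '<p id="####">One</p><p id="####">Two</p>'

# A single call rewrites every occurrence, so both tags get the same number.
all_at_once = text.replace('p id="####"', 'p id="0"')
print(all_at_once)  # <p id="0">One</p><p id="0">Two</p>

# Passing 1 as the third argument replaces only the first remaining
# placeholder, so a loop can number the tags one at a time.
numbered = text
for n in range(text.count('p id="####"')):
    numbered = numbered.replace('p id="####"', 'p id="%d"' % n, 1)
print(numbered)  # <p id="0">One</p><p id="1">Two</p>
```

The counted loop rescans the string from the start on every pass, so for very large files the split-based functions above are likely the better fit.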

The internal if i: check is there because the first item in the enumerated list is whatever comes before the first tag. I assume that would be empty since the tags come before each sentence/paragraph, but you may have whitespace or such.
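
A quick way to check what that first item contains is to look at the split result directly. This is a sketch on an invented sample string; the `HEADER` text stands in for anything that precedes the first tag:

```python
sample = 'HEADER <s id="#####">First. <s id="#####">Second.'

# Splitting on the placeholder leaves the preamble as the first piece.
pieces = sample.split('s id="#####"')
print(pieces)  # ['HEADER <', '>First. <', '>Second.']

# Rebuild: keep pieces[0] as-is, then put a numbered tag before each later piece.
rebuilt = pieces[0] + ''.join(
    's id="%d"%s' % (i, piece) for i, piece in enumerate(pieces[1:])
)
print(rebuilt)  # HEADER <s id="0">First. <s id="1">Second.
```

If the first piece is dropped instead of kept, everything before the first tag disappears from the output.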

BTW, this can all be done with a one-liner, because Python:

>>> s = 'p tag asdfawegasdf p tag haerghasdngjh p tag aergaedrg'
>>> ''.join(['p tag%d%s' % (i-1, p) if i else p for i,p in enumerate(s.split('p tag'))])
'p tag0 asdfawegasdf p tag1 haerghasdngjh p tag2 aergaedrg'
R Nar
  • Thank you so much. With a few minor edits, I was able to get this to work. – English Prof WRaabe Oct 30 '15 at 19:52
  • no problem, mark it as a right answer for future reference/to give me points ;) – R Nar Oct 30 '15 at 19:53
  • By the way, needed the with for backups so can keep re-testing each step. Did not alter your suggested code on that, though. – English Prof WRaabe Oct 30 '15 at 19:57
  • I was doing some additional testing on this, and I realized that it has an unanticipated consequence. My file has elements marked <p> and elements marked <s>. While this routine does return all the content of those elements, anything not so marked is not returned. I.e., all that precedes the first marked <p> is eliminated. Why is that? I'll be trying to figure that out, perhaps with http://python-notes.curiousefficiency.org/en/latest/python_concepts/break_else.html But else: returnstring = returnstring (same level as if:) does not fix it. – English Prof WRaabe Oct 30 '15 at 23:32
  • I'm going to mark this as answered because R Nar answered the question as posed (and see if I can figure this out on my own, or will post another question based on R Nar's routine). But I notify anyone using this: R Nar's solution does not return portions of the file that precede the first <p> and <s> element. – English Prof WRaabe Oct 31 '15 at 16:06
  • sorry, there should be an else statement in there. the `if i` line ensures that i is not 0 because you dont want to add the tag again before the first line. if you use the one liner list comprehension/join, it does include all the pre amble. it has been edited tho, sorry for that – R Nar Oct 31 '15 at 17:49
  • That fixes it. And that is so cool, because I have about 40,000 paragraphs and 90,000 sentences to number. – English Prof WRaabe Oct 31 '15 at 18:39
  • glad i could help (y) – R Nar Oct 31 '15 at 18:53

TypeError: must be str, not None

Your NumberParas(text) returns nothing

TypeError: Can't convert 'int' object to str implicitly

Convert int i to str with str(i)

  1. Do I need the **locals for local scoping of variables inside the for statement?

Not for scoping. locals() returns a dict of the current local variables, and ** unpacks that dict into keyword arguments for format, so the parameter dict is built automatically.
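
For comparison, here is a sketch of the same formatting done three ways. The function name `tag_variants` is made up for illustration; all three calls produce the same string, so `**locals()` is a convenience rather than a requirement:

```python
def tag_variants(paranumber):
    # locals() is a dict of this function's local names, here {'paranumber': ...}.
    with_locals = 'p id="{paranumber}"'.format(**locals())
    # Naming the keyword argument explicitly does the same job.
    explicit = 'p id="{paranumber}"'.format(paranumber=paranumber)
    # A positional field avoids naming the variable at all.
    positional = 'p id="{}"'.format(paranumber)
    return with_locals, explicit, positional


print(tag_variants(7))  # ('p id="7"', 'p id="7"', 'p id="7"')
```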

  2. Can this be done with RegEx? Despite many tries, I could not get the {} for replacing with variable value to work with regex

#!/usr/bin/env python3
import re

tok='####'
regex = re.compile(tok)

bar = 41
def foo(s):
    bar = 42
    return regex.sub("%(bar)i" % locals(), s)

s = 's id="####"'
print(foo(s))

output:

s id="42"
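
The example above substitutes one fixed value. To give every tag its own number, re.sub also accepts a function as the replacement, which is called once per match. The following is a sketch under assumptions: the helper name `number_tags`, the zero-based start, and the exact tag pattern are illustrative choices, not taken from the question's module.

```python
import itertools
import re


def number_tags(text, tag, start=0):
    """Replace the run of '#' in each <tag id="###..."> with a running number."""
    counter = itertools.count(start)
    # Group 1 is the opening part of the tag, group 2 the closing quote.
    pattern = re.compile(r'(<%s id=")#+(")' % re.escape(tag))

    def repl(match):
        # Called once per match, so each tag gets the next number.
        return match.group(1) + str(next(counter)) + match.group(2)

    return pattern.sub(repl, text)


sample = 'x <p id="####">A<s id="#####">b<s id="#####">c<p id="####">D'
sample = number_tags(sample, 'p')
sample = number_tags(sample, 's')
print(sample)  # x <p id="0">A<s id="0">b<s id="1">c<p id="1">D
```

Text outside the matched tags, including anything before the first tag, is left untouched by re.sub.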
decltype_auto