
First, I'd like to say that this place has helped me more than I could ever repay. Thanks to everyone who has helped me in the past :).

I am trying to divide up some text from a specific style of message. It is formatted like this:

DATA|1|TEXT1|STUFF: some random text|||||
DATA|2|TEXT1|THINGS: some random text and|||||
DATA|3|TEXT1|some more random text and stuff|||||
DATA|4|TEXT1|JUNK: crazy randomness|||||
DATA|5|TEXT1|CRAP: such random stuff I cant believe how random|||||

The code shown below combines the text, adding a space between words, and appends it to a string named "TEXT" so it looks like this:

STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random

I need it formatted like this:

DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|CRAP: |||||
DATA|9|NEWTEXT|such random stuff I cant believe how random|||||

The line numbers are easy; I have that done, as well as the carriage returns. What I need is to grab "CRAP" and change the part that says "TEXT1" to "NEWTEXT".

My code scans the string looking for keywords, puts each keyword on its own line, adds the text below it, then puts the next keyword on its own line, and so on. Here is the code I have so far:

#this combines all text to one line and adds to a string
while current_segment.move_next('DATA'):
    TEXT = TEXT + " " + current_segment.field(4).value

KEYWORD_LIST  = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']

#this splits the words up to search through
TEXT_list = TEXT.split(' ')

#this searches for the first few keywords then stops at the unwanted one
for word in TEXT_list:
    if word in KEYWORD_LIST:
        my_output = my_output + word
    elif word in KEYWORD_LIST1:
        break
    else:
        my_output = my_output + ' ' + word

#this searches for the unwanted keywords leaving the output blank until it reaches the wanted keyword
for word1 in TEXT_list:
    if word1 in KEYWORD_LIST:
        my_output1 = ''
    elif word1 in KEYWORD_LIST1:
        my_output1 = my_output1 + word1 + '\n'
    else:
        my_output1 = my_output1 + ' ' + word1

#my_output is formatted back the way I want, dividing up the text into lines of 65 characters or fewer

MAX_LENGTH = 65
my_wrapped_output  = wrap(my_output,MAX_LENGTH)
my_wrapped_output1 = wrap(my_output1,MAX_LENGTH)
my_output_list     = my_wrapped_output.split('\n')
my_output_list1    = my_wrapped_output1.split('\n')

for phrase in my_output_list:
     if phrase == "":
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|TEXT| |||||"
     else:
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|TEXT|" + phrase + "|||||"

for phrase2 in my_output_list1:
     if phrase2 == "":
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|NEWTEXT| |||||"
     else:
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|NEWTEXT|" + phrase2 + "|||||"

#this populates the fields I need
value = output

Then I format "my_output" and "my_output1", adding the word "NEWTEXT" where it goes. This code runs through each line looking for a keyword, then puts that keyword and a carriage return in. Once it reaches a keyword from "KEYWORD_LIST1" it stops and drops the rest of the text, then starts the next loop. My problem is that the above code gives me this:

DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|crazy randomness|||||
DATA|9|NEWTEXT|CRAP: |||||
DATA|10|NEWTEXT|such random stuff I cant believe how random|||||

It is grabbing the text from before "KEYWORD_LIST1" and adding it into the NEWTEXT section. I know there is a way to make groups from a keyword and the text after it, but I am unclear on how to implement it. Any help would be much appreciated.

Thanks.

This is what I had to do to get it to work for me:

KEYWORD_LIST  = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']

def text_to_message(text):
    result=[]
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result=[]
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)

def format_messages(messages):
    title='TEXT1'
    for message in messages:
        if message in KEYWORD_LIST:
            title='TEXT1'
        elif message in KEYWORD_LIST1:
            title='NEWTEXT'
        my_wrapped_output  = wrap(message,MAX_LENGTH)
        my_output_list     = my_wrapped_output.split('\n')
        for line in my_output_list:
            if line == '':
                yield title + '|'
            else:
                yield title + '|' + line

for line in format_messages(text_to_message(TEXT)):
    if line == '':
        SetID +=1
        output = "DATA|" + str(SetID) + "|"
    else:
        SetID +=1
        output = "DATA|" + str(SetID) + "|" + line

#this is needed instead of print(line)
value = output 
Opy
  • As a matter of convention, `ALL_CAPS` and normal case aren't mixed in variable names. Your `TEXT_list` might be more aptly named `text_list`. Just a minor note. – brc Sep 19 '11 at 22:19
  • I might try the `csv` module rather than doing it yourself. – Dave Sep 19 '11 at 22:42
  • brc, that must have bled through from my Java coding lol. I updated the above to give more details. – Opy Sep 20 '11 at 12:58
  • @Opy: Please show the input (`TEXT`?) and the desired output. – unutbu Sep 20 '11 at 13:05
  • Sorry, TEXT is the long line of data after all lines are combined. TEXT_list is created at the "split" part of the code. I put in the rest of my code so you can see how it ends up formatted the way I want. I should have added it to begin with, but I thought a simple fix to my if/else statement would fix it. It looks like I will have to go with regex. – Opy Sep 20 '11 at 13:32

1 Answer

  1. General tip: Don't try to build up strings accretively like this:

    my_output = my_output + ' ' + word
    

    instead, make my_output a list, append word to the list, and then, at the very end, do a single join: my_output = ' '.join(my_output). (See text_to_message code below for an example.) Using join is the right way to build strings. Delaying the creation of the string is useful because processing lists of substrings is more pleasant than splitting and unsplitting strings, and having to add spaces and carriage returns here and there.

  2. Study generators. They are easy to understand, and can help you a lot when processing text like this.
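
A minimal sketch of both tips side by side. The helper names `build_sentence` and `numbered` are made up for illustration and do not come from the question:

```python
def build_sentence(words):
    """Collect the pieces in a list and join once at the end."""
    pieces = []
    for word in words:
        pieces.append(word)
    # A single join replaces repeated `s = s + ' ' + word` concatenation.
    return ' '.join(pieces)


def numbered(lines, start=1):
    """Generator: yield (number, line) pairs one at a time."""
    num = start
    for line in lines:
        yield num, line
        num += 1


sentence = build_sentence(['some', 'random', 'text'])
pairs = list(numbered(['STUFF:', sentence]))
```

Because `numbered` is a generator, nothing is computed until something iterates over it, which is what makes it easy to chain several small processing steps together.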


import textwrap

KEYWORD_LIST  = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']

def text_to_message(text):
    result=[]
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result=[]
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)

def format_messages(messages):
    title='TEXT1'
    num=1
    for message in messages:
        if message in KEYWORD_LIST:
            title='TEXT1'
        elif message in KEYWORD_LIST1:
            title='NEWTEXT'
        for line in textwrap.wrap(message,width=65):
            yield 'DATA|{n}|{t}|{l}'.format(n=num,t=title,l=line)
            num+=1

TEXT='''STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random'''

for line in format_messages(text_to_message(TEXT)):
    print(line)
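
The desired output in the question ends every record with `|||||`, which the lines produced above do not include. One way to add it, sketched here under the assumption that the trailer is always the same five pipes, is one more small generator chained onto the end (`add_trailer` is a hypothetical name):

```python
def add_trailer(lines, trailer='|||||'):
    """Append a fixed trailer to each record.

    Assumption: every record ends with the same five pipes.
    """
    for line in lines:
        yield line + trailer


# Stand-in for the lines that format_messages would yield.
sample = ['DATA|1|TEXT1|JUNK:', 'DATA|2|TEXT1|crazy randomness']
finished = list(add_trailer(sample))
```

In the full script this would be used as `add_trailer(format_messages(text_to_message(TEXT)))`. Note that the question's sample shows a space after each keyword (`STUFF: |||||`); this sketch does not add one.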
unutbu
  • I was hoping I didn't leave too much detail out, but I think I did. Before the code I listed above, it takes all the words and combines them, adding a space between each word. So really it is looking for keywords, putting each keyword on its own line, then adding the text below that while looking for the next keyword. Once it gets through the above code, it divides the result into lines no more than 65 characters long using some other code. So TEXT_list above is one line with all the text in it, leaving out the DATA|1|TEXT1| and the trailing |||||. Also, the keywords may be different from each other. – Opy Sep 20 '11 at 12:32
  • I edited my post above to be more clear. I will check out what you posted because it looks like it can help :) – Opy Sep 20 '11 at 12:38
  • I edited the question to make it more exact. Thanks so far for your help :) – Opy Sep 20 '11 at 15:56
  • Much better but it doesn't like the word "yield". I'm getting a syntax error on all rows that have "yield". – Opy Sep 20 '11 at 18:35
  • Are you mixing `return` and `yield` in the same function? That's a no-no. If not, you've got to post your code. – unutbu Sep 20 '11 at 18:44
  • My code is exactly as you posted above. I can put a # in front of the line that has yield and the syntax error shows the next line that says yield. It keeps doing it for every line that has yield in it. Is there another way of getting the data besides yield? – Opy Sep 20 '11 at 18:58
  • OK, I had to add "from __future__ import generators" to enable it in this software. I guess it's only using Python 2.2. My next issue is that I don't think this software has the textwrap module to import. And here I was thinking this would be a simple project lol. – Opy Sep 20 '11 at 19:16
  • @Opy: Hm, well, your own code used some function called `wrap`. Perhaps substitute it for `textwrap.wrap`. – unutbu Sep 20 '11 at 20:12
  • The wrap function was in another script in this program (there are around 20 places for different Python scripts lol). They probably wrote that script because textwrap was missing. I am also going in behind someone else's scripts and trying to work with what they have. That's why all the "output = output + word" and other funky stuff lol. This program has a limited amount of Python, not the entire language. – Opy Sep 20 '11 at 21:22
  • It still needs some tweaking but it works! It is putting one letter per line so I just need to fix whatever I screwed up there lol. Thanks a million! – Opy Sep 20 '11 at 22:37
  • OK, I got it working 100%. I posted my working code above. Thanks again for the help! :) – Opy Sep 21 '11 at 00:22
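
For anyone else stuck on an interpreter without `textwrap`, here is a rough sketch of a greedy word wrap that returns a list of lines, as `textwrap.wrap` does. The name `simple_wrap` is made up, and it is not the in-house `wrap` mentioned in the comments (that one appears to return a single newline-joined string, judging by the `.split('\n')` in the working code). The "one letter per line" symptom is what you would see if a string were looped over directly instead of a list of lines, which is a likely cause when swapping one kind of wrap function for the other.

```python
def simple_wrap(text, width=65):
    """Greedy word wrap: return a list of lines at most `width` long.

    Assumption: a single word longer than `width` is kept whole on
    its own line rather than being broken.
    """
    lines = []
    current = []   # words collected for the line being built
    length = 0     # length of that line once joined with spaces
    for word in text.split():
        if current:
            needed = length + 1 + len(word)  # +1 for the joining space
        else:
            needed = len(word)
        if current and needed > width:
            # The word does not fit: finish this line, start a new one.
            lines.append(' '.join(current))
            current = [word]
            length = len(word)
        else:
            current.append(word)
            length = needed
    if current:
        lines.append(' '.join(current))
    return lines


wrapped = simple_wrap('aaa bbb ccc', 7)

# Looping over a string yields characters, not lines.
chars = [piece for piece in 'ab']
```

The function avoids conditional expressions and `str.format`, since both were added to Python after 2.2, but it has only been thought through against modern Python.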