Python txt file tag parse

Question

I am trying to parse out the contents of two different tags in a txt file and I am getting all the instances of the first tag "p" but not the second "l". Is the problem with the "or"?

Thanks for the help. Here is the code I am using:

with open('standardA00456.txt','w') as output_file:
    with open('standardA00456.txt','r') as open_file:
            the_whole_file = open_file.read()
            start_position = 0

            while True:

               start_position = the_whole_file.find('<p>' or '<l>', start_position)

               end_position = the_whole_file.find('</p>' or '</l>', start_position)
               data = the_whole_file[start_position:end_position+5]


               output_file.write(data + "\n")
               start_position = end_position

This is an HTML or XML file, not a plain text file, right? Because text files don't have "tags", they have no more structure beyond characters and lines. And I'm not bringing this up to be pedantic; if you want to parse HTML or XML, you should be using a parser, like `ElementTree` or `BeautifulSoup`, not trying to do it this way. — abarnert, Aug 10 '14 at 15:28
I appreciate your comment, and if I were trying to parse an XML file that would be the proper way. The file I have is a txt file that has been manually marked up with HTML-like tags. I don't know how to use ElementTree, so an example would be helpful. — English Grad, Aug 10 '14 at 15:42
@EnglishGrad: Then google for "ElementTree example" or "ElementTree tutorial" or similar. Any example someone gives you in a comment here will be nowhere near as good. — abarnert, Aug 10 '14 at 16:20

score 1 · Accepted Answer · edited May 23 '17 at 12:11

'<p>' or '<l>' always evaluates to '<p>'. The `or` operator returns its first operand if that operand is truthy, and only falls through to '<l>' if '<p>' is None, False, numeric zero, or empty. The non-empty string '<p>' is never one of those, so '<l>' is always skipped:

>>> '<p>' or '<l>'
'<p>'
>>> None or '<l>'
'<l>'
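
To see the effect on the question's code, here is a small sketch with a made-up sample string (the `sample` text is an assumption, not taken from the real file). The `or` is evaluated before `find` is called, so `find` only ever receives '<p>':

```python
# Hypothetical sample text; the real file's contents are not shown in the question.
sample = "<l>first line</l> some text <p>a paragraph</p>"

# 'or' is resolved first, so this call is identical to sample.find('<p>').
pos_with_or = sample.find('<p>' or '<l>')
pos_p = sample.find('<p>')
pos_l = sample.find('<l>')

print(pos_with_or == pos_p)  # the '<l>' at index 0 is never searched for

# To search for either tag, call find() once per tag and keep the earliest hit.
hits = [pos for pos in (pos_p, pos_l) if pos != -1]
first_tag_pos = min(hits) if hits else -1
print(first_tag_pos)
```

Calling `find` once per tag and comparing the positions is one way to get the "either tag" behaviour the `or` was meant to express.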

Instead you can easily use re.findall:

import re

# Read first: opening the same path with 'w' would truncate it before it is read.
with open('standardA00456.txt', 'r') as open_f:
    text = open_f.read()

p_or_ls = re.findall(r'(?:<p>.*?</p>)|(?:<l>.*?</l>)',
                     text,
                     flags=re.DOTALL)  # DOTALL lets '.' match newline characters

with open('standardA00456.txt', 'w') as out_f:
    for p_or_l in p_or_ls:
        out_f.write(p_or_l + "\n")
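
If only the text inside the tags is wanted, a backreference keeps the pattern short and makes sure each opening tag is paired with its own closing tag. A minimal sketch on an in-memory string (the `sample` text is invented for illustration, and it assumes the tags are not nested):

```python
import re

# Hypothetical sample; stands in for the contents of the marked-up file.
sample = "<p>one\ntwo</p> filler <l>three</l> <p>four</p>"

# Group 1 captures the tag name, \1 requires the same name in the closing tag,
# group 2 captures the contents. DOTALL lets '.' span line breaks.
pattern = re.compile(r'<(p|l)>(.*?)</\1>', flags=re.DOTALL)

matches = pattern.findall(sample)
for tag, contents in matches:
    print(tag, repr(contents))
```

With two groups in the pattern, `findall` returns a list of `(tag, contents)` tuples rather than whole matches.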

However, parsing files with tags (such as HTML and XML) using regex is fragile, for example when tags are nested or carry attributes. Using a parser module such as BeautifulSoup is safer:

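from bs4 import BeautifulSoup
# Read first: opening the same path with 'w' would truncate it before it is read.
with open('standardA00456.txt', 'r') as open_f:
    soup = BeautifulSoup(open_f.read(), 'html.parser')  # name the parser explicitly
with open('standardA00456.txt', 'w') as out_f:
    for p_or_l in soup.find_all(["p", "l"]):
        # find_all yields Tag objects; convert to str before concatenating
        out_f.write(str(p_or_l) + "\n")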
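
If installing a third-party package is not an option, the standard library's `html.parser` can do something similar. This is a rough sketch rather than a drop-in replacement: the class name `TagCollector` and the sample string are made up for illustration, and it assumes the wanted tags are not nested inside each other:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text inside <p> and <l> elements (hypothetical helper)."""

    def __init__(self, wanted=("p", "l")):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # name of the wanted tag we are inside, if any
        self.buffer = []      # text pieces seen since that tag opened
        self.found = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if self.current is None and tag in self.wanted:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.found.append((tag, "".join(self.buffer)))
            self.current = None


collector = TagCollector()
collector.feed("<p>one</p> filler <l>two\nthree</l>")
collector.close()
print(collector.found)
```

The parser does not care that `<l>` is not a real HTML element; it reports any tag name it encounters.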

Minor quibble: Falsey is usually described as "`None`, `False`, **numeric zero**, or empty". Sure, `0` is literally empty if you're defining numbers on top of set theory, but I don't think most people think of numbers that way. — abarnert, Aug 10 '14 at 16:19
@abarnert You're right, thank you. I edited my answer to correct that. — dwitvliet, Aug 10 '14 at 16:21

score 0 · Answer 2 · answered Aug 10 '14 at 16:20

English Grad, I think you need to improve the logic. I modified your code and came up with this:

# Read the whole input before opening the same path for writing;
# mode 'w' truncates the file as soon as it is opened.
with open('standardA00456.txt', 'r') as open_file:
    the_whole_file = open_file.read()

with open('standardA00456.txt', 'w') as output_file:
    start_position = 0

    found_p = False
    found_l = False

    while True:
        start_pos_p = the_whole_file.find('<p>', start_position)
        start_pos_l = the_whole_file.find('<l>', start_position)

        if start_pos_p > -1 and start_pos_l > -1:
            # Both tags still occur: take whichever opens first.
            if start_pos_p < start_pos_l:
                found_p = True
                start_position = start_pos_p
                found_l = False
            else:
                found_l = True
                start_position = start_pos_l
                found_p = False
        elif start_pos_p > -1:
            found_p = True
            start_position = start_pos_p
            found_l = False
        elif start_pos_l > -1:
            found_l = True
            start_position = start_pos_l
            found_p = False
        else:
            # No opening tag of either kind is left.
            break

        if found_p:
            end_position = the_whole_file.find('</p>', start_position)
        elif found_l:
            end_position = the_whole_file.find('</l>', start_position)
        else:
            break

        if end_position == -1:
            # Opening tag without a closing tag: stop instead of slicing a bogus range.
            break

        # '</p>' and '</l>' are 4 characters long, so +4 ends the slice right after the closing tag.
        data = the_whole_file[start_position:end_position + 4]
        output_file.write(data + "\n")
        start_position = end_position
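
The same scan can be folded into a small function that takes the tag names as a parameter, which makes it easier to try on a string before touching the file. This is only a sketch: `iter_tagged` is a name made up here, and it assumes the tags are not nested and carry no attributes:

```python
def iter_tagged(text, tags=("p", "l")):
    """Yield each '<tag>...</tag>' chunk in order of appearance (hypothetical helper)."""
    position = 0
    while True:
        # Find the next opening tag of each kind and keep those that exist.
        candidates = []
        for tag in tags:
            start = text.find("<%s>" % tag, position)
            if start != -1:
                candidates.append((start, tag))
        if not candidates:
            return
        start, tag = min(candidates)  # earliest opening tag wins

        closing = "</%s>" % tag
        end = text.find(closing, start)
        if end == -1:
            return  # unclosed tag: stop rather than guess
        end += len(closing)
        yield text[start:end]
        position = end


chunks = list(iter_tagged("<l>a</l> x <p>b\nc</p> y <l>d</l>"))
print(chunks)
```

Using `len(closing)` instead of a hard-coded offset keeps the slice correct if longer tag names are added later.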

This is a bit excessive, when the same can be done with 5 lines of code. — dwitvliet, Aug 10 '14 at 16:30
What if someone doesn't want to use regular expressions or a parsing library like BeautifulSoup? I just improved his code a bit, and this should solve his problem. — Tamim Shahriar, Aug 10 '14 at 16:41
