Extract strings in python

Question

Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD" from a text file...

...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
....(more text).....
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
......(more text).....

I want something like if I do:-

data = foo("file.txt")

I get:-

data = ['AAA','BBB','CCC','DDD']

What is the best possible way? My file is not big...

Basically, I want to extract "remaining upload data transfer" from this file which in HTML looks like THIS

score 2 · Accepted Answer · answered Mar 17 '10 at 17:48

You could write a regex, but it would be "parsing" the HTML to some extent. The problem with writing regular expressions for HTML is that real-world HTML is a mess: it's rarely well-formed, and that causes problems when you rely on it for data.
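For illustration only, here is a minimal sketch of what the regex route might look like, assuming every target element is written exactly like the sample in the question (a `font` tag whose only attribute is a quoted `class` of `textfont`). The names `FONT_RE` and `extract_with_regex` are made up for this example.

```python
import re

# Assumption: each target looks like <font class='textfont'>...</font>,
# with the class value in single or double quotes and no other attributes.
FONT_RE = re.compile(
    r"""<font\s+class=['"]textfont['"]\s*>(.*?)</font>""",
    re.IGNORECASE | re.DOTALL,
)

def extract_with_regex(html):
    """Return the stripped text of each matching <font> element, in document order."""
    return [match.strip() for match in FONT_RE.findall(html)]

sample = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
"""
print(extract_with_regex(sample))
# ['AAA', 'BBB']
```

The fragility shows up quickly: an unquoted `class=textfont`, an extra attribute, or a nested tag inside the `font` element would all slip past or confuse this pattern.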

I would personally use BeautifulSoup. It does more than you're asking for, but at a fraction of the effort.

score 0 · Answer 2 · answered Mar 17 '10 at 17:40

You want BeautifulSoup:

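from bs4 import BeautifulSoup  # BeautifulSoup 4; BeautifulSoup 3 used `from BeautifulSoup import BeautifulSoup`
soup = BeautifulSoup(your_file, "html.parser")

# find() returns only the first match; find_all() returns every <font class="textfont">
[tag.get_text() for tag in soup.find_all("font", "textfont")]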

Dominic Rodger


I want to do it without using a third-party library, because I don't really want HTML processing. My aim is just to extract those strings. – shadyabhi Mar 17 '10 at 17:42
@shadyabhi, Not using a library is a silly goal. An HTML parser is the right tool for what you are trying to do (which is parsing HTML) and provides a way to write a simple, concise function. – Mike Graham Mar 17 '10 at 17:46
@Dominic, lxml is probably a better choice these days, as it is still actively developed. – Mike Graham Mar 17 '10 at 17:46

score 0 · Answer 3 · answered Mar 17 '10 at 17:50

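def foo(path="myfile.txt"):
    # Read the whole file at once; `with` closes it even if reading fails.
    with open(path) as input_file:
        text = input_file.read()  # avoid shadowing the built-in input()

    looking_for = ['AAA', 'BBB', 'CCC', 'DDD']

    # Keep each known string that occurs anywhere in the file,
    # in the order given by looking_for.
    return [thing for thing in looking_for if thing in text]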

inspectorG4dget


I think that won't preserve the ordering if more than one item is present in the same line... – fortran Mar 17 '10 at 17:59
I don't know what you mean by "ordering". I see no such specification in the question. And my algorithm will find all the strings in looking_for that are in the html, even if they are in the same line. – inspectorG4dget Mar 19 '10 at 01:28

score 0 · Answer 4 · answered Mar 17 '10 at 17:51

In a case like this, your options are: attempt a regex for it (which will be really hard), use a prewritten library, or do it yourself with f = open(), f.read() and your own parser.
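As a rough idea of the "do it yourself" option, here is a minimal sketch that scans for literal start and end markers with `str.find`. It assumes the markers appear in the file exactly as in the question's sample; `scan_between` is a hypothetical helper name, and `foo` simply mirrors the call shown in the question.

```python
def scan_between(text, start="<font class='textfont'>", end="</font>"):
    """Collect the stripped text between each start/end marker pair, in order."""
    found = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no more start markers
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break  # unterminated element; give up rather than guess
        found.append(text[begin:stop].strip())
        pos = stop + len(end)
    return found

def foo(path):
    """Read the file at `path` and return the marked strings it contains."""
    with open(path) as f:
        return scan_between(f.read())
```

With the file from the question, `data = foo("file.txt")` should then yield the four strings in document order. Like any hand-rolled scanner, this breaks as soon as the markup deviates from the exact marker text.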

zellio


score 0 · Answer 5 · edited May 23 '17 at 11:48

If you just want to get the data from inside all of the tags in the HTML document, while dropping all the tags themselves, you could do something like this:

from html.parser import HTMLParser  # Python 3; in Python 2 this was the HTMLParser module

class DataOnlyParser(HTMLParser):
    def parse(self, text):
        # Feed the whole document and collect the text nodes seen along the way.
        self.result = []
        self.feed(text)
        self.close()
        return self.result

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only chunks.
        data = data.strip()
        if data:
            self.result.append(data)

p = DataOnlyParser()

data = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""

print(p.parse(data))
# ['AAA', 'BBB', 'CCC', 'DDD']

If your selection criteria are more complex though, and/or if the input is malformed, you'd probably be better off with a library like lxml.
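For criteria that are only slightly more complex, the standard library may still be enough. The following is a minimal sketch, assuming Python 3 (`html.parser`) and that the wanted elements are `font` tags whose `class` attribute is exactly `textfont`; `FontTextParser` and `extract` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

class FontTextParser(HTMLParser):
    """Collect text only from inside <font class="textfont"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a matching <font>
        self.result = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "font" and (self.depth or dict(attrs).get("class") == "textfont"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if self.depth and data:
            self.result.append(data)

def extract(html):
    """Return the text of every matching element, in document order."""
    parser = FontTextParser()
    parser.feed(html)
    parser.close()
    return parser.result

sample = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<p>unrelated text</p>
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
"""
print(extract(sample))
# ['AAA', 'BBB']
```

Unlike the data-only parser, this ignores text outside the matching elements, which matters when the file contains the "other text" the question mentions.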

You do NOT want to use regular expressions to "parse" html. See here.
