Is there a better or more efficient way to code this?

Question

import re
fr=open("test.html",'r')
i,j,tablestart=0,0,0
str=""
p=re.compile("<td.*?>(.*?)<\/td>")
for line in fr:
    if "<table" in line:
        tablestart=1
    elif "</table>" in line and tablestart==1:
        j,tablestart=0,0
    m=p.search(line)
    if m and tablestart==1:
        str+='"' + m.group(1) + '"' + ","
    if "</tr>" in line and tablestart==1:
        print(str)
        str=""

The code creates a CSV file from an HTML table. Is there a better or more efficient way to code this? I'm not looking for any html parsers.

"I'm not looking for any html parsers." - why? _That_ would be a better way. — georg, Nov 12 '13 at 12:32
I like to code something i need first before using other's code. I have a problem if there are two <td> cells in one line. Any suggestions? — Mike L, Nov 12 '13 at 12:38
your code assumes that the html is split by newlines which is not always true, the whole table can be in one line. i'd go for parsers too. — Foo Bar User, Nov 12 '13 at 12:54
Please see the highest voted answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Hyperboreus, Nov 12 '13 at 14:00
Concerning your regex: what happens if your html contains e.g. ``? Seriously, you can't parse a non-regular grammar with a regular expression. — Hyperboreus, Nov 12 '13 at 14:03
"I like to code something i need first before using other's code." - In the "real world", you use the tools out there that have solved the basic problems; so you can solve the _actual_ problem. This is not a new problem; and you are using the _wrong tool_ (regular expressions) to solve a problem that is solved by a parsing engine. If you want to solve it yourself; consider splitting the document into tokens and write a token parser. — Burhan Khalid, Dec 22 '13 at 15:35
You'll have more chance at: http://codereview.stackexchange.com/ — Toto, Dec 22 '13 at 17:28
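
For reference, the parser route that several comments recommend does not need a third-party package. Below is a minimal sketch using the standard library's html.parser module. The names TableToRows and table_to_csv are made up for illustration, and the sketch assumes a flat table with no nested tables. It collects the text of each <td> or <th> cell row by row and leaves quoting to the csv module.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside a <tr>
        self._cell = None  # text chunks of the cell being read, None outside a cell

    def _end_cell(self):
        # Store the current cell, if any, in the current row.
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _end_row(self):
        # Store the current row, if any, closing a cell left open.
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()


def table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Because the parser works on tags rather than lines, it makes no difference whether the whole table sits on one line or is spread over many. Calling table_to_csv(open("test.html").read()) would give the CSV text for the file from the question.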

score 1 · Answer 1 · answered Dec 22 '13 at 15:19

Maybe something like this:

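import re

# Capture the text between <td ...> and </td>; findall returns every cell on a line.
cell = re.compile(r'<td.*?>(.*?)</td>')

# "output.csv" is an assumed file name; "test.html" is the input from the question.
with open("test.html") as fr, open("output.csv", "w") as output:
    for line in fr:
        for value in cell.findall(line):
            # One cell per output line; an empty cell is written as a bare comma.
            output.write(value + ',' + '\n')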
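
If you want to stay with regular expressions, here is a rough sketch that also copes with several cells on one line and with a table that is not split by newlines. It reads the whole document at once, uses re.findall with re.DOTALL, and lets the csv module do the quoting. The names ROW, CELL, rows_from_html and to_csv are made up for illustration. It still assumes simple markup: no nested tables and no tags inside cells.

```python
import csv
import io
import re

# Lazy matches with DOTALL so a row or cell may span several lines.
ROW = re.compile(r"<tr\b.*?>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL = re.compile(r"<td\b.*?>(.*?)</td>", re.DOTALL | re.IGNORECASE)


def rows_from_html(html):
    """Return one list of cell strings per <tr> found in `html`."""
    return [[c.strip() for c in CELL.findall(row)] for row in ROW.findall(html)]


def to_csv(rows):
    """Return `rows` as CSV text; the csv module handles commas and quotes."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

For the file in the question this would be used as to_csv(rows_from_html(open("test.html").read())). Building the rows first and writing them with csv.writer avoids the trailing comma and the hand-made quoting of the original loop.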
