import re

# Matches the contents of one <td>...</td> cell
p = re.compile(r"<td.*?>(.*?)</td>")
in_table = False
row = ""
with open("test.html", "r") as fr:
    for line in fr:
        if "<table" in line:
            in_table = True
        elif "</table>" in line and in_table:
            in_table = False
        # search() only finds the first cell on a line
        m = p.search(line)
        if m and in_table:
            row += '"' + m.group(1) + '"' + ","
        if "</tr>" in line and in_table:
            print(row)
            row = ""

The code creates a CSV file from an HTML table. Is there a better or more efficient way to write this? I'm not looking for any html parsers.

Martijn Pieters
Mike L
  • "I'm not looking for any html parsers." - why? _That_ would be a better way. – georg Nov 12 '13 at 12:32
  • I like to code something i need first before using other's code. i have a problem if there are two in one line any suggestions? – Mike L Nov 12 '13 at 12:38
  • your code assumes that the html is split by newlines which is not always true, the whole table can be in one line. i'd go for parsers too. – Foo Bar User Nov 12 '13 at 12:54
  • Please see the highest voted answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Hyperboreus Nov 12 '13 at 14:00
  • Concerning your regex: what happens if your html contains e.g. ``? Seriously, you can't parse a non-regular grammar with a regular expression. – Hyperboreus Nov 12 '13 at 14:03
  • "I like to code something i need first before using other's code." - In the "real world", you use the tools out there that have solved the basic problems; so you can solve the _actual_ problem. This is not a new problem; and you are using the _wrong tool_ (regular expressions) to solve a problem that is solved by a parsing engine. If you want to solve it yourself; consider splitting the document into tokens and write a token parser. – Burhan Khalid Dec 22 '13 at 15:35
  • You'll have more chance at: http://codereview.stackexchange.com/ – Toto Dec 22 '13 at 17:28
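
The token-based approach suggested in the comments could look roughly like the sketch below. It is only an illustration, not code from the thread: the names `TOKEN`, `TAG_NAME`, `table_rows` and `to_csv` are made up here, and it assumes simple markup (no nested tables, no `>` inside attribute values or comments). Because it works on the whole document as one string, it does not care whether the table spans many lines or just one.

```python
import csv
import io
import re

# A token is either a complete tag ("<...>") or a run of text between tags.
TOKEN = re.compile(r"<[^>]*>|[^<]+")
# Extracts the tag name from an opening or closing tag.
TAG_NAME = re.compile(r"</?\s*([A-Za-z0-9]+)")


def table_rows(html):
    """Yield one list of cell strings per <tr> found inside a <table>."""
    in_table = False
    row = None    # cells of the current row, or None outside <tr>
    cell = None   # text fragments of the current cell, or None outside a cell
    for token in TOKEN.findall(html):
        if not token.startswith("<"):
            # Text token: keep it only while a cell is open.
            if cell is not None:
                cell.append(token)
            continue
        m = TAG_NAME.match(token)
        if not m:
            continue  # comments, doctype and similar are skipped
        name = m.group(1).lower()
        closing = token.startswith("</")
        if name == "table":
            in_table = not closing
        elif not in_table:
            continue
        elif name == "tr":
            if closing:
                if row is not None:
                    yield row
                row = None
            else:
                row = []
        elif name in ("td", "th"):
            if closing:
                if row is not None and cell is not None:
                    row.append("".join(cell).strip())
                cell = None
            else:
                cell = []


def to_csv(html):
    """Return the table rows as CSV text; csv.writer handles the quoting."""
    buf = io.StringIO()
    csv.writer(buf).writerows(table_rows(html))
    return buf.getvalue()
```

Something like `print(to_csv(open("test.html").read()))` would then replace the line-by-line loop. Markup nested inside a cell (for example `<b>`) is dropped and only its text is kept, which may or may not be what you want.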

1 Answer


Maybe something like this:

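import re

# Captures the text between <td ...> and </td>
cell = re.compile(r'<td.*?>(.*?)</td>')
# "output.csv" is just an example file name
with open("test.html") as fr, open("output.csv", "w") as output:
    for line in fr:
        # findall returns every cell on the line, not only the first
        cells = cell.findall(line)
        if cells:
            # quote each cell as in the question, one CSV line per input line
            output.write(','.join('"' + c + '"' for c in cells) + '\n')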
Pivopija
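
For the "two in one line" problem mentioned in the comments, a line-based variant that stays close to the question's structure might look like the sketch below. It is a hedged illustration only: `CELL`, `rows_from_lines` and `write_csv` are hypothetical names, and it still assumes that each `</tr>` ends at most one row per line. It swaps `search` for `findall` so every cell on a line is collected, and it hands quoting to the standard `csv` module instead of concatenating quote characters by hand.

```python
import csv
import re

# Captures the contents of each <td>...</td> cell on a line.
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE)


def rows_from_lines(lines):
    """Yield one list of cells per table row, reading the HTML line by line."""
    in_table = False
    row = []
    for line in lines:
        if "<table" in line:
            in_table = True
        if in_table:
            # findall picks up every cell on the line, not just the first.
            row.extend(CELL.findall(line))
            if "</tr>" in line:
                yield row
                row = []
        if "</table>" in line:
            in_table = False


def write_csv(lines, out):
    """Write the rows to the file-like object `out` as CSV."""
    csv.writer(out).writerows(rows_from_lines(lines))
```

A call such as `write_csv(open("test.html"), sys.stdout)` (after `import sys`) would print the result. Cells containing commas or double quotes are escaped by `csv.writer`, which the hand-built string in the question does not do.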