I have two sets of data that I scraped from an HTML table using regular expressions.

data:

 <div class = "info"> 
   <div class="name"><td>random</td></div>
   <div class="hp"><td>123456</td></div>
   <div class="email"><td>random@mail.com</td></div> 
 </div>

 <div class = "info"> 
   <div class="name"><td>random123</td></div>
   <div class="hp"><td>654321</td></div>
   <div class="email"><td>random123@mail.com</td></div> 
 </div>

regex:

import re

# match3 holds the HTML that an earlier regex extracted using the parent tags
matchname = re.search(r'<div class="name"><td>(.*?)</td>', match3).group(1)
matchhp = re.search(r'<div class="hp"><td>(.*?)</td>', match3).group(1)
matchemail = re.search(r'<div class="email"><td>(.*?)</td>', match3).group(1)

Using these regexes I can extract:

random

123456

random@mail.com

After saving this set of data into my database, I want to save the next set. How do I get the next set of data? I tried using findall and then inserting into my DB, but everything ended up on one line. I need the data to go into the DB set by set.

I'm new to Python. Please comment on any part that is unclear and I will try to edit.

JustASimpleGuy
  • You shouldn't be parsing HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Keatinge May 20 '16 at 02:15
  • Could you either post the parent tags of the html blocks you've presented or the complete HTML input you've got? Thanks. – alecxe May 20 '16 at 02:29
  • @alecxe I've added the parent tags. The regex I am using is the second stage; the first one used the parent tags to extract the block. – JustASimpleGuy May 20 '16 at 08:09
  • If you have a follow-up problem, please consider creating a separate question. Thanks. – alecxe May 23 '16 at 03:47

2 Answers

You should not be parsing HTML with regex; it quickly becomes a mess. Use BeautifulSoup (BS4) instead. Doing it the right way:

from bs4 import BeautifulSoup

soup = BeautifulSoup(match3, "html.parser")
names = []
all_tds = soup.find_all("td")
# Every third <td> starts a new record: name, hp, email.
for i, item in enumerate(all_tds[::3]):
    #             name       hp                       email
    names.append((item.text, all_tds[i * 3 + 1].text, all_tds[i * 3 + 2].text))

And for the sake of answering the question as asked, I'll include a horribly ugly regex that you should never use, especially because it's HTML. Don't ever use regex for parsing HTML. (Please don't use this.)

for thisMatch in re.findall(r"<td>(.+?)</td>.+?<td>(.+?)</td>.+?<td>(.+?)</td>", match3, re.DOTALL):
    print(thisMatch[0], thisMatch[1], thisMatch[2])
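
If regex really is a requirement, one variant is to match a whole record at a time with named groups, so the three fields of each set stay together. This is only a minimal sketch against the sample HTML from the question; the names `html`, `RECORD`, and `records` are made up for illustration, and it still breaks as soon as the markup changes:

```python
import re

# Sample input copied from the question.
html = """
<div class = "info">
  <div class="name"><td>random</td></div>
  <div class="hp"><td>123456</td></div>
  <div class="email"><td>random@mail.com</td></div>
</div>
<div class = "info">
  <div class="name"><td>random123</td></div>
  <div class="hp"><td>654321</td></div>
  <div class="email"><td>random123@mail.com</td></div>
</div>
"""

# One pattern per record; named groups keep name, hp and email together.
# re.DOTALL lets ".*?" skip across the newlines between the divs.
RECORD = re.compile(
    r'<div class="name"><td>(?P<name>.*?)</td>.*?'
    r'<div class="hp"><td>(?P<hp>.*?)</td>.*?'
    r'<div class="email"><td>(?P<email>.*?)</td>',
    re.DOTALL,
)

# finditer yields one match per set, so each dict is one row for the DB.
records = [m.groupdict() for m in RECORD.finditer(html)]
for record in records:
    print(record)
```

Each element of `records` is one set, so it can be inserted into the database one row at a time.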
Keatinge
  • I see people screaming not to use regex for parsing HTML, but what does Beautiful Soup, or lxml use internally to parse the HTML? Also why is it bad to use Regex to parse HTML? – smac89 May 20 '16 at 02:37
  • @Smac89 Because this whole thing will break if the website gets updated and the td tag changes in any way. BeautifulSoup knows how to handle those things. This regex will only catch things in between `<td>` and `</td>` exactly, while BS4 will work regardless. – Keatinge May 20 '16 at 02:38
  • @Racialz This is a demo website I created, so I won't ever update it, and I was asked not to use any parser since regex is enough. Thanks for answering either way. – JustASimpleGuy May 20 '16 at 08:11

As @Racialz pointed out, you should look into using HTML parsers instead of regular expressions.

Let's also use BeautifulSoup, as @Racialz did, but build a more robust solution: find all the info elements and look up each field inside them, producing a list of dictionaries:

from pprint import pprint

from bs4 import BeautifulSoup

data = """
<div>
    <div class = "info">
       <div class="name"><td>random</td></div>
       <div class="hp"><td>123456</td></div>
       <div class="email"><td>random@mail.com</td></div>
    </div>

    <div class = "info">
       <div class="name"><td>random123</td></div>
       <div class="hp"><td>654321</td></div>
       <div class="email"><td>random123@mail.com</td></div>
    </div>
</div>
 """
soup = BeautifulSoup(data, "html.parser")

fields = ["name", "hp", "email"]

result = [
    {field: info.find(class_=field).get_text() for field in fields}
    for info in soup.find_all(class_="info")
]

pprint(result)

Prints:

[{'email': 'random@mail.com', 'hp': '123456', 'name': 'random'},
 {'email': 'random123@mail.com', 'hp': '654321', 'name': 'random123'}]
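
Since the goal is to store the data set by set, a list of dictionaries like this maps naturally onto one database row per dictionary. Here is a rough sketch using the standard library's `sqlite3`; the choice of SQLite, the in-memory database, and the `contacts` table are assumptions for illustration only, so adapt them to whatever database you actually use:

```python
import sqlite3

# Same shape as the `result` list printed above.
result = [
    {"name": "random", "hp": "123456", "email": "random@mail.com"},
    {"name": "random123", "hp": "654321", "email": "random123@mail.com"},
]

# Assumption: an in-memory SQLite database and a hypothetical `contacts` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, hp TEXT, email TEXT)")

# Named placeholders (:name, :hp, :email) are filled from each dict,
# so every dict becomes exactly one row.
conn.executemany(
    "INSERT INTO contacts (name, hp, email) VALUES (:name, :hp, :email)",
    result,
)
conn.commit()

rows = conn.execute(
    "SELECT name, hp, email FROM contacts ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

Using placeholders instead of string formatting also keeps the scraped values from being interpreted as SQL.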
alecxe
  • This is definitely better. When I wrote my answer, the OP had not included the info divs, which is why my BS4 code is so odd. If you check the OP's edit history, when I wrote my answer it was just six `<td>` elements. – Keatinge May 20 '16 at 16:15