<div class="members_box_second">
                    <div class="members_box0">
                        <p>1</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Jagadhesan.S</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>
                        <p class="clear"><b>CODISSIA - Designation:</b><span>(Founder President, CODISSIA)</span></p>
                        <p class="clear"><b>Name of the Industry:</b><span>Govardhana Engineering Industries</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Ukkadam South</p>
                        <p class="clear"><b>Phone:</b><span>2320085, 2320067</span></p>
                        <p class="clear"><b>Email:</b><span><a href="mailto:jagadhesan@infognana.com">jagadhesan@infognana.com</a></span></p>                       
                    </div>
</div>
<div class="members_box">
                    <div class="members_box0">
                        <p>2</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>

                        <p class="clear"><b>Name of the Industry:</b><span>Everest Engineering Works</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Alagar Nivas, 284 NSR Road</p>
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>      
                        <h4>Factory Address</h4>
                        Coimbatore - 641 027
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>
                    </div>
</div>

I have the above structure. From it, I am trying to scrape the text inside the divs of class members_box1 and members_box2 only.

I have the following script, which gets data from members_box1 only:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print '\n'

This is how I tried to get data from both boxes:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
eachbox2 = soup.findAll('div ', {'class':'members_box2'})
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  eachbox2 + [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print data

But I am getting the same result as I get for members_box1 alone.

UPDATE

I want the output to be like this (on a single line) for each iteration:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969, "Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861

But I am getting the following:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969
"Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861
Venkateshwaran Selvaraj

2 Answers

The problem is that you're adding eachbox2 to each data, instead of to the list of things to loop over.

On top of that, you've got a stray space, 'div ' instead of 'div', that causes eachbox2 to be an empty list.

Try this:

eachbox1 = soup.findAll('div', {'class':'members_box1'})
eachbox2 = soup.findAll('div', {'class':'members_box2'})
for eachuniversity in eachbox1 + eachbox2:
    data =  [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]

This isn't really the best way to do things; it's just the simplest fix for your existing approach. BeautifulSoup offers several ways to search for multiple things in one query: e.g., you can search based on a list of values (['members_box1', 'members_box2']), a regexp (re.compile(r'members_box[12]')), or a filter function (lambda c: c in ('members_box1', 'members_box2')).
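
As a rough illustration of those single-query options, here is a minimal sketch. It assumes Python 3 with a recent bs4 and uses a tiny made-up HTML snippet in place of the real page; the `class_` keyword and the `class_names` helper are choices made for this example, not something taken from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for one member entry on the real page.
html = """
<div class="members_box">
  <div class="members_box0"><p>1</p></div>
  <div class="members_box1"><p><b>Name:</b><span>Mr.A</span></p></div>
  <div class="members_box2"><p><b>Phone:</b><span>123</span></p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def class_names(divs):
    """Return the first class of each tag, for easy comparison."""
    return [div["class"][0] for div in divs]


# 1. A list of acceptable class values.
by_list = soup.find_all("div", class_=["members_box1", "members_box2"])

# 2. A compiled regular expression.
by_regex = soup.find_all("div", class_=re.compile(r"members_box[12]"))

# 3. A filter function called with a class value.
by_func = soup.find_all(
    "div", class_=lambda c: c in ("members_box1", "members_box2")
)

# The stray space from the question: no tag is named 'div ', so nothing matches.
with_space = soup.find_all("div ")

print(class_names(by_list))
print(class_names(by_regex))
print(class_names(by_func))
print(len(with_space))
```

Each of the three searches returns the matching divs in document order, so a members_box1 is immediately followed by the members_box2 of the same entry.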

abarnert

You could use regex to match either members_box1 or members_box2:

import re
eachbox = soup.findAll('div', {'class':re.compile(r'members_box[12]')})
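for eachuniversity in eachbox:
    data = [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print(', '.join(data))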

For example,

import bs4 as bs
import urllib2
import re
import csv

page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
content = page.read()
soup = bs.BeautifulSoup(content)

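# Python 2: the csv module wants the file opened in binary mode
with open('/tmp/ccc.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    eachbox = soup.find_all('div', {'class':re.compile(r'members_box[12]')})
    # Pair each members_box1 with the members_box2 that follows it
    for pair in zip(*[iter(eachbox)]*2):
        # stripped_strings already yields whitespace-stripped text;
        # Python 2's csv module expects byte strings, hence the encode
        writer.writerow([text.encode('utf8') for item in pair for text in item.stripped_strings])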

Note that you must remove the stray space after div in

soup.findAll('div ')

in order to find any <div> tags.


The code above uses the very handy grouper idiom:

zip(*[iter(iterable)]*n)

This expression collects n items at a time from iterable and groups each batch into a tuple, so it lets you iterate over chunks of n items. I've made a poor attempt to explain how the grouper idiom works here.
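
To make the idiom less mysterious, here is a small standalone sketch, assuming Python 3. The `grouper` function name and the `boxes` sample list are hypothetical and exist only for illustration.

```python
def grouper(iterable, n):
    """Return an iterator of n-tuples; a leftover partial group is dropped."""
    it = iter(iterable)
    # [it] * n is a list holding the SAME iterator n times, so zip pulls
    # n consecutive items from it to build each tuple.
    return zip(*[it] * n)


boxes = ["box1-a", "box2-a", "box1-b", "box2-b", "box1-c"]
pairs = list(grouper(boxes, 2))
print(pairs)
```

The trailing "box1-c" has no partner and is silently discarded. For the scraper this means the pairing is only right if every members_box1 is followed by exactly one members_box2; a missing box would shift every later pair.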

unutbu
  • Now how do I remove the [ ] at the start and the end as well? – Venkateshwaran Selvaraj Nov 14 '13 at 12:53
  • The brackets are printed because `data` is a list. If you wish to print the contents of `data` joined together with a comma (for example), you could use `print(', '.join(data))`. (I've edited the code above to show what I mean). – unutbu Nov 14 '13 at 12:59
  • It is working, but one last thing. I want the data to be in a single line for every for loop. – Venkateshwaran Selvaraj Nov 14 '13 at 13:24
  • Your console or text editor may be wrapping the text on to multiple lines, but Python is printing a single line for every iteration of the loop. – unutbu Nov 14 '13 at 13:29
  • No. I am writing it into a csv file like this `python file.py > ccc.csv` It is coming in two lines for a single iteration – Venkateshwaran Selvaraj Nov 14 '13 at 13:31
  • Hm, running the code I posted above, I am not able to reproduce the problem. I am seeing only one line for every iteration. Nevertheless, since you are making a csv file, it is better to use the `csv` module, since it will handle text with quotes and commas properly (unlike `', '.join(data)`). I'll edit the post above to show how. – unutbu Nov 14 '13 at 14:01
  • I am still getting it in two lines. Is it because I am using CYGWIN and windows? – Venkateshwaran Selvaraj Nov 14 '13 at 14:10
  • Yes, it probably has to do with my inexperience with Windows. Try omitting `lineterminator='\n'` or try using `lineterminator='\r\n'`. – unutbu Nov 14 '13 at 14:31
  • Please post the `repr` of the output you are getting. – unutbu Nov 14 '13 at 17:04
  • Oh, the problem is due to terminology. Each iteration of the loop is finding one `div` tag (of either the `members_box1` or `members_box2` class). The code was printing the text associated with that `<div>`. What your update shows is the text from **two** `<div>`s joined into one line. You can do that using the `grouper` idiom. I've updated my post to show how. – unutbu Nov 14 '13 at 17:16
  • It worked. Thanks a gazillion. Where in this world you people live.? – Venkateshwaran Selvaraj Nov 14 '13 at 17:32
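
The point raised in the comments about preferring the `csv` module over `', '.join(data)` can be sketched as follows. This is a minimal illustration, assuming Python 3 and an in-memory buffer instead of a real file; the `row` values are taken from the sample output in the question.

```python
import csv
import io

# One record whose fields contain commas and double quotes.
row = [
    "Name:", "Mr.Srinivasan.N",
    "CODISSIA - Designation:", "(Past President, CODISSIA)",
    '"Parijaat" 26/1Shanker Mutt Road, Basavana Gudi',
]

# Naive join: embedded commas look exactly like field separators.
joined = ", ".join(row)

# csv.writer quotes fields that need it and ends the record with lineterminator.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
written = buf.getvalue()

# Reading the record back recovers the original fields.
parsed = next(csv.reader(io.StringIO(written)))
print(parsed == row)
```

Splitting `joined` on ", " yields more pieces than there are fields, whereas the `csv` round trip returns the original five fields unchanged.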