
I wrote almost complete code for my task, but there is one problem with data storage. When I run only a single page my data is fine, but when I try to run 20 pages and store the data in CSV format, I get a formatting error. Please have a look at my code and guide me on how to fix it. Thanks.

here is my code:

import requests
from bs4 import BeautifulSoup
#import pandas as pd
import csv

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
        return None
    return BeautifulSoup(response.text, 'html.parser')  # args: 1. html, 2. parser

def get_detail_page(soup):
    try:
        title = soup.find('h1', class_="cdm_style", id=False).text
    except AttributeError:
        title = 'Empty Title'
    try:
        collection = soup.find('td', id="metadata_collec").find('a').text
    except AttributeError:
        collection = "Empty Collection"
    try:
        author = soup.find('td', id="metadata_creato").text
    except AttributeError:
        author = "Empty Author"
    try:
        abstract = soup.find('td', id="metadata_descri").text
    except AttributeError:
        abstract = "Empty Abstract"
    try:
        keywords = soup.find('td', id="metadata_keywor").text
    except AttributeError:
        keywords = "Empty Keywords"
    try:
        publishers = soup.find('td', id="metadata_publis").text
    except AttributeError:
        publishers = "Empty Publishers"
    try:
        date_original = soup.find('td', id="metadata_contri").text
    except AttributeError:
        date_original = "Empty Date original"
    try:
        date_digital = soup.find('td', id="metadata_date").text
    except AttributeError:
        date_digital = "Empty Date digital"
    try:
        formatt = soup.find('td', id="metadata_source").text
    except AttributeError:
        formatt = "Empty Format"
    try:
        release_statement = soup.find('td', id="metadata_rights").text
    except AttributeError:
        release_statement = "Empty Release Statement"
    try:
        library = soup.find('td', id="metadata_librar").text
    except AttributeError:
        library = "Empty Library"
    try:
        date_created = soup.find('td', id="metadata_dmcreated").text
    except AttributeError:
        date_created = "Empty Date Created"
    data = {
        'Title'        : title,
        'Collection'   : collection,
        'Author'       : author,
        'Abstract'     : abstract,
        'Keywords'     : keywords,
        'Publishers'   : publishers,
        'Date_original': date_original,
        'Date_digital' : date_digital,
        'Format'       : formatt,
        'Release-st'   : release_statement,
        'Library'      : library,
        'Date_created' : date_created
    }
    return data
def get_index_data(soup):
    titles_link_output = []  # initialised up front so it is always defined
    try:
        titles_link = soup.find_all('a', class_="body_link_11")
    except AttributeError:
        titles_link = []
    for link in titles_link:
        item_id = link.attrs.get('item_id', None)  # all titles with valid links have an item_id
        if item_id:
            titles_link_output.append("{}{}".format("http://cgsc.cdmhost.com", link.attrs.get('href', None)))
    return titles_link_output
def write_csv(data, url):
    # newline='' stops the csv module from writing blank rows on Windows
    with open('123.csv', 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        row = [data['Title'], data['Collection'], data['Author'],
               data['Abstract'], data['Keywords'], data['Publishers'],
               data['Date_original'], data['Date_digital'], data['Format'],
               data['Release-st'], data['Library'], data['Date_created'], url]
        writer.writerow(row)
def main():
    #url = "http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1"
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    #get_page(url)
    products = get_index_data(get_page(mainurl))
    for product in products:
        data = get_detail_page(get_page(product))
        write_csv(data,product)
    #write_csv(data,url)


if __name__ == '__main__':
    main()
M.Akram
  • I have run your code; maybe there are white spaces in the data, right? You can use the `strip()` method, which removes white space from text. – Manali Kagathara Mar 16 '20 at 13:23
  • "then I'm getting error with format" - you'll need to be more specific. Are you getting an exception, or is the output not what you expected? If so, you'll need to show what you are currently getting and what you were expecting instead. –  Mar 16 '20 at 13:34
  • Could you please modify my code? Sir, please help me. – M.Akram Mar 16 '20 at 13:34
  • Could you please run my code and check the output? Actually, there is no option to add a screenshot of my output. – M.Akram Mar 16 '20 at 13:57
  • I need the data in a proper format. – M.Akram Mar 16 '20 at 13:58

1 Answer


As indicated in the comments, the text retrieved from the website appears to contain surrounding whitespace. You can remove it with the `strip()` method, which trims leading and trailing whitespace (including newlines and tabs). This can be done when you construct your data dictionary, i.e.:

data = {
    'Title': title.strip(),
    'Collection': collection.strip(),
    'Author': author.strip(),
    'Abstract': abstract.strip(),
    'Keywords': keywords.strip(),
    'Publishers': publishers.strip(),
    'Date_original': date_original.strip(),
    'Date_digital': date_digital.strip(),
    'Format': formatt.strip(),
    'Release-st': release_statement.strip(),
    'Library': library.strip(),
    'Date_created': date_created.strip()
}
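As a quick illustration of what `strip()` does to scraped fields (the sample strings below are made up for the demo, not actual output from the site):

```python
# str.strip() removes leading and trailing whitespace,
# including the newlines and tabs that often surround
# text extracted from HTML table cells.
scraped_title = "\n      Some Thesis Title   \n"   # hypothetical sample
scraped_author = "\t M. Akram \n"                  # hypothetical sample

print(repr(scraped_title.strip()))   # 'Some Thesis Title'
print(repr(scraped_author.strip()))  # 'M. Akram'
```

Note that `strip()` only trims the ends of the string; any whitespace in the middle of the text is left alone.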
dspencer
  • Thank you, man! It works for me, but if you don't mind, may I ask one thing more? Everything is OK now, but my code is not going to the next page for pagination. – M.Akram Mar 16 '20 at 16:36
  • Glad I could help - please upvote and accept the answer if I have answered your question. You would need to configure another URL for that, or attempt to extract the "Next page" link from the HTML response. That's really a topic for a further question if you can't find a solution by doing some research. – dspencer Mar 16 '20 at 16:37
  • sure, here is my new question link: https://stackoverflow.com/questions/60710134/my-script-are-not-going-to-next-page-for-scraping – M.Akram Mar 16 '20 at 16:59
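For anyone landing here with the same pagination question: since the search URL in `main()` ends in `/page/1`, one common approach is to generate the page URLs yourself rather than scrape a "Next page" link. A minimal sketch (the page count of 20 comes from the question; the URL pattern is an assumption based on the `mainurl` above, not verified against the site):

```python
# Template for the search-results URL; only the trailing page number varies.
base = ("http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8"
        "/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/{}")

# Build one search URL per results page; a loop body would then call
# get_index_data(get_page(url)) for each, exactly as in main() above.
page_urls = [base.format(page) for page in range(1, 21)]

print(page_urls[0])   # ends with /page/1
print(page_urls[-1])  # ends with /page/20
```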