
I am working on web scraping: I want to scrape data through pagination and export it to a CSV file. When writing to the CSV file, I need to concatenate the first page's data with that of the other pages, and the column names should not be repeated. My issue is that every time I write the concatenated data, the first page's data is repeated.

df = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
print(df)
page_count = 1
while True:

    try:
        page_count += 1

        driver.find_element_by_link_text(str(page_count)).click()
        time.sleep(3)
        links = {i.find_element_by_tag_name('img').get_attribute('alt'): i.get_attribute('href')
                 for i in driver.find_elements_by_css_selector(
                     '[data-tracking="product-card"] div:nth-child(1) > [href*=search]')}
        df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
        #print(df1.head())
        df2 = pd.concat([df, df1]).drop_duplicates().reset_index(drop=True)
        print(df2.head())
        with open("Type1.csv", 'a', encoding="utf-8") as f:
            df2.to_csv(f, index=False)
            f.close()
    except NoSuchElementException:
        break

1 Answer


What you’re doing for each page is basically these steps:

  • get links in df1
  • compute df2, which is df (links from the first page) + df1 (links from current page)
  • write df2 to a csv file
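To see why the first page keeps reappearing, here is a toy run of those steps with hypothetical placeholder pages (an in-memory buffer stands in for the CSV file opened in append mode):

```python
import io
import pandas as pd

df = pd.DataFrame({'Product name': ['p1'], 'Detail': ['d1']})  # first page
pages = [{'p2': 'd2'}, {'p3': 'd3'}]                           # later pages

buf = io.StringIO()  # stands in for Type1.csv opened in 'a' mode
for links in pages:
    df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
    # each iteration re-concatenates the FIRST page with the current one...
    df2 = pd.concat([df, df1]).drop_duplicates().reset_index(drop=True)
    # ...and appends the whole thing, header included, to the file
    df2.to_csv(buf, index=False)

print(buf.getvalue())  # 'p1,d1' and the header line both appear twice
```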

So the program is doing what you ask it to do by repeating the first page every time. If you want to avoid that, you should either:

  1. concatenate all pages together, then write the dataframe containing all pages’ links to csv
  2. append each page's links to the csv, skipping the header row.

Going with option 2, that means you can replace:

       df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
       #print(df1.head())
       df2 = pd.concat([df, df1]).drop_duplicates().reset_index(drop=True)
       print(df2.head())
       with open("Type1.csv", 'a', encoding="utf-8") as f:
            df2.to_csv(f, index=False)
            f.close()

With:

       df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
       with open("Type1.csv",'a',encoding="utf-8") as f:
            df1.to_csv(f, index=False, header=False)

Note that you do not need to close the file f explicitly; the with statement already does that for you. header=False skips the repeated header row.
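As a minimal, self-contained sketch of that behaviour (using a throwaway demo.csv and made-up rows in place of the scraped data):

```python
import pandas as pd

# hypothetical stand-ins for the first and a later page
page1 = pd.DataFrame({'Product name': ['a'], 'Detail': ['x']})
page2 = pd.DataFrame({'Product name': ['b'], 'Detail': ['y']})

# first page: write mode, header included
with open("demo.csv", 'w', encoding="utf-8") as f:
    page1.to_csv(f, index=False)

# later pages: append mode, header=False so column names are not repeated
with open("demo.csv", 'a', encoding="utf-8") as f:
    page2.to_csv(f, index=False, header=False)
```

Note that to_csv also accepts a file path and a mode argument directly, so `page2.to_csv("demo.csv", mode='a', index=False, header=False)` achieves the same thing without an explicit open().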

Of course you also need to start by writing the first page to the file, before the while True: loop, this time including the header as it is the first page:

with open("Type1.csv",'w',encoding="utf-8") as f:
    df.to_csv(f, index=False)

An alternative and maybe less confusing way is to open the file only once and keep it open:

with open("Type1.csv",'w',encoding="utf-8") as f:
    # code here
    df = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
    df.to_csv(f, index=False)

    while True:
        try:
            # code here
            df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
            df1.to_csv(f, index=False, header=False)
        except NoSuchElementException:
            break 

# Here your file is now closed and contains the links from all pages

Option 1 would be as follows, concatenating the dataframes together and only needing to open the file once:

# code here
df = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])

while True:
    try:
        # code here
        df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
        df = pd.concat([df, df1], ignore_index=True)
    except NoSuchElementException:
        break 

with open("Type1.csv",'w',encoding="utf-8") as f:
    df.to_csv(f, index=False)
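Since concatenating inside the loop copies the growing dataframe on every iteration (and `DataFrame.append` is gone from recent pandas versions), a common variant of option 1 is to collect each page's dataframe in a list and concatenate once at the end. A sketch with hypothetical placeholder data in place of the scraping code:

```python
import pandas as pd

# hypothetical per-page link dicts standing in for the scraped pages
pages = [
    {'prod-a': 'https://example.com/a'},
    {'prod-b': 'https://example.com/b'},
]

frames = []
for links in pages:  # in the real script, this is the pagination loop
    frames.append(pd.DataFrame(links.items(),
                               columns=['Product name', 'Detail']))

# one concatenation at the end, then a single write with one header row
df = pd.concat(frames, ignore_index=True)
df.to_csv("Type1.csv", index=False)
```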