What you’re doing for each page is basically these steps:
- get links in df1
- compute df2, which is df (links from the first page) + df1 (links from current page)
- write df2 to a csv file
So the program is doing what you ask it to do by repeating the first page every time. If you want to avoid that, you should either:
- concatenate all pages together, then write the dataframe containing all pages’ links to csv
- write each pages’ links appending to the csv, skipping the header row.
Going with option 2 that means that you can replace:
df1 =pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
#print(df1.head())
df2 = pd.concat([df,df1]).drop_duplicates().reset_index(drop=True)
print(df2.head())
with open("Type1.csv",'a',encoding="utf-8") as f:
df2.to_csv(f , index = False)
f.close()
With:
df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
with open("Type1.csv",'a',encoding="utf-8") as f:
df1.to_csv(f, index=False, header=False)
Note that you do not need to close the file f
explicitely, the with
statement already does that for you. header=False
allows to skip the repeating header rows.
Of course you also need to start by writing the first page to the file, before the while True:
loop, this time including the header as it is the first page:
with open("Type1.csv",'w',encoding="utf-8") as f:
df.to_csv(f, index=False)
An alternate and maybe less confusing way is to open the file only once and keep it open:
with open("Type1.csv",'w',encoding="utf-8") as f:
# code here
df = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
df.to_csv(f, index=False)
while True:
try:
# code here
df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
df1.to_csv(f, index=False, header=False)
except NoSuchElementException:
break
# Here your file is now closed and contains the links from all pages
with open("Type1.csv",'w',encoding="utf-8") as f:
# code here
df = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
df.to_csv(f, index=False)
while True:
try:
# code here
df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
df1.to_csv(f, index=False, header=False)
except NoSuchElementException:
break
# Here your file is now closed and contains the links from all pages
The way 1 named “1.” would be as follows, concatenating dataframes together and only needing to open the file once:
# code here
df = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
while True:
try:
# code here
df1 = pd.DataFrame(links.items(), columns=['Product name', 'Detail'])
df = df.append(df1, ignore_index=True)
except NoSuchElementException:
break
with open("Type1.csv",'w',encoding="utf-8") as f:
df.to_csv(f, index=False)