I am attempting to scrape unstructured data from multiple URLs on a website. I used BeautifulSoup to successfully pull out the chunks that I needed. Then, to help structure the dataset, I added the values to a list before writing them to a CSV file.
When attempting to transfer the data, however, only the last value in the list is written. I figure this is because the list is rebuilt on every pass through the loop. How can I keep adding new values to the file so that my CSV file ends up with the values from every loop iteration? Thank you.
    for i in range(1, 3):
        url = "https://website.com/webid={}".format(i)
        s = session.get(url, headers=headers, cookies=cookies)
        soup = bs(s.text, 'html.parser')
        t = soup.find_all('td')
        a = t[0]
        b = t[1]
        c = t[2]
        info = [a, b, c]
        print(info)
        df = pd.DataFrame(info)
        df.to_csv('a.csv', index=False, header=False)
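Update, for anyone who lands here with the same symptom: to_csv defaults to mode='w', so the call above recreates a.csv on every pass through the loop and each write clobbers the previous one. Two minimal sketches of the fix, assuming the same session, headers, cookies, and td extraction as above (the .text calls are my own addition so the cells land as strings rather than bs4 Tag objects):

    # Fix 1: collect every page's values first, then write the file once
    rows = []
    for i in range(1, 3):
        s = session.get("https://website.com/webid={}".format(i),
                        headers=headers, cookies=cookies)
        t = bs(s.text, 'html.parser').find_all('td')
        rows.append([t[0].text, t[1].text, t[2].text])
    pd.DataFrame(rows).to_csv('a.csv', index=False, header=False)

    # Fix 2: keep the write inside the loop, but append instead of overwrite
    df.to_csv('a.csv', mode='a', index=False, header=False)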
In response to comments and additional answers:
If my original code block was unclear, I apologize; I was attempting to produce the minimum code necessary to explain my circumstances. Luckily @Matt_F understood and guided me in the right direction. For those who would like a more explicit picture of the code I was running, please see my full code block below (without imports, cookies, headers, and payload).
    session = requests.Session()
    s = session.post("https://WEBSITE.com/register?view=login&return=aW5kZXgucGhwP0l0ZW1pZD02NjM",
                     data=payload, headers=headers, cookies=cookies)

    for i in range(0, 9999):
        print(i)
        # establish connection; start the timer before the request so the
        # delay below reflects how long the server actually took
        url = "https://WEBSITE.com/WEB-SITE/data-list?vw=detail&id={}&return=1".format(i)
        t = time.time()
        s = session.get(url, headers=headers, cookies=cookies)
        # sleep for 10x the round-trip time before the next request
        delay = time.time() - t
        time.sleep(10 * delay)
        # begin to pull data
        soup = bs(s.text, 'html.parser')
        if "Error: no data found" in s.text:
            print('skipped')
        else:
            # print(soup.prettify())
            d = soup.find_all('td', {"valign": "top"})
            d_info = d[0:-1]
            print(d_info)
            df1 = pd.DataFrame(d_info)
            df1t = df1.T
            # p = soup.find_all('p')
            # p_info = p[0:-1]
            # df2 = pd.DataFrame(p_info)
            # df2t = df2.T
            # result = pd.concat([df1t, df2t], axis=1, sort=False)
            df1t.to_csv('file.csv', mode='a', index=False, header=False)
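One more note for completeness: mode='a' solved my problem, but appending row by row means an interrupted run leaves a partially written file. An alternative sketch under the same assumptions (session, headers, cookies, and the elided imports) is to collect one frame per page and write the CSV once at the end with pd.concat; the frames list below is my own illustration, not part of the code above:

    # accumulate one single-row frame per page instead of appending to disk
    frames = []
    for i in range(0, 9999):
        url = "https://WEBSITE.com/WEB-SITE/data-list?vw=detail&id={}&return=1".format(i)
        t = time.time()
        s = session.get(url, headers=headers, cookies=cookies)
        time.sleep(10 * (time.time() - t))  # same 10x round-trip delay as above
        if "Error: no data found" in s.text:
            continue
        soup = bs(s.text, 'html.parser')
        d = soup.find_all('td', {"valign": "top"})
        frames.append(pd.DataFrame(d[0:-1]).T)  # one row per page
    if frames:
        pd.concat(frames, sort=False).to_csv('file.csv', index=False, header=False)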