
I am attempting to scrape unstructured data from multiple URLs on a website. I used BeautifulSoup to successfully pull out the chunks that I needed. Then, to help structure the dataset I added the values to a list before writing them to a csv file.

When attempting to transfer the data, however, only the last value in the list is transferred. I figure this is because the list gets new values every time the loop runs. How can I keep adding new values to the file so that my csv file has values from each loop? Thank you.

for i in range(1, 3):
    url = "https://website.com/webid={}".format(i)
    s = session.get(url, headers=headers, cookies=cookies)
    soup = bs(s.text, 'html.parser')
    t = soup.find_all('td')
    a = t[0]
    b = t[1]
    c = t[2]
    info = [a, b, c]
    print(info)

df = pd.DataFrame(info)
df.to_csv('a.csv', index=False, header=False)

In response to comments and additional answers:

If my original code block was unclear, I apologize; I was attempting to produce the minimum code necessary to explain my circumstances. Luckily @Matt_F was able to understand and guide me in the right direction. For those who would like a more explicit explanation of the code I was running, please see below for my full code block (without imports, cookies, headers, and payload).

session = requests.Session()
s = session.post("https://WEBSITE.com/register?view=login&return=aW5kZXgucGhwP0l0ZW1pZD02NjM", data=payload, headers=headers, cookies=cookies)

for i in range(0,9999):
    print(i)
    # establish connection
    url = "https://WEBSITE.com/WEB-SITE/data-list?vw=detail&id={}&return=1".format(i)
    s = session.get(url, headers=headers, cookies=cookies)
    # set timer for delay
    t = time.time()
    delay = time.time() - t
    time.sleep(10*delay)
    # begin to pull data
    soup = bs(s.text, 'html.parser')
    if "Error: no data found" in s.text:
        print('skipped')
    else:
        soup.prettify()
        # print(soup)
        d = soup.find_all('td',{"valign": "top"})
        d_info = d[0:-1] 
        print(d_info)
        df1 = pd.DataFrame(d_info)
        df1t = df1.T
    
        # p = soup.find_all('p')
        # p_info = p[0:-1]
        # df2 = pd.DataFrame(p_info)
        # df2t = df2.T
    
        # result = pd.concat([df1t, df2t], axis=1, sort=False)
        df1t.to_csv('file.csv', mode='a', index=False, header=False)  
  • where are you declaring `info`? – Aziz Sonawalla Jun 04 '20 at 00:13
  • hi there dear Bjørn_Jung - many thanks for this great example; I am currently diving into all things Python, csv and pandas. I like your example. Could you provide a URL so that we can run this great and crystal-clear demo code? That would be fantastic. Love to hear from you. Greetings ;) – zero Jul 03 '20 at 12:28

2 Answers


I believe your issue is that you are opening your csv file in write mode, which is the default. You should open it in append mode by passing the 'a' value:

df.to_csv('a.csv', mode='a', index=False, header=False)

see this thread
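
A minimal sketch of how that could look with the loop from the question (the URL, headers, and cookies are placeholders rather than real values; the key change is calling to_csv with mode='a' inside the loop so every page appends its own row):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

session = requests.Session()
headers, cookies = {}, {}  # placeholders for the real request settings

for i in range(1, 3):
    # placeholder URL taken from the question
    url = "https://website.com/webid={}".format(i)
    s = session.get(url, headers=headers, cookies=cookies)
    soup = bs(s.text, 'html.parser')
    t = soup.find_all('td')
    info = [cell.text for cell in t[:3]]  # first three td values as plain text
    # append this page's row; mode='a' keeps earlier rows instead of overwriting them
    pd.DataFrame([info]).to_csv('a.csv', mode='a', index=False, header=False)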

Matt_F
  • many thanks dear Matt for providing your great answer - it is very helpful. I am glad to see this thread! – zero Jul 03 '20 at 12:29

On a sidenote, code like this:

a = t[0]
b = t[1]
c = t[2]
d = t[3]
e = t[4]

Code like this is pretty hard to read, at least for me as a Python beginner.

I have mused about the design: shouldn't we use data structures to represent your data? You're assigning elements from a list to individual names and then building a new list from those names.

So I guess your data is 2-dimensional: the first dimension is the index (rows) and the second dimension is the columns (the td data).

I have learned that we have to create an empty list, which later becomes your whole dataset. For each tag you need its text or an attribute; putting a whole tag object into pandas will not work.

td_results = []
for i in range(1, 100):
    url = "https://my-website.com/webid={}".format(i)
    s = session.get(url, headers=headers, cookies=cookies)

    soup = bs(s.text, 'html.parser')
    # collect the text of every td on this page as one row
    td_results.append([column.text for column in soup.find_all('td')])  # <- this here is the critical part
    # the page could contain the data or not,
    # and the number of td elements can differ from page to page

print(td_results)
df = pd.DataFrame(td_results)

So if you know that all pages have the same structure and that you need, for example, the first 10 elements, then you can slice the result.

Example to get the first 10 elements:

td_results.append([column.text for column in soup.find_all('td')[:10]])
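
Once td_results holds one row per page, the DataFrame only needs to be built and written a single time after the loop, so no append mode is needed (a small sketch; 'file.csv' is just an example filename):

# build the frame once from all collected rows and write the csv in one call
df = pd.DataFrame(td_results)
df.to_csv('file.csv', index=False, header=False)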

What do you think about these musings? I look forward to hearing from you!

zero