I'm extracting reviews and other information from a website, and I want to put them in an Excel file while keeping the information structured.

import requests
import urllib.request
import time 
from bs4 import BeautifulSoup

url = 'website'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")


for statements in soup.findAll("h3",{'class' : "delta weight-bold half-margin-bottom"}):
    print(statements.text)

for names in soup.findAll("div",{'class': "epsilon weight-bold inline-block"}): 
    print(names.text)

for used_software in soup.findAll("span",{'class' : "weight-semibold"}):
    print(used_software.text, used_software.next_sibling)

yce
  • What do you mean by structured? Can you provide an example of how the Excel file should look? – Nothing Jul 24 '19 at 16:08
  • I would like to have, for example, Statements as the name of the column and the text that I receive as a result would be underneath this column. Then the same thing for the name and the used_software. Is it clear? – yce Jul 24 '19 at 16:33
  • Possible duplicate of [Writing to an Excel spreadsheet](https://stackoverflow.com/questions/13437727/writing-to-an-excel-spreadsheet) – esqew Jul 24 '19 at 16:38
  • I don't think it's the same case @esqew – yce Jul 24 '19 at 16:56

2 Answers

You can use pandas (this is written for Python 3; minor changes are needed for Python 2):

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.capterra.com/p/104588/RecTrac/#reviews'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")


statements = [
    x.text.strip() for x in soup.findAll("h3", {'class': "delta weight-bold half-margin-bottom"})
]
print(statements)

names = [x.text.strip() for x in soup.findAll("div", {'class': "epsilon weight-bold inline-block"})]
print(names)


used_software = [x.text.strip() for x in soup.findAll("span", {'class': "weight-semibold"})]
used_software_sibling = [x.next_sibling for x in soup.findAll("span", {'class': "weight-semibold"})]
print(used_software)
print(used_software_sibling)

d = {
    'statements': statements,
    'names': names,
    'used_software': used_software,
    'sw_sibling': used_software_sibling,
}

# Wrapping each list in a Series lets the columns have different lengths;
# shorter columns are padded with NaN.
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
print(df)

df.to_csv('/tmp/out.csv', index=False)

The final print statement (print(df)) will show:

                                           statements              names           used_software    sw_sibling
0               RecTrac is so close to being awesome!  Verified Reviewer  Used the software for:   6-12 months
1   Powerful software, but a steep learning curve ...  Verified Reviewer                 Source:      Capterra
2      Using this program for the last five years....         Michael B.  Used the software for:     1-2 years
3   User-friendly membership management system--ea...  Verified Reviewer                 Source:      Capterra
4                                     Robust Software  Verified Reviewer  Used the software for:      2+ years
5   Very useful product, but could be more user fr...        Kimberli D.                 Source:      Capterra
6             Customer Service is great to work with.            Brad B.  Used the software for:      2+ years
7                                                 NaN                NaN                 Source:      Capterra
8                                                 NaN                NaN  Used the software for:      2+ years
9                                                 NaN                NaN                 Source:      Capterra
10                                                NaN                NaN  Used the software for:      2+ years
11                                                NaN                NaN                 Source:      Capterra
12                                                NaN                NaN  Used the software for:      2+ years
13                                                NaN                NaN                 Source:      Capterra

And the .csv will show:

$ cat /tmp/out.csv 
statements,names,used_software,sw_sibling
RecTrac is so close to being awesome!,Verified Reviewer,Used the software for:, 6-12 months
"Powerful software, but a steep learning curve when coming from other systems",Verified Reviewer,Source:, Capterra
Using this program for the last five years....,Michael B.,Used the software for:, 1-2 years
User-friendly membership management system--easy to learn and use,Verified Reviewer,Source:, Capterra
Robust Software,Verified Reviewer,Used the software for:, 2+ years
"Very useful product, but could be more user friendly.",Kimberli D.,Source:, Capterra
Customer Service is great to work with.,Brad B.,Used the software for:, 2+ years
,,Source:, Capterra
,,Used the software for:, 2+ years
,,Source:, Capterra
,,Used the software for:, 2+ years
,,Source:, Capterra
,,Used the software for:, 2+ years
,,Source:, Capterra
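
The `pd.Series` wrapping in the `DataFrame` line is what allows the four lists to have different lengths: the shorter columns are padded with NaN. As a rough illustration of the same padding idea without pandas, here is a minimal standard-library sketch using `itertools.zip_longest` and `csv`. The function name `columns_to_csv` and the sample data are hypothetical and not part of the code above.

```python
import csv
import io
from itertools import zip_longest


def columns_to_csv(columns):
    """Return CSV text for a dict of column name -> list, padding short columns with ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())
    # zip_longest walks the lists in parallel and fills the gaps with fillvalue
    for row in zip_longest(*columns.values(), fillvalue=""):
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample data with columns of unequal length
sample = {
    "statements": ["Robust Software"],
    "names": ["Verified Reviewer"],
    "used_software": ["Used the software for:", "Source:"],
    "sw_sibling": ["2+ years", "Capterra"],
}
csv_text = columns_to_csv(sample)
print(csv_text)
```

The second data row starts with two empty fields, which is the same shape as the trailing rows in the .csv shown above.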

Here is an update in response to the OP's example in the comments (that's how much I love you, @y.emond):

This is a quick and dirty method to get the output you want; there may be better methods.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.capterra.com/p/104588/RecTrac/#reviews'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")


def add_skips(lst):
    # Insert a NaN after every item so that each review spans two rows,
    # one for each used_software entry ("Used the software for:" and "Source:").
    skipped = []
    for item in lst:
        skipped.append(item)
        skipped.append(float('nan'))
    return skipped


statements = [
    x.text.strip() for x in soup.findAll("h3", {'class': "delta weight-bold half-margin-bottom"})
]
statements = add_skips(statements)

names = [x.text.strip() for x in soup.findAll("div", {'class': "epsilon weight-bold inline-block"})]
names = add_skips(names)

used_software = [x.text.strip() for x in soup.findAll("span", {'class': "weight-semibold"})]
used_software_sibling = [x.next_sibling for x in soup.findAll("span", {'class': "weight-semibold"})]

d = {
    'statements': statements,
    'names': names,
    'used_software': used_software,
    'sw_sibling': used_software_sibling,
}

# Wrapping each list in a Series lets the columns have different lengths;
# shorter columns are padded with NaN.
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
print(df)

df.to_csv('/tmp/out.csv', index=False)

The output:

                                           statements              names           used_software    sw_sibling
0               RecTrac is so close to being awesome!  Verified Reviewer  Used the software for:   6-12 months
1                                                 NaN                NaN                 Source:      Capterra
2   Powerful software, but a steep learning curve ...  Verified Reviewer  Used the software for:     1-2 years
3                                                 NaN                NaN                 Source:      Capterra
4      Using this program for the last five years....         Michael B.  Used the software for:      2+ years
5                                                 NaN                NaN                 Source:      Capterra
6   User-friendly membership management system--ea...  Verified Reviewer  Used the software for:      2+ years
7                                                 NaN                NaN                 Source:      Capterra
8                                     Robust Software  Verified Reviewer  Used the software for:      2+ years
9                                                 NaN                NaN                 Source:      Capterra
10  Very useful product, but could be more user fr...        Kimberli D.  Used the software for:      2+ years
11                                                NaN                NaN                 Source:      Capterra
12            Customer Service is great to work with.            Brad B.  Used the software for:      2+ years
13                                                NaN                NaN                 Source:      Capterra

All NaN values show up as empty cells when the file is opened in Excel or LibreOffice.
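
If the goal is one row per review rather than a blank row in between, another option (not what the code above does) is to turn each label in `used_software` into its own column. Below is a minimal sketch, assuming every review yields its label/value pairs in the same order, starting with "Used the software for:". `pair_labels` is a hypothetical helper, and the sample lists are copied from the first three reviews in the output above.

```python
def pair_labels(labels, values):
    """Group a flat label/value sequence into one dict per review.

    A new dict is started every time the first label shows up again.
    """
    rows = []
    first = labels[0] if labels else None
    for label, value in zip(labels, values):
        if label == first:
            rows.append({})
        # Drop the trailing colon from the label and stray spaces from the value
        rows[-1][label.rstrip(":")] = value.strip()
    return rows


# Sample data standing in for the scraped lists
statements = [
    "RecTrac is so close to being awesome!",
    "Powerful software, but a steep learning curve when coming from other systems",
    "Using this program for the last five years....",
]
names = ["Verified Reviewer", "Verified Reviewer", "Michael B."]
used_software = ["Used the software for:", "Source:"] * 3
used_software_sibling = [
    " 6-12 months", " Capterra",
    " 1-2 years", " Capterra",
    " 2+ years", " Capterra",
]

records = []
extras = pair_labels(used_software, used_software_sibling)
for statement, name, extra in zip(statements, names, extras):
    records.append({"statements": statement, "names": name, **extra})

for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame(records)` should then give a frame with one row per review and separate "Used the software for" and "Source" columns.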

Perplexabot
  • My only issue with this is that when there is more than one output (i.e. used_software returns both "Used the software for" and "Source"), is there a way to make sure that it skips a line in Excel before showing the second name? I'm not sure if this is clear or not. – yce Jul 24 '19 at 18:34
  • Your question is in regards to "how" to move data to excel. We have provided you with a method to do so. How to format the excel sheet is a different question. Anyway, I'm not sure what you mean. Do you mean to, for example, skip `for` and only show `source` for the last two columns? – Perplexabot Jul 24 '19 at 18:38
  • Maybe, as @Doodle suggested, place an example of how you want the output. – Perplexabot Jul 24 '19 at 18:49
Try this (not sure how optimized it is):

# pandas builds the table and writes the file
# (assumes `soup` was created as in the question)
import pandas as pd

statement_text_list = []
names_list = []

# append data to list 
for statements in soup.findAll("h3",{'class' : "delta weight-bold half-margin-bottom"}):
    statement_text_list.append(statements.text)

# append data to list
for names in soup.findAll("div",{'class': "epsilon weight-bold inline-block"}): 
    names_list.append(names.text)

# similar code for other fields 

# create a dataframe 
dt = pd.DataFrame({'Statement':statement_text_list, 'Names': names_list })

# save to an Excel file; the .xlsx extension is needed so pandas can pick a writer engine
dt.to_excel('filename.xlsx')
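
Note that `pd.DataFrame({...})` raises a `ValueError` when the lists differ in length, which can happen if a review is missing a field. Here is a hedged sketch of one way around that, borrowing the `pd.Series` wrapping from the other answer. `build_frame`, `save_frame`, and the file name `reviews.xlsx` are hypothetical, and writing `.xlsx` assumes the `openpyxl` package is installed.

```python
import pandas as pd


def build_frame(columns):
    """Build a DataFrame from a dict of lists, padding shorter lists with NaN."""
    return pd.DataFrame({name: pd.Series(values) for name, values in columns.items()})


def save_frame(frame, path="reviews.xlsx"):
    """Write the frame to an Excel file without the numeric index column."""
    frame.to_excel(path, index=False)


# Sample data standing in for the scraped lists
frame = build_frame({
    "Statement": ["Robust Software", "Customer Service is great to work with."],
    "Names": ["Verified Reviewer"],  # deliberately one entry short
})
print(frame)
```

Calling `save_frame(frame)` would then write the table, with the missing name left as an empty cell.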

Nothing
  • Please don't post screenshots of code - it reduces the utility for the original asker and for future visitors to this question. – esqew Jul 24 '19 at 17:11
  • @esqew I was on iPad and somehow code block was not working hence uploaded the screenshot, will keep that in mind, answer updated – Nothing Jul 24 '19 at 17:14