
I'm trying to scrape information into a .csv file properly, but the code is scraping the same information many times over. I should have 31 reviews, yet the file shows 301 rows. I tried to follow the answer to this question, Data to .csv is repeating three times. I need three different scrapes exported to a csv file, but I didn't understand it. I also changed my code according to the answer to this question, Python repeating CSV file, but it doesn't work, and renaming the variables didn't help either. Could you tell me what is wrong and what I have to do to get the information properly? I'm really new to coding, so if you can explain your modifications line by line, I would appreciate it!

with requests.Session() as s:
    for offset in range(10,40):
        url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or{offset}-Le_Bouclard-Paris_Ile_de_France.html'
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        reviews = soup.select('.reviewSelector')
        ids = [review.get('data-reviewid') for review in reviews]
        r = s.post(
                'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
                data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
                headers = {'referer': r.url}
                )

        soup = bs(r.content, 'lxml')
        if not offset:
            inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
            rest_eclf = soup.select_one('.header_links a').text.strip()

        for review in reviews:
            name_client = review.select_one('.info_text > div:first-child').text.strip()
            date_rev_cl = review.select_one('.ratingDate')['title'].strip()
            titre_rev_cl = review.select_one('.noQuotes').text.strip()
            opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
            row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]
            w.writerow(row)
Nprof

2 Answers


Depending on how many ids can be posted in one request, I would issue all the GET requests that collect ids first, then make a single POST with all those ids.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

ids = []
results = []

with requests.Session() as s:
    for offset in range(10,40,10):
        url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or{offset}-Le_Bouclard-Paris_Ile_de_France.html'
        r = s.get(url)
        soup = bs(r.content, 'lxml')

        if offset == 10:
            inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
            rest_eclf = soup.select_one('.header_links a').text.strip()

        reviews = soup.select('.reviewSelector')
        ids += [review.get('data-reviewid') for review in reviews]


    r = s.post('https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=', data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
                headers = {'referer': r.url}
                )

soup = bs(r.content, 'lxml')
reviews = soup.select('.reviewSelector')

for review in reviews:
    name_client = review.select_one('.info_text > div:first-child').text.strip()
    date_rev_cl = review.select_one('.ratingDate')['title'].strip()
    titre_rev_cl = review.select_one('.noQuotes').text.strip()
    opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
    row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]
    results.append(row)

df = pd.DataFrame(results)
df.to_csv(r'C:\Users\User\data.csv', sep=',', encoding='utf-8-sig',index = False)
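If duplicate rows still slip through, they can be dropped before building the DataFrame. A minimal stdlib sketch, with made-up tuples standing in for the scraped review rows:

```python
# Hypothetical tuples standing in for scraped review rows.
rows = [
    ("Le Bouclard", "Alice", "Super"),
    ("Le Bouclard", "Alice", "Super"),  # duplicate from a re-fetched page
    ("Le Bouclard", "Bob", "Bien"),
]

# dict.fromkeys keeps insertion order (Python 3.7+) while dropping repeats.
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)  # [('Le Bouclard', 'Alice', 'Super'), ('Le Bouclard', 'Bob', 'Bien')]
```

The rows must be hashable (tuples, not lists) for this to work; once the data is already in a DataFrame, `df.drop_duplicates()` does the same job.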
QHarr
  • Hi! thank you for your response. I found the solution before your answer but I'm gonna test it. I would have to install pandas. – Nprof Aug 27 '19 at 11:54
  • Remember you can always post your solution as an answer. – QHarr Aug 27 '19 at 12:10

My loop was running 30 times, once for each number between 10 and 39. Since every offset 10-19 redirected to page 10, 20-29 to page 20, and so on, I was scraping each of those pages 10 times, getting 10 duplicates of every review. Adding the third argument (10, the step) to range makes the loop advance by tens, so each page is visited only once.
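The effect of the step argument can be checked on its own, separate from the scraping code:

```python
# Without a step, range visits every offset 10..39: 30 requests,
# with each block of ten offsets redirecting to the same page.
print(len(list(range(10, 40))))   # 30

# With a step of 10, only the distinct page offsets are visited.
print(list(range(10, 40, 10)))    # [10, 20, 30]
```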

import requests,csv
from bs4 import BeautifulSoup as bs

with open("bouclard.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ";", quoting=csv.QUOTE_MINIMAL)
    w.writerow(["inf_rest_name", "rest_eclf", "name_client", "date_rev_cl", "titre_rev_cl", "opinion_cl"])

with requests.Session() as s:
    for offset in range(10,40,10):
        url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or{offset}-Le_Bouclard-Paris_Ile_de_France.html'
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        reviews = soup.select('.reviewSelector')
        ids = [review.get('data-reviewid') for review in reviews]
        r = s.post(
                'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
                data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
                headers = {'referer': r.url}
                )

        soup = bs(r.content, 'lxml')
        if not offset:
            inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
            rest_eclf = soup.select_one('.header_links a').text.strip()

        for review in soup.select('.reviewSelector'):
            name_client = review.select_one('.info_text > div:first-child').text.strip()
            date_rev_cl = review.select_one('.ratingDate')['title'].strip()
            titre_rev_cl = review.select_one('.noQuotes').text.strip()
            opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
            row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]
            print(row)
Nprof
  • This code will still error, I'm afraid, for the reason I gave in an answer to one of your prior questions: the `if not offset` test. – QHarr Aug 27 '19 at 18:56
  • Look, it works. I forgot to paste a part of the code. It was working for me. But as you helped me a lot, I accept your answer: I could do the last two scrapes thanks to your suggestions, and your code works. – Nprof Aug 27 '19 at 20:35
  • Are you perhaps running this in a Jupyter notebook? The `if not offset` block should use the first soup and should test `offset ==` instead. – QHarr Aug 27 '19 at 20:48
  • No, I'm running the code in Spyder. It works well; I got the 30 reviews. – Nprof Aug 27 '19 at 20:51
  • Perhaps Spyder, like Jupyter, stores variables from earlier runs within the same document context, because the above, if you run it in a new Jupyter notebook for example, will give you _NameError: name 'inf_rest_name' is not defined_ for the reasons we discussed [here](https://stackoverflow.com/questions/57662787/showing-two-diff%C3%A9rents-errors-with-the-same-code-that-i-used-to-scrape-other-pag) – QHarr Aug 27 '19 at 20:54
  • Aaahhh OK, maybe. I don't have Jupyter to test it. The important thing is that your solution, like mine, is working. – Nprof Aug 27 '19 at 21:00
  • Yeah.... I just tested in Spyder and it remembers variables from the session. You can find them in the Variable explorer. That means you ran a different setup at some point which generated the inf_rest_name and rest_eclf variables, and they are being remembered even though the current code doesn't set them. If you remove both those variables in the Variable explorer and re-run the code, you should see the previously masked errors. – QHarr Aug 27 '19 at 21:06
  • aaahh ok! I'm gonna check. – Nprof Aug 28 '19 at 14:00
  • I have a question. All the information is spread across several files and I need it in just one, so I copied and pasted everything into a single file; but when I open that file, the information is not in columns. Is that normal? How can I fix it? Can this file be opened and modified in Python? And if I copy and paste all the information into an Excel workbook, in what format do I have to save it so it can be opened and modified from Python? Thank you in advance for your response. – Nprof Aug 28 '19 at 14:19
  • You can append to existing files rather than write to new ones, and you can open and modify existing files: you just alter the arguments in the function call. https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file – QHarr Aug 28 '19 at 14:58
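The append idea in the last comment can be sketched with the stdlib csv module; the path and rows below are made up for illustration:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "bouclard_demo.csv")  # illustrative path

# First run: create the file with a header and one data row.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    w = csv.writer(f, delimiter=";")
    w.writerow(["name_client", "titre_rev_cl"])
    w.writerow(["Alice", "Super"])

# A later run: mode "a" appends instead of overwriting. Plain utf-8 here,
# because utf-8-sig would write a second BOM in the middle of the file.
with open(path, "a", encoding="utf-8", newline="") as f:
    w = csv.writer(f, delimiter=";")
    w.writerow(["Bob", "Bien"])

# The file now holds the header plus both data rows.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f, delimiter=";"))
print(len(rows))  # 3
```

The same applies with pandas via `df.to_csv(path, mode='a', header=False)` once the file already has its header row.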