I am currently taking a course in Big Data but do not understand much of it. For an assignment, I would like to find out which topics are discussed on the TripAdvisor forum about Amsterdam. I want to create a CSV file containing the topic, the author, and the number of replies per topic. Some questions:
- How can I make a list of all the topics? I checked the page source for all the pages, and the topic is always stated after 'onclick="setPID(34603)"' and ends with '</a>'. I tried re.findall(r'onclick="setPID(34603)">(.*?)</a>', post), but it's not working.
- The replies are not given in the comment section, but in a separate row on the page. How can I make a loop and append all the replies to a new variable?
- How do I loop over the first 20 pages? The URL in my code only covers the first page, which gives 20 topics.
- Do I create the CSV file before or after the looping?
Here is my code:
from urllib import request
import re
import csv

topiclist = []
metalist = []

# Download the first forum page, sending a browser-like User-Agent header
req = request.Request(
    'https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html',
    headers={'User-Agent': "Mozilla/5.0"})
tekst = request.urlopen(req).read()
tekst = tekst.decode(encoding="utf-8", errors="ignore").replace("\n", " ").replace("\t", " ")

# Cut the page into one chunk per topic row
topicsection = re.findall(r'<b><a(.*?)</div>', tekst)

# This is the pattern that is not working: it returns an empty list for every post
topic = []
for post in topicsection:
    topic.append(re.findall(r'onclick="setPID(34603)">(.*?)</a>', post))

author = []
for post in topicsection:
    author.append(re.findall(r'<a href="/members-forums/.*?">(.*?)</a>', post))

# The reply counts sit in a separate table cell, outside the topic chunks
replies = re.findall(r'<td class="reply rowentry.*?">(.*?)</td>', tekst)
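
A minimal sketch of how the four questions might fit together, not a tested scraper. The likely reason the topic pattern finds nothing is that the parentheses in setPID(34603) are regex metacharacters: unescaped, they form a capture group instead of matching literal brackets, so they are escaped below. Several things here are assumptions rather than facts about the site: the "-o{offset}-" URL segment used for pagination, the helper names page_urls, parse_page and write_csv, the output filename topics.csv, and the made-up HTML fragment that stands in for a downloaded page so the example runs offline.

```python
import csv
import io
import re

# Hypothetical URL pattern: the "-o{offset}-" segment is an assumption about
# how the forum paginates (20 topics per page); check it against the real
# "next page" links before relying on it.
BASE = ('https://www.tripadvisor.com/ShowForum-g188590-i60-'
        '{offset}Amsterdam_North_Holland_Province.html')


def page_urls(pages=20, per_page=20):
    """Build one URL per forum page; the first page has no offset segment."""
    urls = []
    for i in range(pages):
        offset = '' if i == 0 else 'o{}-'.format(i * per_page)
        urls.append(BASE.format(offset=offset))
    return urls


# Parentheses are escaped so they match literally; \d+ is used instead of a
# fixed 34603 in case the ID differs between topics.
TOPIC_RE = re.compile(r'onclick="setPID\(\d+\)[^>]*>(.*?)</a>')
AUTHOR_RE = re.compile(r'<a href="/members-forums/.*?">(.*?)</a>')
REPLY_RE = re.compile(r'<td class="reply rowentry.*?">(.*?)</td>')


def parse_page(html):
    """Return (topic, author, replies) tuples found in one page of HTML."""
    topics = [t.strip() for t in TOPIC_RE.findall(html)]
    authors = [a.strip() for a in AUTHOR_RE.findall(html)]
    replies = [r.strip() for r in REPLY_RE.findall(html)]
    # zip pairs the three lists by position and stops at the shortest one
    return list(zip(topics, authors, replies))


def write_csv(rows, handle):
    """Write a header plus all collected rows to an open file handle."""
    writer = csv.writer(handle)
    writer.writerow(['topic', 'author', 'replies'])
    writer.writerows(rows)


# Made-up fragment standing in for a downloaded page
sample = ('<b><a href="/ShowTopic-x.html" onclick="setPID(34603)">Canal tour tips</a></b>'
          '<a href="/members-forums/anna">anna</a></div>'
          '<td class="reply rowentry first"> 12 </td>')

all_rows = []
for html in [sample]:  # with real pages: download each URL from page_urls()
    all_rows.extend(parse_page(html))

out = io.StringIO()  # stands in for open('topics.csv', 'w', newline='')
write_csv(all_rows, out)
```

In this arrangement the rows are collected inside the loop and the CSV is written once afterwards. Opening the file before the loop and writing each page's rows as they arrive would work as well; what matters is that the file is opened only once, so earlier pages are not overwritten.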