I am currently following a course in Big Data but do not understand much of it. For an assignment, I would like to find out which topics are discussed on the TripAdvisor forum about Amsterdam. I want to create a CSV file containing the topic, the author, and the number of replies per topic. Some questions:

  1. How can I make a list of all the topics? I checked the page source for all the pages, and the topic title always comes right after `onclick="setPID(34603)"` and ends with `</a>`. I tried `re.findall(r'onclick="setPID(34603)">(.*?)</a>', post)` but it's not working.
  2. The replies are not given in the comment section, but in a separate row on the page. How can I make a loop and append all the replies to a new variable?
  3. How do I loop over the first 20 pages? The URL in my code only includes the 1st page, giving 20 topics.
  4. Do I create the CSV file before or after the looping?

Here is my code:

from urllib import request
import re
import csv

topiclist=[]
metalist=[]

req = request.Request(
    'https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html',
    headers={'User-Agent': "Mozilla/5.0"})

tekst=request.urlopen(req).read()
tekst=tekst.decode(encoding="utf-8",errors="ignore").replace("\n"," ").replace("\t"," ")


topicsection=re.findall(r'<b><a(.*?)</div>',tekst)

topic=[]
for post in topicsection:
   topic.append(re.findall(r'onclick="setPID(34603)">(.*?)</a>', post))


author=[]
for post in topicsection: 
   author.append(re.findall(r'<a href="/members-forums/.*?">(.*?)</a>', post))

replies=re.findall(r'<td class="reply rowentry.*?">(.*?)</td>',tekst)
cwallenpoole
Tessa
  • As much as I hate to say it, if you're scraping web pages you're probably going to have best luck using `xml.dom` – Mr. Dave May 15 '16 at 17:48
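
On question 1, one likely reason the attempted pattern finds nothing: in a regular expression, unescaped parentheses form a capture group rather than matching literal `(` and `)`, so the pattern is looking for `setPID34603`. The sketch below shows the difference on a made-up fragment (the `post` string and its `href` are hypothetical stand-ins, not copied from the real page).

```python
import re

# Hypothetical fragment standing in for one topic link from the forum page.
post = ('<b><a href="/ShowTopic-example.html" '
        'onclick="setPID(34603)">Taste of Amsterdam festival</a></b>')

# Unescaped: "(34603)" is a capture group, so the regex expects "setPID34603".
unescaped = re.findall(r'onclick="setPID(34603)">(.*?)</a>', post)

# Escaped: "\(" and "\)" match the literal parentheses in the page source.
escaped = re.findall(r'onclick="setPID\(34603\)">(.*?)</a>', post)

print(unescaped)
print(escaped)
```

Escaping gets the pattern to match, but it does not make regex-based HTML parsing any less brittle.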

1 Answer

Don't use regular expressions to parse HTML. Use an HTML parser such as BeautifulSoup.

For example:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html")
soup = BeautifulSoup(r.content, "html.parser")  # or another parser such as lxml
topics = soup.find_all("a", {'onclick': 'setPID(34603)'})
# do stuff with topics
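
As a rough sketch of how the remaining questions (author, reply count, CSV) might fit together: collect one row per topic while looping, then write the CSV once at the end. The markup in `SAMPLE_HTML` is hypothetical, modelled only on the patterns quoted in the question (the `setPID` link, the `reply rowentry` cell, the `/members-forums/` link); the real page layout may differ, and the helper names `extract_rows` and `write_csv` are made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the patterns quoted in the question;
# the real TripAdvisor page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr>
    <td><b><a href="/ShowTopic-1.html" onclick="setPID(34603)">Taste of Amsterdam festival</a></b></td>
    <td class="reply rowentry">4</td>
    <td><a href="/members-forums/alice">alice</a></td>
  </tr>
  <tr>
    <td><b><a href="/ShowTopic-2.html" onclick="setPID(34603)">Canal cruise tips</a></b></td>
    <td class="reply rowentry">12</td>
    <td><a href="/members-forums/bob">bob</a></td>
  </tr>
</table>
"""


def is_member_link(href):
    """True for href values that point at a member profile."""
    return bool(href) and href.startswith("/members-forums/")


def extract_rows(html):
    """Return one dict per topic: topic title, author, reply count."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a", onclick="setPID(34603)"):
        tr = link.find_parent("tr")  # assumes each topic sits in its own table row
        if tr is None:
            continue
        reply_cell = tr.find("td", class_="reply")
        author_link = tr.find("a", href=is_member_link)
        rows.append({
            "topic": link.get_text(strip=True),
            "author": author_link.get_text(strip=True) if author_link else "",
            "replies": reply_cell.get_text(strip=True) if reply_cell else "",
        })
    return rows


def write_csv(rows, fileobj):
    """Write the collected rows in one go, after all looping is done."""
    writer = csv.DictWriter(fileobj, fieldnames=["topic", "author", "replies"])
    writer.writeheader()
    writer.writerows(rows)


rows = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a real file
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, something like `open("topics.csv", "w", newline="", encoding="utf-8")` could be passed to `write_csv`.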
Community
Pythonista
  • See [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) and [this one](http://stackoverflow.com/a/1758162/2791611) for reasons why. – amiller27 May 15 '16 at 17:39
  • Thanks for adding the links! I was actually looking for the first one earlier. That's a gem of an answer. – Pythonista May 15 '16 at 17:40
  • Thank you very much! I did try BeautifulSoup as well, but with different code. When I print the topics, I do get to see them, but the output still includes the URLs to the topics. For instance: Taste of Amsterdam festival. Would it be possible to print ONLY 'Taste of Amsterdam festival'? – Tessa May 15 '16 at 19:28
  • Just call `.text` on the BS object. So, if you iterate over `topics`: `for topic in topics: print(topic.text)` – Pythonista May 15 '16 at 19:33
  • That's indeed working, great! Is it also best to use BeautifulSoup for looping through different pages? I tried what is explained here: http://stackoverflow.com/questions/27752860/beautifulsoup-looping-through-urls but it doesn't seem to work. – Tessa May 15 '16 at 19:49
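
On looping over pages: BeautifulSoup only parses one document at a time, so the loop happens over URLs, with one fetch and one parse per page. A minimal sketch follows. It assumes, without confirmation from this thread, that later pages are addressed by an offset segment such as `o20-` inserted into the URL; check the "next page" link in a browser before relying on that. `page_urls` and `scrape_pages` are hypothetical helper names.

```python
# ASSUMPTION: later pages use an "o<offset>-" segment (o20-, o40-, ...).
# This is a guess at the URL scheme, not something stated in the thread.
BASE = ("https://www.tripadvisor.com/ShowForum-g188590-i60-"
        "{offset}Amsterdam_North_Holland_Province.html")


def page_urls(pages=20, per_page=20):
    """Build the URL for each page; the first page has no offset segment."""
    urls = []
    for page in range(pages):
        offset = page * per_page
        segment = "" if offset == 0 else f"o{offset}-"
        urls.append(BASE.format(offset=segment))
    return urls


def scrape_pages(urls, fetch, parse):
    """Fetch each URL, parse it, and collect all rows in a single list.

    fetch: callable taking a URL and returning the page HTML as a string.
    parse: callable taking HTML and returning a list of rows.
    """
    rows = []
    for url in urls:
        rows.extend(parse(fetch(url)))
    return rows


urls = page_urls(pages=3)
for url in urls:
    print(url)
```

Here `fetch` could be something like `lambda url: requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text`, and `parse` whatever function turns one page's HTML into rows; pausing briefly between requests with `time.sleep` is a reasonable courtesy to the site.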