
I want to ask some questions.

I am using Python 3.7.6 with Selenium WebDriver to build a web crawler.

I wrote the crawler in Visual Studio Code, and it outputs a CSV file.

I used `find_elements_by_xpath` to grab some information. Here is the relevant part of my code:

from datetime import date,datetime
from selenium import webdriver # load the Selenium WebDriver
from selenium.webdriver.common.keys import Keys # load keyboard keys
from bs4 import BeautifulSoup # load BeautifulSoup
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
import numpy as np
import xlrd
import csv
import codecs
import time

data = xlrd.open_workbook('B.xlsx')
table = data.sheets()[0]
print(table)
nrows = table.nrows # number of rows in the sheet
ncols = table.ncols # number of columns in the sheet
print(ncols)
print(nrows)
for i in range(1,nrows):
    csv_post="Post_No_" + str(i) + ".csv"
    with open(csv_post, 'a', newline='', encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Author','Post time','Content'])
    print_link = table.cell_value(i,3)
    print(i)
    print(print_link)
    driver_blank=webdriver.Chrome('./chromedriver') # use Chrome via chromedriver as the crawling browser
    driver_blank.get(print_link)
    time.sleep(1)
    post_page_count  = len(driver_blank.find_elements_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/div[4]/div[2]/div[2]/select/option"))

    if(post_page_count != 0):
        try_value=1
        while(try_value):
            try:
                driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/div[5]/table[2]")
                print("Page rendered correctly")
                try_value=0
            except NoSuchElementException as e:
                print("Page failed to render; refreshing now")
                driver_blank.refresh()
                time.sleep(10)
        print("Total pages: "+str(post_page_count))
        table_rows=len(driver_blank.find_elements_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/div[5]/table"))
        print("There are "+str(table_rows)+" tables")

        real_table_rows=table_rows+1

        #only 1
        post_author = driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/div[5]/table[1]/tbody/tr[2]/td[1]/a")
        post_content = driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/div[5]/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/div")
        post_time = driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/div[5]/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[4]/td/div[2]/span")
        print("Author: "+post_author.text)
        print("Content:")
        print(post_content.text)
        print("Post time: "+post_time.text)
        print("<<< --- >>>")
        with open(csv_post, 'a', newline='', encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow([post_author.text,post_time.text,post_content.text])


The following is the forum post: (https://forumd.hkgolden.com/view.aspx?type=MB&message=7197409)


I want to capture the text, emoji, and images. I can capture only the text; I cannot capture the emoji or images. I don't know what to do. Can anyone help me? Thank you.
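One likely cause: `element.text` returns only the rendered text of an element, so emoji and pictures (which forums typically serve as `<img>` tags) never appear in it. A sketch of one workaround, assuming the post body is an ordinary HTML fragment: take the content element's `outerHTML` and pull the `src` of every `<img>` out of it with BeautifulSoup (already imported in the code above). The `extract_post_parts` helper name and the sample HTML below are illustrative, not taken from the forum:

```python
from bs4 import BeautifulSoup

def extract_post_parts(html):
    """Return (text, image_urls) from a post-content HTML fragment.

    Emoji are usually <img> tags too, so they appear in image_urls.
    """
    soup = BeautifulSoup(html, "html.parser")
    image_urls = [img.get("src") for img in soup.find_all("img")]
    text = soup.get_text(strip=True)
    return text, image_urls

# With Selenium, you would pass the element's HTML, e.g.:
#   html = post_content.get_attribute("outerHTML")
sample = '<div>Hello<img src="/faces/smile.gif">world</div>'
print(extract_post_parts(sample))  # → ('Helloworld', ['/faces/smile.gif'])
```

The returned URLs could then be written to the CSV alongside the text, or downloaded separately.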

Jiun Jung Lin
  • Yes, we can help you, but please don't make us rewrite all [your code from an image](https://i.stack.imgur.com/zNQEa.jpg) – https://meta.stackoverflow.com/a/285557/4539709 – 0m3r Apr 05 '20 at 08:09
  • Sorry, I edited the question already~~~ – Jiun Jung Lin Apr 05 '20 at 09:31
  • This may help you. By the way, I suggest trying requests and BeautifulSoup; they're more stable and faster than selenium: https://stackoverflow.com/questions/17361742/download-image-with-selenium-python – xiaoming Apr 12 '20 at 08:12

0 Answers