I used bs4 to crawl some text, and I want to find all URLs that start with the following string: https://www.104.com.tw/company/
For example, “https://www.104.com.tw/company/aw5oe14?jobsource=checkc” and “https://www.104.com.tw/company/18sepdbk?jobsource=check”.
I am not familiar with regular expressions and have tried:
raw = get_page("https://www.104.com.tw/cust/list/index/?page=2&keyword=%E8%87%AA%E5%8B%95%E5%8C%96&order=1&mode=s&jobsource=checkc")
address = re.findall(r'https://www.104.com.tw/company/[\w]+',raw)
print(address)
# raw is the crawled text and get_page is my own function; both work correctly.
It raised the following error:
TypeError Traceback (most recent call last)
<ipython-input-18-387fb92bcd6d> in <module>
1 raw = get_page("https://www.104.com.tw/cust/list/index/?page=2&keyword=%E8%87%AA%E5%8B%95%E5%8C%96&order=1&mode=s&jobsource=checkc")
----> 2 address = re.findall(r'https://www.104.com.tw/company/.*$',raw)
3 print(address)
/opt/conda/envs/Python36/lib/python3.6/re.py in findall(pattern, string, flags)
220
221 Empty matches are included in the result."""
--> 222 return _compile(pattern, flags).findall(string)
223
224 def finditer(pattern, string, flags=0):
TypeError: expected string or bytes-like object
What regular expression should I use, or is the problem with how I am calling re.findall?
Thanks,
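A note on the traceback: "expected string or bytes-like object" is raised by re.findall when its second argument is not a str or bytes, so the pattern itself is probably not what fails. The sketch below is only an illustration under assumptions: it assumes raw is some non-string object (for example a parsed bs4 object) that can be converted with str(), and it uses a made-up sample_html string and a hypothetical helper find_company_urls in place of the real get_page output. It also escapes the dots in the pattern and lets the match continue through the query string, since [\w]+ stops at the "?".

```python
import re

# Hypothetical stand-in for the crawled page; the real `raw` comes from get_page.
sample_html = (
    '<a href="https://www.104.com.tw/company/aw5oe14?jobsource=checkc">A</a>'
    '<a href="https://www.104.com.tw/company/18sepdbk?jobsource=check">B</a>'
    '<a href="https://www.104.com.tw/job/abc">C</a>'
)

# Escaped dots match a literal "." only. The character class keeps matching
# until a quote, whitespace, or angle bracket, so the query string is included.
COMPANY_URL = re.compile(r'https://www\.104\.com\.tw/company/[^"\'\s<>]+')


def find_company_urls(raw):
    """Return every company URL found in raw (hypothetical helper)."""
    # re needs str or bytes; str() converts other objects before matching.
    return COMPANY_URL.findall(str(raw))


urls = find_company_urls(sample_html)
print(urls)
```

If get_page can return None on a failed request, str(raw) would turn that into the text "None" and silently yield an empty list, so checking type(raw) before matching may be worthwhile.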