I have used bs4 to crawl some text, and I want to find all URLs in it that start with the following string: https://www.104.com.tw/company/

For example, "https://www.104.com.tw/company/aw5oe14?jobsource=checkc" and "https://www.104.com.tw/company/18sepdbk?jobsource=check"

I am not familiar with regular expressions and have tried:

raw = get_page("https://www.104.com.tw/cust/list/index/?page=2&keyword=%E8%87%AA%E5%8B%95%E5%8C%96&order=1&mode=s&jobsource=checkc")
address = re.findall(r'https://www.104.com.tw/company/[\w]+',raw)
print(address)

# where raw is the crawled text and get_page is a function; both of them work correctly.

It raised the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-18-387fb92bcd6d> in <module>
      1 raw = get_page("https://www.104.com.tw/cust/list/index/?page=2&keyword=%E8%87%AA%E5%8B%95%E5%8C%96&order=1&mode=s&jobsource=checkc")
----> 2 address = re.findall(r'https://www.104.com.tw/company/.*$',raw)
      3 print(address)

/opt/conda/envs/Python36/lib/python3.6/re.py in findall(pattern, string, flags)
    220 
    221     Empty matches are included in the result."""
--> 222     return _compile(pattern, flags).findall(string)
    223 
    224 def finditer(pattern, string, flags=0):

TypeError: expected string or bytes-like object

What regular expression should I use, or is the problem with how I am calling re.findall?

Thanks,

Peter Lin

1 Answer


The simplest would be:

https://www\.104\.com\.tw/company/.+


Your original regex uses [\w]+, which stops at the ? because ? is not part of the \w set (roughly [a-zA-Z0-9_]), so the query string after it is left out of each match.
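A minimal sketch of the difference between these patterns, run against a made-up HTML snippet (sample_html and find_company_urls are hypothetical names, not part of the question's code). Two things are worth noting. First, .+ runs to the end of the line, so on raw HTML it also picks up whatever markup follows the URL; a character class that stops at whitespace, quotes, or angle brackets is one way to avoid that. Second, the TypeError in the traceback is raised before the pattern is ever applied: re.findall only accepts a str or bytes-like object, so it may be that raw is something else (for example None or a BeautifulSoup object). The sketch converts the input to text first, which is an assumption about what get_page returns.

```python
import re

# Hypothetical stand-in for the crawled page; the real `raw` comes from get_page().
sample_html = (
    '<a href="https://www.104.com.tw/company/aw5oe14?jobsource=checkc">A</a>\n'
    '<a href="https://www.104.com.tw/company/18sepdbk?jobsource=check">B</a>\n'
    '<a href="https://www.104.com.tw/cust/list/index/?page=3">next</a>\n'
)

# Dots are escaped so they match a literal "." only.
PREFIX = r'https://www\.104\.com\.tw/company/'


def find_company_urls(raw):
    """Return every URL in `raw` that starts with the company prefix."""
    # re.findall raises "TypeError: expected string or bytes-like object"
    # for any other input type, so normalise to str first (assumption:
    # str() of whatever get_page returns yields the page text).
    if isinstance(raw, bytes):
        text = raw.decode('utf-8', errors='replace')
    else:
        text = str(raw)
    # Stop at whitespace, quotes, or angle brackets so the match ends with the URL.
    return re.findall(PREFIX + r'[^\s"\'<>]+', text)


word_only = re.findall(PREFIX + r'\w+', sample_html)   # stops at "?"
greedy = re.findall(PREFIX + r'.+', sample_html)       # runs to end of line
full_urls = find_company_urls(sample_html)             # URL including query string

print(word_only)
print(greedy)
print(full_urls)
```

If the text has already been reduced to one URL per line, the simpler .+ pattern behaves the same as the character class here.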

vs97