
I am trying to use the re module to pull a URL out of something I have scraped. I am running the code below on the data shown, but it seems to come up empty. I am not very familiar with re. How can I pull out the URL?

match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]

url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', match)

# print(url) just prints both. I only need the one that matches "http://www.stats.gov.cn/tjsj/zxfb/ANYTHINGHERE/ANYTHINGHERE.html"

print(url)

Expected Output = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';"]
Kamikaze_goldfish
  • Do you want to take the URL from the JavaScript files or from the HTML page source? – shadowsheep Nov 05 '18 at 15:58
  • If you just want the href links from the HTML page source, a simple script like this should do the trick: `#!/usr/bin/python try: from BeautifulSoup import BeautifulSoup except ImportError: from bs4 import BeautifulSoup html = """put html here""" parsed_html = BeautifulSoup(html, 'html.parser') for link in parsed_html.find_all('a'): print(link.get('href'))` – shadowsheep Nov 05 '18 at 16:06
  • If you want to get the URL, this will do the trick. I assume here that the URL always starts with http: `http:[^']+` – lucas_7_94 Nov 05 '18 at 16:06
  • That’s easy stuff but the problem is there’s a JavaScript that runs and spits out the url. – Kamikaze_goldfish Nov 05 '18 at 16:08
  • @lucas_7_94 if the OP has to deal with broken or mangled HTML source, it would be better not to use regex. – shadowsheep Nov 05 '18 at 16:08
  • @shadowsheep I’m trying to pull the url from the JavaScript file – Kamikaze_goldfish Nov 05 '18 at 16:38
  • I am already using beautifulsoup to scrape the page data by using findAll. I wouldn’t treat this as a beautifulsoup parse problem but just treat it as extracting the url from a string. – Kamikaze_goldfish Nov 05 '18 at 16:41
  • Ah, okay. In this case a regex approach should be better. But is there any possibility of running this JavaScript inside a fake HTML page of your own? If you only want to extract a URL from a string in Python, that's a duplicate question: https://stackoverflow.com/questions/839994/extracting-a-url-in-python – shadowsheep Nov 05 '18 at 16:41
  • @shadowsheep thanks for the feedback. Could you help with one other thing? I don’t see how to get the specific url out that I used in the original question. How can I get only the urls that have the correct link without getting a ton of the useless urls with findall? – Kamikaze_goldfish Nov 05 '18 at 17:04
  • Do you mean something like that? `for link in parsed_html.find_all('a'): if (re.compile('www.stats.gov').search(link.get('href'))): print(link.get('href'))` In comments... source code is really ugly [-: – shadowsheep Nov 05 '18 at 17:26
  • It would be better if you provided an example input, the expected output you want to obtain, and some code that runs, so that people can test against your current code. – shadowsheep Nov 06 '18 at 08:02
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/183197/discussion-between-kamikaze-goldfish-and-shadowsheep). – Kamikaze_goldfish Nov 06 '18 at 16:22
  • @shadowsheep I figured it out. See below and give me your feedback. – Kamikaze_goldfish Nov 06 '18 at 17:43
  • Happy you figured it out. There are many regexes that could do the job, but if you found one that fits your need we are all happy. – shadowsheep Nov 06 '18 at 18:00

1 Answer


Okay, I found the solution. The .+ matches one or more characters between http://www.stats.gov.cn/ and .html. Thanks for your help with this.

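import re

match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]

# Raw string with escaped dots so each "." matches a literal dot; .+ covers the path before ".html"
url = re.findall(r'http://www\.stats\.gov\.cn/.+\.html', str(match))

print(url)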

Expected Output = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html"]
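
A slightly stricter variant, in case it helps anyone else: because .+ is greedy, it could run across several URLs if more than one ".html" ever appears in the stringified list. The sketch below searches each string on its own and limits each path segment to characters that are not a slash or a quote. It assumes the wanted URLs always sit under /tjsj/zxfb/ with exactly two path segments, as in the question; `candidates`, `URL_PATTERN` and `extract_urls` are names I made up for illustration.

```python
import re

# Sample data copied from the question.
candidates = [
    "http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';",
    "http://www.stats.gov.cn'+urlstr+'",
]

# Assumption: the URL has exactly two path segments after /tjsj/zxfb/.
# Dots are escaped so they match literally; [^/'"]+ matches one path
# segment without running past a slash or a quote character.
URL_PATTERN = re.compile(
    r"http://www\.stats\.gov\.cn/tjsj/zxfb/[^/'\"]+/[^/'\"]+\.html"
)


def extract_urls(strings):
    """Return the first matching URL from each string that contains one."""
    found = []
    for text in strings:
        m = URL_PATTERN.search(text)
        if m:
            found.append(m.group(0))
    return found


urls = extract_urls(candidates)
print(urls)
```

Searching string by string also avoids calling str() on the list, so the quoting of the list's repr never leaks into a match.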
Kamikaze_goldfish