Has anyone had success building scraping software in an Azure Function? It needs to handle dynamic content loading, the way the WebBrowser control or Selenium does, where all content is loaded before scraping starts. Selenium does not seem to be an option due to the nature of Azure Functions.

I am trying to scrape some web pages and extract content. The pages are pretty dynamic: first the HTML is loaded, and then the data is lazy-loaded through JavaScript. With a standard HTTP request I will not get the data. I could use the WebBrowser control in .NET and wait for the ready state, but that control requires a browser and cannot be used in an Azure Function. HtmlAgilityPack might be the right answer. I tried it 5 years ago, and at that point it was pretty terrible at formatting HTML. I can see they have some kind of JavaScript library that could be worth a try. Have you tried using that part of HtmlAgilityPack?
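To illustrate the problem, here is a minimal sketch (in Python for brevity; the same thing happens with any plain HTTP client). The two HTML snippets and the `item` class are made up for illustration: the first stands in for what a standard HTTP request returns, the second for the DOM after the page's script has run in a browser.

```python
from html.parser import HTMLParser

# Hypothetical markup: what a plain HTTP GET returns before any script runs.
INITIAL_HTML = """
<html><body>
  <ul id="items"></ul>
  <script src="/load-items.js"></script>
</body></html>
"""

# Hypothetical markup: the same page after a browser has executed the script.
RENDERED_HTML = """
<html><body>
  <ul id="items">
    <li class="item">Alpha</li>
    <li class="item">Beta</li>
  </ul>
  <script src="/load-items.js"></script>
</body></html>
"""


class ItemCollector(HTMLParser):
    """Collects the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "li" and ("class", "item") in attrs:
            self._in_item = True
            self.items.append("")

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False


def extract_items(html):
    """Return the stripped text of each item found in the given HTML."""
    parser = ItemCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.items]
```

Parsing `INITIAL_HTML` finds no items at all, while parsing `RENDERED_HTML` finds both. A static HTML parser cannot close that gap on its own, because the data only appears once the script has run.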

Vladislav
Thomas Segato
  • It really depends on what you understand by web scraping, i.e. what the content you'd like to scrape looks like. I've successfully used the HtmlAgilityPack library to scrape static HTML content – silent Jul 30 '19 at 18:35
  • More details needed for a helpful answer – Alex Gordon Jul 30 '19 at 18:37
  • Thanks to both of you for the response. I have edited the post with more info. Let me know if you need more. – Thomas Segato Jul 31 '19 at 15:30
  • pick one small issue to work on, get it done without using azure functions, and then we can talk about using functions -- your question is still too broad to be answered and will be closed – Alex Gordon Jul 31 '19 at 16:58
  • I hope not. I can solve the problem using Selenium and the WebBrowser control. It needs to be compatible with non-GUI services. – Thomas Segato Jul 31 '19 at 17:39

1 Answer

Your question is purely a .NET/C# one (at least I assume you use .NET with C#). Please refer to this answer. If you can achieve your goal in some way via .NET, you can do it in an Azure Function; there are no restrictions on that side.

You will certainly need a third-party library that simulates a web browser. As far as I know, Selenium works by controlling a real browser through browser-specific "drivers"; this could be worth researching more thoroughly.

I have faced a similar request (and soon will again) and found no obvious solution. My expectation is that a dedicated external service should do the rendering and then send the result to an Azure HTTP-triggered function, which proceeds with the analysis. This "service" could even expose a Web API so that it can be consumed from anywhere (e.g. from an Azure Function).
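One way to picture that split, as a rough sketch only (in Python for brevity; the same shape applies in C#). The rendering service, its URL and its JSON request format are assumptions made for illustration, not an existing API, and `analyze` is a placeholder for whatever extraction you actually need.

```python
import json
import re
import urllib.request
from typing import Callable

# A renderer takes a page URL and returns the fully rendered HTML.
Renderer = Callable[[str], str]


def make_http_renderer(service_url: str) -> Renderer:
    """Build a renderer that calls a (hypothetical) external browser service.

    Assumption: the service accepts POST {"url": ...} and responds with the
    rendered HTML as the response body.
    """

    def render(page_url: str) -> str:
        body = json.dumps({"url": page_url}).encode("utf-8")
        request = urllib.request.Request(
            service_url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=60) as response:
            return response.read().decode("utf-8")

    return render


def analyze(html: str) -> dict:
    """Placeholder analysis step: pull out the page title, if any."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip() if match else None}


def handle_scrape(page_url: str, render: Renderer) -> dict:
    """What the function body would do: render remotely, analyze locally."""
    html = render(page_url)
    return {"url": page_url, **analyze(html)}
```

Because the renderer is passed in, the function itself never needs a browser. In production it would get `make_http_renderer(...)` pointed at the external service, and for local testing a stub such as `lambda url: "<title>Demo</title>"` is enough.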

Vladislav
  • Yeah, I think you're right. I probably need to set up some Selenium server and then fetch the data from there. I think it will be hard to avoid a real browser driver. – Thomas Segato Aug 01 '19 at 07:23