Hi guys, I am trying to solve this and I don't really know what to do. I scraped this website, https://www.financialjuice.com/home, and saved the items to my database, which worked successfully.

The issue is that when a scraped item is clicked in my app, it goes to Financial Juice first before going to the original source of the news.

For example, Financial Juice might have a news item they got from the BBC, and my Scrapy spider picks up that item. When you click the URL, it goes to Financial Juice first before going to the BBC.

What do you think I can do? Any suggestions are welcome.

molecules
  • Your question is still a little unclear. What exactly is the issue? – information_interchange Nov 19 '17 at 04:01
  • I want to be able to get the link it's redirected to straight away instead of first visiting financial juice before getting to the actual news source – molecules Nov 19 '17 at 04:04
  • If you check Financial Juice, you will notice that a loading page appears on Financial Juice before the news source finally comes up. – molecules Nov 19 '17 at 04:11

1 Answer

Please share one of the scraped URLs. My assumption is that Financial Juice is not giving you the direct URL but one that redirects. For example, this is a link on the front page:

https://www.financialjuice.com/News/3772381/A-week-end-of-decision-for-Germany.aspx

which loads and then redirects to

http://www.forexlive.com/news/!/a-week-end-of-decision-for-germany-20171118

This helps them keep track of which links were visited from outside the website (social media sharing, etc.) and prevents exactly what you have done.

You will need to run a script that visits the link and then gets the URL after the last redirect.

For example, with urllib2 (Python 2), geturl() gives you the final URL of the opened response:

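import urllib2  # Python 2 only; Python 3 moved this to urllib.request
finalurl = urllib2.urlopen(initialurl, None, 1).geturl()  # initialurl is the scraped link; 1 is the timeout in seconds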
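Since urllib2 exists only in Python 2, here is a minimal sketch of the same idea for Python 3 using urllib.request. The function name `resolve_final_url`, the default timeout, and the User-Agent header are my own assumptions for illustration. Note that this only follows HTTP-level redirects (3xx responses), not redirects done by a script in the page.

```python
import urllib.request


def resolve_final_url(initial_url, timeout=10):
    """Follow HTTP redirects and return the URL of the final response.

    Only HTTP-level (3xx) redirects are followed; a redirect performed
    by JavaScript in the page will not be seen here.
    """
    # Assumption: some sites reject urllib's default User-Agent, so a
    # browser-like one is sent. Adjust or drop this as needed.
    request = urllib.request.Request(
        initial_url, headers={"User-Agent": "Mozilla/5.0"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # geturl() reports the URL actually retrieved after any redirects.
        return response.geturl()
```

You could call `resolve_final_url(item_url)` on each scraped link before saving it to the database, so the stored link already points at the original source.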

If the redirection is done by a script, then you need to use Selenium. See here for a good example. I modified the code below for you and it worked quite well:

from selenium import webdriver
import time

chromepath = '/usr/bin/chromedriver'  # change this to your chromedriver path
driver = webdriver.Chrome(chromepath)  # newer Selenium releases may require passing the path differently
driver.get('https://www.financialjuice.com/News/3772381/A-week-end-of-decision-for-Germany.aspx')

time.sleep(10)  # give the page's script time to perform the redirect
print(driver.current_url)  # the URL the browser ended up on

driver.quit()
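A fixed 10-second sleep is either too long or too short depending on the page. As a rough sketch, you could instead poll `driver.current_url` until the browser has left the original host. The helper name `wait_for_external_url` and its parameters are hypothetical; it only assumes the driver object exposes a `current_url` attribute, as Selenium's WebDriver does. Selenium's own `WebDriverWait` can be used for the same purpose.

```python
import time
from urllib.parse import urlparse


def wait_for_external_url(driver, origin_host, timeout=15.0, poll=0.5):
    """Poll driver.current_url until its host differs from origin_host.

    Returns the new URL, or None if the browser is still on
    origin_host when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = driver.current_url
        host = urlparse(current).hostname or ""
        # A non-empty host other than the origin means the redirect happened.
        if host and host != origin_host:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

With the Selenium code above, this would replace the sleep: call `wait_for_external_url(driver, 'www.financialjuice.com')` after `driver.get(...)` and store the returned URL if it is not None.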
kmcodes