
It looks like the class name on the <img> elements of Instagram's web page changes every day. Right now it is FFVAD, and tomorrow it will be something else. For example (I shortened it; the links are long):

<img class="FFVAD" alt="Tag your best friend" decoding="auto" style="" sizes="293px" src="https://scontent-lax3-2.cdninstagram.com/vp/0436c00a3ac9428b2b8c977b45abd022/5BAB3EBC/t51.2885-15/s640x640/sh0.08/e35/33110483_592294374461447_8669459880035221504_n.jpg">

Because of that, I have to edit the script and hardcode the new class name each time in order to be able to scrape the page.

var = driver.find_elements_by_class_name('FFVAD')

Somebody told me that I could use img.get_attribute('class') to find the class name and store it for later. But I still don't understand how to do that, i.e. how Selenium or Beautiful Soup could grab the class name from the HTML tag so that I can store it or parse it later.

All I have now is this. It's a little dirty and not quite right, but the idea is there.

import time

import selenium.webdriver as webdriver

url = 'https://www.instagram.com/kitties'
scroll_delay = 2  # assumed value; seconds to wait for new posts after each scroll

driver = webdriver.Firefox()
driver.get(url)
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Hardcoded class name: this is the part that breaks when the class changes
    imgs_dedupe = driver.find_elements_by_class_name('FFVAD')

    for img in imgs_dedupe:
        posts = img.get_attribute('class')
        print(posts)

    # Scroll to the bottom and give the page time to load more content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    # Stop once scrolling no longer makes the page any longer
    if new_height == last_height:
        break
    last_height = new_height

When I run it, I get this output. Because there are 3 images on the page, the class name is printed 3 times:

python tag_print.py 
FFVAD
FFVAD
FFVAD
ivan_pozdeev
P_n
  • Instagram requires registration to access anything whatsoever, so I cannot give a concrete example. – ivan_pozdeev May 31 '18 at 20:35
  • It can be viewed only if you inspect the element. No need to be registered or logged in – P_n May 31 '18 at 20:38
  • Oh, so the front page can be used, too. That changes matters. – ivan_pozdeev May 31 '18 at 20:39
  • Yes, you can just go to https://www.instagram.com/kitties and see all the content, as long as the profile is public – P_n May 31 '18 at 20:40
  • Find the image with `alt="Tag your best friend"`, get its class, then use that to search for other elements with the same class. – Barmar May 31 '18 at 21:31
  • I thought about it too, but the script I made is interactive, and as you know all users or profiles are unique. In this case when I decide to scrape a different page, the class can be different too – P_n May 31 '18 at 21:33
  • How about `driver.find_elements_by_tag_name('img')`? – Barmar May 31 '18 at 21:34
  • You know they're doing this deliberately to make things hard for web scrapers like you, right? – Barmar May 31 '18 at 21:37
  • this is just for my own personal use, and good practice – P_n May 31 '18 at 21:38
  • @Barmar `driver.find_elements_by_tag_name('img')` kinda works. It gets all the tags containing `img`. Is it even possible to target the first image only? So it grabs that one class id – P_n May 31 '18 at 21:44
  • `imgs_dedupe[0]` should be the first one. – Barmar May 31 '18 at 21:50
  • Try `driver.find_element_by_css_selector('img[alt="Tag your best friend"]')` – yong May 31 '18 at 23:50

2 Answers


You're currently searching for the element by a hardcoded class name.

If the class name is randomized, you cannot hardcode it any longer. You have to either:

  • Search the element by some other characteristics (e.g. element hierarchy, some other attributes, etc; XPath can do that)

    In [10]: driver.find_elements_by_xpath('//article//img')
    Out[10]:
    [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="55c48964-8cd0-4472-b35b-214a5a9bfbf7")>,
     <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="b7f7c8a4-e343-49ca-b416-49f72e67ae07")>,
     <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="728f6148-6a03-4c9a-9933-36859d65eb51")>]
    
    • You can also search by the element's visual characteristics: size, visibility, position. This cannot be done with XPath alone, though; you'll have to get all <img> tags and inspect each one with JS by hand.
      (See the example below; it's too long to put here.)
  • Learn this class name somehow from other page logic (it must be present somewhere else if the page's own logic can find and use it, that logic must in turn be found by something else, and so on)

    In this case, the class name is a part of a local variable in the renderImage function, so it's only salvageable via DOM by exploring its AST. The function itself is buried somewhere inside webpack machinery (it seems to pack all resources into a few global objects with one-letter names). Alternatively, you can read all included JS files as raw data and look for the definition of renderImage in them. So, in this case, it's disproportionately hard, though still theoretically possible.
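
As a minimal sketch of the first option, the snippet below selects images by their place in the element hierarchy and reads the class off whatever it finds, instead of hardcoding it. It runs the standard library's ElementTree over a made-up, well-formed fragment; the markup, the class name `avatar-class`, and the names `SAMPLE_PAGE` and `image_classes` are assumptions for illustration only. Against the live page, the same `//article//img` query would go through Selenium as shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the profile page markup.
SAMPLE_PAGE = """
<html><body>
  <header><img class="avatar-class" alt="profile picture" src="avatar.jpg"/></header>
  <article>
    <div><a><img class="FFVAD" alt="first" src="1.jpg"/></a></div>
    <div><a><img class="FFVAD" alt="second" src="2.jpg"/></a></div>
    <div><a><img class="FFVAD" alt="third" src="3.jpg"/></a></div>
  </article>
</body></html>
"""


def image_classes(markup):
    """Return the class attribute of every <img> nested inside an <article>."""
    root = ET.fromstring(markup.strip())
    # Same idea as the XPath '//article//img': match on structure, not on class.
    return [img.get("class") for img in root.findall(".//article//img")]


classes = image_classes(SAMPLE_PAGE)
print(classes)
```

The header image is skipped because it does not sit inside an `<article>`, so whatever the post images' class happens to be today, it is read from the page rather than written into the script.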


Example of getting elements by visual characteristics

On any page whatsoever, this would find 3 images of the same size, located side by side (this is the way they are at https://www.instagram.com/kitties).

Since HTMLElements can't be passed to Python directly (at least, I couldn't find any way to), we need to pass some unique IDs instead to locate them by, like unique XPaths.

(The JS code could probably be more elegant, I don't have much experience with the language)

In [22]: script = """
  //https://stackoverflow.com/questions/2661818/javascript-get-xpath-of-a-node/43688599#43688599
  function getXPathForElement(element) {
      const idx = (sib, name) => sib 
          ? idx(sib.previousElementSibling, name||sib.localName) + (sib.localName == name)
          : 1;
      const segs = elm => !elm || elm.nodeType !== 1 
          ? ['']
          : elm.id && document.querySelector(`#${elm.id}`) === elm
              ? [`id("${elm.id}")`]
              : [...segs(elm.parentNode), `${elm.localName.toLowerCase()}[${idx(elm)}]`];
      return segs(element).join('/');
  }

  //https://plainjs.com/javascript/styles/get-the-position-of-an-element-relative-to-the-document-24/
  function offsetTop(el){
    return window.pageYOffset + el.getBoundingClientRect().top;
  }

  var expected_images=3;
  var found_groups=new Map();
  for (e of document.getElementsByTagName('img')) {
    let group_id = e.offsetWidth + "x" + e.offsetHeight;
    if (!(found_groups.has(group_id))) found_groups.set(group_id,[]);
    found_groups.get(group_id).push(e);
  }
  for ([k,v] of found_groups) {
    if (v.length != expected_images) {found_groups.delete(k);continue;}
    var offset_top = offsetTop(v[0]);
    for (e of v){
      let _c_oft = offsetTop(e);
      if (_c_oft !== offset_top){
        found_groups.delete(k);
        break;
      }
    }
  }
  if (found_groups.size != 1) {
    console.log(found_groups);
    throw 'Unexpected pattern of images after filtering';
  }

  var found_group = found_groups.values().next().value;


  result=[]
  for (e of found_group) {
    result.push(getXPathForElement(e));
  }
  return result;
"""

In [23]: driver.execute_script(script)
Out[23]:
[u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[1]/a[1]/div[1]/div[1]/img[1]',
 u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[2]/a[1]/div[1]/div[1]/img[1]',
 u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[3]/a[1]/div[1]/div[1]/img[1]']

In [27]: [driver.find_element_by_xpath(xp) for xp in _]
Out[27]:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="55c48964-8cd0-4472-b35b-214a5a9bfbf7")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="b7f7c8a4-e343-49ca-b416-49f72e67ae07")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="728f6148-6a03-4c9a-9933-36859d65eb51")>]
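
A rough alternative that stays in Python: Selenium's WebElement objects expose `size` and `location` dictionaries, so the same filtering (same size, same top offset, exactly three of them) can probably be approximated without the JS round trip. This is only a sketch under that assumption; `find_image_row` and `FakeImage` are hypothetical names, and the sample numbers are made up so the logic can be checked without a browser.

```python
from collections import defaultdict, namedtuple


def find_image_row(elements, expected_images=3):
    """Return the single group of same-sized elements that share a top offset.

    `elements` may be Selenium WebElements or any objects that offer the same
    `.size` ({'width', 'height'}) and `.location` ({'x', 'y'}) attributes.
    """
    by_size = defaultdict(list)
    for el in elements:
        by_size[(el.size["width"], el.size["height"])].append(el)

    # Keep only groups of the expected length whose members sit on one line.
    rows = [
        group for group in by_size.values()
        if len(group) == expected_images
        and len({el.location["y"] for el in group}) == 1
    ]
    if len(rows) != 1:
        raise ValueError("Unexpected pattern of images after filtering")
    return rows[0]


# Stand-in objects for a quick check without a browser (made-up numbers).
FakeImage = namedtuple("FakeImage", "name size location")
page_images = [
    FakeImage("avatar", {"width": 150, "height": 150}, {"x": 40, "y": 80}),
    FakeImage("post1", {"width": 293, "height": 293}, {"x": 0, "y": 400}),
    FakeImage("post2", {"width": 293, "height": 293}, {"x": 321, "y": 400}),
    FakeImage("post3", {"width": 293, "height": 293}, {"x": 642, "y": 400}),
]
row = find_image_row(page_images)
print([img.name for img in row])
```

With a real session, the list of all `<img>` elements on the page would be passed in place of `page_images`.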
ivan_pozdeev

So I managed to get it using this (outside of the loop, of course):

get_img_class = driver.find_elements_by_tag_name('img')[1].get_attribute('class')

Just like that, I am able to read the class name and store it for later use. Thanks so much for everyone's help. All the ideas are great and noted for later.
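
For reference, the idea can be wrapped in a small helper. This is a sketch, not a tested scraper: `learn_image_class` is a hypothetical name, and the default `index=1` only mirrors the line above (presumably the first `<img>` on the page is not a post image, which is an assumption). It uses the generic `find_elements(by, value)` call with the `"tag name"` strategy, which exists across Selenium versions.

```python
def learn_image_class(driver, index=1):
    """Read the class attribute off one <img> on the page currently loaded."""
    # "tag name" is the locator strategy string behind By.TAG_NAME.
    images = driver.find_elements("tag name", "img")
    if len(images) <= index:
        raise LookupError(
            "Needed an <img> at index %d, but found only %d" % (index, len(images))
        )
    return images[index].get_attribute("class")
```

The returned value can then replace the hardcoded `'FFVAD'` in the scroll loop. If the attribute ever holds several space-separated class names, it would need to be split first, since a class-name lookup expects a single name.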

P_n