
Description of the situation: I have a script that scrolls inside a frame in order to extract information.

<ul>

<li> </li>
<li> </li>
<li> </li>
<li> </li>
<li> </li>
...
</ul> 

The list is about 30 <li> </li> items long; when scrolling, no new items are added, the existing ones are only updated. The DOM structure does not grow.

Explaining the problem: when the script scrolls, it must extract all of the <li> </li> elements on every iteration, because they are renewed.

Here is the scrolling and extraction logic. The code I use:

import time
from typing import List

from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement

SCROLL_PAUSE_TIME = 5

# Get the initial scroll height of the list viewport
last_height = driver.execute_script("return document.querySelector('div[data-tid=\"pane-list-viewport\"]').scrollHeight;")

all_msgs_loaded = False

while not all_msgs_loaded:

    # Re-fetch the <li> elements on every pass, since they are renewed
    li_elements: List[WebElement] = driver.find_elements(By.XPATH, "//li[@data-tid='pane-item']")

    # Scroll the first <li> into view
    driver.execute_script("document.querySelector('li[data-tid=\"pane-item\"]').scrollIntoView();")

    # Wait for the page to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate the new scroll height and compare it with the last one
    new_height = driver.execute_script("return document.querySelector('div[data-tid=\"pane-list-viewport\"]').scrollHeight;")
    if new_height == last_height:
        all_msgs_loaded = True
    last_height = new_height

On each iteration li_elements receives about 30 WebElements. If I comment out the line with find_elements(), the script runs for hours without any increase in RAM consumption. Note that I do not store anything at runtime, and there is no growth in consumption anywhere else.
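
For reference, a minimal sketch of how the growth can be logged per iteration, assuming the psutil package is installed (log_rss is a hypothetical helper, not part of my script; if the growth is in the browser process rather than in Python, it will only show up in Task Manager):

import os
import psutil

def log_rss(label: str) -> None:
    # Resident set size of the current Python process, in MiB
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    print(f"{label}: {rss_mib:.1f} MiB")

# inside the while loop:
#     log_rss("before find_elements")
#     li_elements = driver.find_elements(By.XPATH, "//li[@data-tid='pane-item']")
#     log_rss("after find_elements")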

Another way I used to get li_elements is through self._driver.execute_script().

Example:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

li_elements = self._driver.execute_script(
    "return document.querySelectorAll('li[data-tid=\"pane-item\"]');",
    WebDriverWait(self._driver, 20).until(
        EC.visibility_of_element_located((By.XPATH, "//li[@data-tid='pane-item']"))))

Both methods give me the same result, but the RAM growth is also the same: RAM grows indefinitely until the OS kills the process.

I analyzed the internal structure of these functions, but I did not find anything that could inflate the RAM. Another option would be find_elements_by_css_selector(), but internally it calls find_elements().

I also tried different combinations with sleep(), but nothing helps; the RAM does not decrease.

Can you please explain what is actually happening? I do not understand why the RAM consumption keeps increasing.

Is there another method of extracting the elements that does not consume RAM like this?

Andrew

2 Answers


Selenium's find_elements() method by no means should consume that much RAM. Most probably it is the browsing context, i.e. the browser itself, which consumes more RAM while you scrollIntoView(), in case the <li> items get updated through JavaScript or AJAX.

Without any visibility into the DOM tree it would be difficult to predict the actual reason or a remediation. However, a similar discussion suggests using some waits in terms of time.sleep(n).
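
As an illustration only, such a wait could also be phrased as an explicit condition on the scroll height instead of a fixed sleep (a sketch reusing the viewport selector from the question; wait_for_height_change is a hypothetical helper, not a Selenium API):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

VIEWPORT_HEIGHT_JS = "return document.querySelector('div[data-tid=\"pane-list-viewport\"]').scrollHeight;"

def wait_for_height_change(driver, old_height, timeout=20):
    # Block until the viewport's scrollHeight differs from old_height,
    # then return the new height; on timeout, return the old height so
    # the caller can treat "no change" as "all messages loaded".
    try:
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script(VIEWPORT_HEIGHT_JS) != old_height)
    except TimeoutException:
        pass
    return driver.execute_script(VIEWPORT_HEIGHT_JS)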

undetected Selenium
  • Hello @undetected Selenium, thanks for the reply. In my case I use Microsoft Edge. If I comment out the line with `find_elements()`, `scrollIntoView()` works, but the RAM consumption does not increase at all, because the DOM structure does not grow, it is static. The `<ul>` list of `<li>` items is only updated from the server.
    – Andrew Mar 03 '22 at 06:27
  • https://prnt.sc/l7fj8PFQsLOM , here is a screenshot of the `<ul>` list; at each scroll it is only updated from the server.
    – Andrew Mar 03 '22 at 06:36
  • _The `<ul>` list of `<li>` items is only updated from the server_: Definitely JavaScript/AJAX is in play.
    – undetected Selenium Mar 03 '22 at 08:26
  • Yes, you are right. But without `find_elements()`, scrolling doesn't affect RAM at all. – Andrew Mar 03 '22 at 08:53

Try getting just what you need instead of the full element:

lis = driver.execute_script("""
  return [...document.querySelectorAll('li[data-tid="pane-item"]')].map(li => li.innerText)
""")

I can't tell what you're doing with them, but if you're adding elements to a big array, and there are enough of them, you will hit a RAM limit.
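
If you need the markup rather than the text, the same trick works with innerHTML; the browser then hands back plain strings, and no WebElement handles stay alive on the Python side (a sketch only; parsing with BeautifulSoup is assumed from the comments below):

from bs4 import BeautifulSoup

html_chunks = driver.execute_script("""
  return [...document.querySelectorAll('li[data-tid="pane-item"]')].map(li => li.innerHTML)
""")

for chunk in html_chunks:
    soup = BeautifulSoup(chunk, "html.parser")
    # extract what you need here, then let soup and chunk go out of scope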

pguardiario
  • Hello @pguardiario, thanks for the reply. I tried a similar combination: `li_elements = self._driver.execute_script("return [...document.querySelectorAll('li[data-tid=\"chat-pane-item\"] div[class*=\"ui-chat__item__message\"]')].map(div => div.innerHTML);")` – Andrew Mar 03 '22 at 08:48
  • Now I get the same result, but the RAM grows much more slowly, and much less than it did before. At the same point in time the RAM difference is about 600-700 MB. Very strange how that changed things, but it's much better. – Andrew Mar 03 '22 at 08:49
  • I need innerHTML here, because the internal structure is large and then I process it with BeautifulSoup4 – Andrew Mar 03 '22 at 08:52
  • Yeah, that sounds right. If I could see the html I could guess what you need and make a suggestion that's more efficient than parsing html with beautiful soup – pguardiario Mar 03 '22 at 10:55
  • Also, you could just make sure that your object is getting garbage collected inside your while loop, which it should be, but I think you didn't share the full code. – pguardiario Mar 03 '22 at 11:02
  • Hello @pguardiario, I've done a lot of testing, but RAM still fills up slowly. Here is the structure: https://prnt.sc/LDmA-U3Y2b_X. Each element has a large internal structure, and all of it is needed. – Andrew Mar 11 '22 at 19:12
  • Can the garbage collector release li_elements at runtime? I also found a browser option, "Use hardware acceleration (if available)"; this significantly reduced the RAM consumption. – Andrew Mar 11 '22 at 19:17
  • It won't get garbage collected if there's a reference to it in scope – pguardiario Mar 12 '22 at 00:16
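
To illustrate that last point with the loop from the question: rebinding or deleting the name is what lets a batch of wrappers be collected (process() below is a hypothetical placeholder for the real extraction step):

while not all_msgs_loaded:
    # rebinding li_elements drops the previous batch of WebElement
    # wrappers, making them eligible for garbage collection
    li_elements = driver.find_elements(By.XPATH, "//li[@data-tid='pane-item']")
    process(li_elements)  # placeholder for whatever is done with the elements
    # drop the reference before sleeping, so this pass's wrappers
    # can be collected right away
    del li_elements
    ...  # scroll, sleep, height check as in the question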