Get visible text of page

Question

How do I get the visible text of a web page with Selenium WebDriver, without the HTML tags?

I need something equivalent to the function HtmlPage.asText() from HtmlUnit.

It is not enough to take the page source with WebDriver.getPageSource() and parse it with jsoup, because the page may contain elements hidden by external CSS, and I am not interested in those.

If you use Firefox, you can take a screenshot. If you actually need the text, are you sure you need everything that is visible? Normally, when I have to scrape something, I only care about a few elements on the page. Also take a look at http://stackoverflow.com/questions/2646195/how-to-check-if-an-element-is-visible-with-webdriver – Joseph Helfert, Aug 20 '13 at 15:42

score 50 · Accepted Answer · edited Dec 08 '14 at 20:52

Doing By.tagName("body") (or some other selector to select the top element), then performing getText() on that element will return all of the visible text.

edited Dec 08 '14 at 20:52 – Thunderforge

answered Aug 20 '13 at 14:57 – Nathan Merrill

What kind of object is "By"? – User Feb 16 '14 at 23:35
@macdonjo It's the way that Selenium separates their selectors. `driver.findElement(By.selectorType("selector"))` http://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/By.html – Nathan Merrill Feb 17 '14 at 14:14
Oh, I figured it out. I'm using Python and that's the Java syntax. Thanks! – User Feb 17 '14 at 15:43
In Python, the getText() method does not exist. Instead, we should use element.text – Iching Chang Apr 21 '17 at 07:56

score 13 · Answer 2 · edited Nov 09 '18 at 10:00

I can help you with C# Selenium.

By using this you can select all the text on that particular page and save it to a text file at your preferred location.

Make sure you have these using directives:

using System;
using System.IO;
using System.Text;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

After navigating to the page, try this code:

// Read the visible text of the whole page from the <body> element
IWebElement body = driver.FindElement(By.TagName("body"));
var result = body.Text;

// Folder location; a fixed date format is used because ToShortDateString()
// can contain '/' in some cultures
var dir = Path.Combine(@"C:\Textfile", DateTime.Now.ToString("yyyy-MM-dd"));

// If the folder doesn't exist, create it
if (!Directory.Exists(dir))
    Directory.CreateDirectory(dir);

// Appends the page text to Copiedtext.txt, creating the file if needed
File.AppendAllText(Path.Combine(dir, "Copiedtext.txt"), result);
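
For anyone on the Java bindings, the save step from the C# code above could look roughly like this. It is only a sketch: it assumes the page text has already been read (for example with getText() on the body element), it uses an ISO date for the folder name rather than a culture-dependent short date, and the names PageTextSaver and appendToDatedFolder are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PageTextSaver {

    // Appends pageText to <baseDir>/<yyyy-MM-dd>/Copiedtext.txt and returns the file path.
    // Hypothetical helper; the folder layout mirrors the C# answer.
    public static Path appendToDatedFolder(Path baseDir, String pageText, LocalDate date)
            throws IOException {
        // ISO_LOCAL_DATE gives e.g. 2020-01-03, which is safe in a folder name
        Path dir = baseDir.resolve(date.format(DateTimeFormatter.ISO_LOCAL_DATE));

        // Creates the folder (and any missing parents); does nothing if it already exists
        Files.createDirectories(dir);

        Path file = dir.resolve("Copiedtext.txt");
        Files.write(file,
                pageText.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // A temp folder stands in for the real target location
        Path base = Files.createTempDirectory("Textfile");
        Path written = appendToDatedFolder(base, "Example visible text\n", LocalDate.now());
        System.out.println("Wrote " + written);
    }
}
```

In real use, the string passed in would be whatever driver.findElement(By.tagName("body")).getText() returned.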

Man, people are mean. Why was this downvoted? Cause the person that answered added a lil' extra code to save what was captured to a textfile? It has all the same code as the ones that answered above. — IamBatman, Oct 26 '16 at 20:57

score 7 · Answer 3 · edited Nov 09 '18 at 12:58

I'm not sure what language you're using, but in C# the IWebElement object has a .Text property. It returns the visible text between the element's opening and closing tags.

I would create an IWebElement using XPath to grab the entire page. In other words, you're grabbing the body element and looking at the text in it.

string pageText = driver.FindElement(By.XPath("//html/body")).Text;

If you are using the Java bindings rather than C#, the equivalent is:

String yourtext = driver.findElement(By.tagName("body")).getText();

edited Nov 09 '18 at 12:58 – Cellcon

answered Aug 20 '13 at 19:26 – Brantley Blanchard

I solved it with the command driver.findElement(By.tagName("body")).getText() – David Michael Gang Aug 21 '13 at 06:27
Perfect. That looks to be the Java equivalent of the C# code above. The key is to grab the body rather than the html tag, for efficiency. I tend to use XPath because of how easy it is to get an XPath in Chrome, but you can use By.cssSelector("body") or By.tagName("body") as you did. They all select the same element. – Brantley Blanchard Aug 21 '13 at 13:48
"//html/body/" - this XPath is not valid due to the extra "/" at the end. The correct code is: string pageText = driver.FindElement(By.XPath("//html/body")).Text; – G. Victor Jan 03 '20 at 06:59

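The trailing-slash point from the last comment can be checked without starting a browser. This is a minimal sketch that assumes the JDK's built-in javax.xml.xpath parser is a fair stand-in for the browser's XPath engine, which is what Selenium actually uses to evaluate By.xpath locators. The class and method names (XPathSyntaxCheck, isValidXPath) are hypothetical.

```java
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

public class XPathSyntaxCheck {

    // Returns true if the JDK's XPath 1.0 parser accepts the expression.
    // This checks syntax only; it does not evaluate anything against a page.
    public static boolean isValidXPath(String expression) {
        try {
            XPathFactory.newInstance().newXPath().compile(expression);
            return true;
        } catch (XPathExpressionException e) {
            // Thrown when the expression cannot be parsed
            return false;
        }
    }

    public static void main(String[] args) {
        // The first candidate ends with "/" and no following step; the second does not
        for (String candidate : new String[] {"//html/body/", "//html/body"}) {
            System.out.println(candidate + " -> "
                    + (isValidXPath(candidate) ? "parses" : "syntax error"));
        }
    }
}
```

A "/" in XPath has to be followed by another location step, which is why the version ending in "/" is rejected while "//html/body" is accepted.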