44

How do I get the visible text of a web page with Selenium WebDriver, without the HTML tags?

I need something equivalent to HtmlPage.asText() from HtmlUnit.

Taking the source with WebDriver.getPageSource() and parsing it with jsoup is not enough, because the page may contain elements hidden by external CSS, and I am not interested in those.

Zoe
David Michael Gang
  • If you use firefox you can take a screenshot. If you need to actually have the text are you sure you need everything that is visible? Normally when I have to scrape something I only care about a few elements on the page. Also take a look at http://stackoverflow.com/questions/2646195/how-to-check-if-an-element-is-visible-with-webdriver – Joseph Helfert Aug 20 '13 at 15:42

3 Answers

50

Select the body element with By.tagName("body") (or use another selector that matches the top-level element), then call getText() on that element. It returns all of the visible text.

Thunderforge
Nathan Merrill
13

I can help you with C# Selenium.

The code below reads all of the visible text on the page and saves it to a text file at a location of your choice.

Make sure you have these using directives:

using System;
using System.IO;
using System.Text;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

Once the driver has loaded the page, run this code:

// Grab the body element once and read its visible text
IWebElement body = driver.FindElement(By.TagName("body"));
var result = body.Text;

// Folder location; the fixed date format avoids slashes in the folder name
var dir = @"C:\Textfile" + DateTime.Now.ToString("yyyy-MM-dd");

// If the folder doesn't exist, create it
if (!Directory.Exists(dir))
    Directory.CreateDirectory(dir);

// Appends the page text to Copiedtext.txt, creating the file if needed
File.AppendAllText(Path.Combine(dir, "Copiedtext.txt"), result);
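
Since the question is about the Java bindings, here is a rough Java sketch of the same file-saving step. PageTextSaver and its save method are hypothetical names for illustration, not part of Selenium. The pageText argument is assumed to be the string returned by driver.findElement(By.tagName("body")).getText(). The folder name uses an ISO date, on the assumption that a locale-dependent date string could contain slashes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of Selenium): appends page text to
// Copiedtext.txt inside a folder named after the given date.
final class PageTextSaver {

    static Path save(Path baseDir, String pageText, LocalDate date) throws IOException {
        // ISO dates such as 2020-01-03 contain no slashes, so they are safe in a folder name
        Path dir = baseDir.resolve("Textfile" + date.format(DateTimeFormatter.ISO_LOCAL_DATE));

        // Creates the folder and any missing parents; does nothing if it already exists
        Files.createDirectories(dir);

        // Append to the file, creating it on first use
        Path file = dir.resolve("Copiedtext.txt");
        Files.write(file, pageText.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return file;
    }
}
```

A call might look like PageTextSaver.save(Paths.get("C:\\"), text, LocalDate.now()), where text holds the body text read from the driver and Paths comes from java.nio.file.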
Cellcon
Anuraj S.L
  • Man, people are mean. Why was this downvoted? Cause the person that answered added a lil' extra code to save what was captured to a textfile? It has all the same code as the ones that answered above. – IamBatman Oct 26 '16 at 20:57
7

I'm not sure what language you're using, but in C# the IWebElement object has a .Text property. It returns the visible text rendered between the element's opening and closing tags, including the text of its child elements.

I would create an IWebElement using XPath to grab the entire page. In other words, you're grabbing the body element and looking at the text in it.

string pageText = driver.FindElement(By.XPath("//html/body")).Text;

If you are using the Java bindings for Selenium, the equivalent is:

String yourText = driver.findElement(By.tagName("body")).getText();
Cellcon
Brantley Blanchard
  • I solved it with the command driver.findElement(By.tagName("body")).getText() – David Michael Gang Aug 21 '13 at 06:27
  • perfect. That looks to be the java equivalent to the C# code above. The key is to grab the body not html tag for efficiency. I tend to use XPath because of how easy it is to get xpath in Chrome but you can use By.cssSelector("body") or the By.tagName("body") as you used. They all select the same element. – Brantley Blanchard Aug 21 '13 at 13:48
  • "//html/body/" - this XPath is not valid due to the extra "/" at the end. The correct code is: string pageText = driver.FindElement(By.XPath("//html/body")).Text; – G. Victor Jan 03 '20 at 06:59