I have 1000 URLs. I need a tool that takes my URLs and exports all the text that appears on those pages. I need the text that is shown on the webpages, not the underlying HTML code.

Do you know any software or way to do it?

Dharman
TNT2631

1 Answer

Save this as a bat file (e.g. innerTextGet.bat):

@if (@X)==(@Y) @end /* JScript comment 
        @echo off 

        cscript //E:JScript //nologo "%~f0" %* 
        ::pause
        exit /b %errorlevel% 

@if (@X)==(@Y) @end JScript comment */ 


var link = WScript.Arguments.Item(0);    // URL to fetch
var saveTo = WScript.Arguments.Item(1);  // output file for the extracted text


var IE = new ActiveXObject("InternetExplorer.Application"); 
IE.Visible=false;
IE.Navigate2(link);

function sleep(milliseconds) {
  // busy-wait, as JScript has no native sleep
  var start = new Date().getTime();
  while ((new Date().getTime() - start) < milliseconds) {}
}

var counter=0;
while (IE.Busy && counter<60*60*10) {
    //WScript.Echo(IE.Busy);
    sleep(1000);
    counter++;
}

if (IE.Busy) {
    WScript.Echo("Can't wait forever");
    WScript.Quit(10);
}

function writeContent(file,content) {
        var ado = WScript.CreateObject("ADODB.Stream");
        ado.Type = 2;  // adTypeText = 2
        ado.CharSet = "iso-8859-1";  // right code page for output (no adjustments)
        //ado.Mode=2;
        ado.Open();

        ado.WriteText(content);
        ado.SaveToFile(file, 2);
        ado.Close();    
}

var innerText=IE.document.body.innerText;
IE.Quit();
writeContent(saveTo,innerText);

And use it like:

call innerTextGet.bat "https://stackoverflow.com/questions/46611374/save-texts-on-webpages-1000-pages"  result.txt

It is not fail-safe: it does not check whether the result file already exists, whether the parameters are passed correctly, and so on, but it works at least. It again uses the innerText property of the InternetExplorer.Application object, as proposed by @omegastripes, though I prefer JScript because it is easier to plug into a batch file.

As you gave no information about where the links are stored, I assume you know how to read and iterate through them.
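For example, if the URLs are kept in a plain text file with one URL per line (urls.txt here is an assumed name), a minimal batch loop that calls the script once per URL and writes numbered result files might look like this:

    @echo off
    setlocal enabledelayedexpansion
    set counter=0

    :: urls.txt is assumed to contain one URL per line
    for /f "usebackq delims=" %%u in ("urls.txt") do (
        set /a counter+=1
        call innerTextGet.bat "%%u" "result_!counter!.txt"
    )

This is only a sketch: it does no error handling, and a failed page will still consume a result number.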

marc_s
npocmaka