
I'm trying to read a list of all ZIP files on a web page and store them in a text file to download later. I can't use any third-party tools, since this needs to run on an ARM system as well as Windows 7, so built-in commands only. I'm using a batch script since it's basically universal in Windows.

I've started by getting the HTML of the website, which I got help with here: How can I find the source code for a website using only cmd?

That gives me the raw HTML, which I then filter with FINDSTR:

FINDSTR /I /C:.ZIP %~DP0FULLHTML.TXT>%~DP0ZIPLINES.TXT

The next step was to parse that file for the actual filenames, but I'm having difficulty because the web page uses a table to list the files, which results in several lines that are over 19k characters long. When I try to parse the file with a FOR loop, it simply ignores these lines. I cannot figure out how to shorten these lines or split them into shorter lines on some delimiter. I've even tried writing the PS1 file below, but I know basically nothing about PowerShell scripting and can't seem to get it to work.

[CmdletBinding()]
Param(
[Parameter(Mandatory=$True,Position=1)]
[string]$file,

[Parameter(Mandatory=$True,Position=2)]
[string]$newfile
)

$contents = Get-Content $file

foreach ($line in $contents)
{
    $splititems = $line.split("/")

    foreach ($line in $splititems)
    {
        $line | Out-File $newfile
    }   
}

I then try running it from the batch file:

Powershell -ExecutionPolicy Bypass -File "%~DP0SPLIT.PS1" "%~DP0ZIPLINES.TXT" "%~DP0SPLITLINES.TXT"

This gives me an error saying I'm missing a } at the end.

After searching this site a bit, I know that CMD limits a variable to 8191 characters, which those lines exceed, hence the failure. And I'm sure I'm just completely messing up the PS code.

After I can get these big lines split into smaller ones, I have some messy code already that works to get the file names into a single TXT file. Don't know if there's one easy step in PS to just grab all the .ZIP filenames and shove them in a file.

Ryan Miller
  • "*Don't know if there's one easy step in PS to just grab all the .ZIP filenames and shove them in a file.*" - pretty much, yes - `iwr https://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/ | % links |? OuterHTML -match 'TIFF' | % href | sc zips.txt` – TessellatingHeckler Oct 05 '17 at 04:04
  • I tried to copy and paste that into PowerShell, but I get this: The term 'iwr' is not recognized as the name of a cmdlet, function, script file, or operable program. – Ryan Miller Oct 05 '17 at 04:25
  • Did a little searching, looks like I can't do Invoke-WebRequest because it doesn't exist until PowerShell 3.0, Windows 7 has PowerShell 2.0. – Ryan Miller Oct 05 '17 at 04:30
  • https://stackoverflow.com/a/30207948/778560 – Aacini Oct 05 '17 at 05:55
  • Then put PowerShell 3, or 4, or 5.1 on it..? – TessellatingHeckler Oct 05 '17 at 07:29
  • Unfortunately, I need to go with what is default in Windows, no installations beyond what comes automatically by Windows Update. I'm going to try that SET /P method and see what comes of it, I'm worried it might break up the filenames into separate lines. – Ryan Miller Oct 05 '17 at 14:02
  • If you are willing to use the Batch + JScript approach to get the HTML, then that should also work for processing long lines. – lit Oct 05 '17 at 21:29

1 Answer


The comment from Aacini led me through a series of links that eventually brought me to this one: http://www.dostips.com/forum/viewtopic.php?f=3&t=6044

It's for a batch script called JREPL. I was able to run the following series of commands to leave me with a TXT file of only the links to the ZIP files:

CALL %~DP0JREPL.BAT "=" "\r\n" /X /L /F %~DP0FULLHTML.TXT /I /O %~DP0SPLITLINES.TXT
CALL %~DP0JREPL.BAT ">" "\r\n" /X /L /F %~DP0SPLITLINES.TXT /I /O -
FINDSTR /I /C:.ZIP %~DP0SPLITLINES.TXT>%~DP0ZIPFILES.TXT
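
For anyone who would rather stay with what ships in Windows 7, the same result can probably be reached with PowerShell 2.0 alone, because a regex run over the whole file does not care how long a line is. The following is only a minimal sketch, not something from the original thread: Get-ZipLinks and ZIPLINKS.PS1 are hypothetical names, and it assumes every link is written as href="...zip" in double quotes with nothing after the .zip extension.

```powershell
# Sketch only: assumes PowerShell 2.0 and links of the form href="....zip".
# Get-ZipLinks is a hypothetical helper name, not a built-in cmdlet.
function Get-ZipLinks {
    param([string]$Html)

    # Capture the quoted href value when it ends in .zip (case-insensitive).
    $pattern = 'href\s*=\s*"([^"]+\.zip)"'
    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase

    # The regex scans the whole string, so 19k-character lines are not a problem.
    foreach ($m in [regex]::Matches($Html, $pattern, $options)) {
        $m.Groups[1].Value
    }
}

# Only touch files when an input and an output path are supplied,
# e.g. through -File from the batch script. Full paths are assumed.
if ($args.Count -ge 2) {
    $html = [System.IO.File]::ReadAllText($args[0])
    # Set-Content writes ANSI text by default, which batch FOR /F can read.
    Get-ZipLinks $html | Set-Content -Path $args[1]
}
```

Saved next to the batch file under the assumed name ZIPLINKS.PS1, it would be called the same way as the script in the question:

Powershell -ExecutionPolicy Bypass -File "%~DP0ZIPLINKS.PS1" "%~DP0FULLHTML.TXT" "%~DP0ZIPFILES.TXT"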
Ryan Miller