
I have a huge CSV of phishing links, and I'm trying to trim off the leading https://www. and anything after .com, .edu, etc. The ideal output of the PowerShell script would be a long list of URLs that all look something like google.com or microsoft.com. So far I have imported the CSV, but everything I have tried either doesn't work or leaves the www at the beginning. Any help would be great. The CSV I'm using is this: http://data.phishtank.com/data/online-valid.csv

$urls = Import-Csv -Path .\online-valid.csv | select -ExpandProperty "url"
Loaf7
    run this `[URI]'http://www.phishtank.com/phish_detail.php?phish_id=6429209'` and you're half there. ;-) – Olaf Mar 03 '20 at 02:44

2 Answers


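The code below reads your CSV, adds a Host property to each row containing the URL's host name with any leading www. removed, and exports the result to a new CSV. Have a play around with [Uri]; it is very useful when parsing web links.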

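# Read every row of the CSV into memory
$csv = Import-Csv C:\temp\verified_online.csv

foreach ($site in $csv) {
    # Cast the URL to [Uri], take its host name, and strip a leading 'www.'
    $site | Add-Member -MemberType NoteProperty -Name "Host" -Value $(([Uri]$site.url).Host -replace '^www\.')
}

# Write the rows, now including the Host column, to a new file
$csv | Export-Csv C:\temp\verified_online2.csv -NoTypeInformation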

Adjusted based on recommendation from Mklement0.
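
To see what the [Uri] cast exposes, here is a minimal sketch using the URL from Olaf's comment on the question. The variable name $u is only for illustration; the properties shown are standard System.Uri members.

```powershell
# Cast a URL string to System.Uri and inspect its parts
$u = [uri] 'http://www.phishtank.com/phish_detail.php?phish_id=6429209'

$u.Scheme        # http
$u.Host          # www.phishtank.com
$u.AbsolutePath  # /phish_detail.php
$u.Query         # ?phish_id=6429209
```

Only .Host is needed for this task; the path and query string are dropped simply by not using them.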

Drew

A concise and fast alternative to Drew's helpful answer, based on casting the URL strings directly to an array of [uri] (System.Uri) instances and then trimming the www. prefix, if present, from their .Host (server name) property:

([uri[]] (Import-Csv .\online-valid.csv).url).Host -replace '^www\.'

Note that the -replace operator is regex-based, and the regex ^www\. makes sure that www is only replaced at the start (^) of the string, and only if followed by a literal . (\.), in which case this prefix is removed (replaced with the implied empty string); if no such prefix is present, the input string is passed through as-is.
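
A quick way to check this behaviour is to run the operator against a few made-up host names. The sample values and variable names below are hypothetical and not taken from the PhishTank data:

```powershell
# Sample host names (hypothetical, for illustration only)
$hosts = 'www.google.com', 'microsoft.com', 'wwwexample.com', 'mail.www.example.org'

# -replace is applied to each array element; with no replacement operand,
# whatever the regex matches is replaced with the empty string
$trimmed = $hosts -replace '^www\.'

$trimmed
# google.com
# microsoft.com
# wwwexample.com
# mail.www.example.org
```

Only the first entry changes: the third has no literal dot after www, and in the fourth www. is not at the start of the string.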

The solution reads the entire CSV file into memory at once, for convenience and speed, and outputs just the trimmed server names, as an array of strings.
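
If holding the whole file in memory is a concern, a streaming variant is one possible alternative. This is a sketch, not part of the original solution: it builds a tiny sample CSV with hypothetical rows (the file name sample-urls.csv and the variable names are assumptions) so that it runs on its own, and it assumes a url column as in the question.

```powershell
# Build a tiny sample CSV (hypothetical rows) so the sketch is self-contained
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-urls.csv'
@'
phish_id,url
1,https://www.example.com/login.php?id=1
2,http://sub.example.org/a/b
'@ | Set-Content $sample

# Import-Csv emits rows one at a time, so each URL is converted as it is read
$result = Import-Csv $sample |
    ForEach-Object { ([uri] $_.url).Host -replace '^www\.' }

$result
# example.com
# sub.example.org
```

For a large input file, piping straight to Set-Content instead of assigning to $result avoids collecting the output in memory. This per-row pipeline is typically slower than the single-expression form above.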

mklement0