I'm writing a script that rewrites some proper names in a web story so that my reading tool pronounces them correctly.

I get the content of a webpage via

$webpage = (Invoke-WebRequest -URI 'https://wanderinginn.com/2018/03/20/4-20-e/').Content

This $webpage should be of type String.

Now

$webpage.IndexOf('<div class="entry-content">')

returns the correct value, yet

$webpage.IndexOf("Previous Chapter")

returns an unexpected value, and I need an explanation of why, or of how I can find the error myself.

In theory, the script should cut out the "body" of the page, run it through a list of proper nouns I want to replace, and write the result to an .htm file. It all works, except that the value of IndexOf("Prev...") is not what I expect.

Edit: After Invoke-WebRequest, I can run

Set-Clipboard $webpage

and paste the result into Notepad++, where I can find both 'div class="entry-content"' and 'Previous Chapter'. If I do something like

Set-Clipboard $webpage.substring(
     $webpage.IndexOf('<div class="entry-content">'),
     $webpage.IndexOf('Previous Chapter')
   )

I would expect PowerShell to find the first occurrence of each of those strings and cut between them. My clipboard should therefore contain the desired content, yet the extracted string extends beyond the first occurrence of 'Previous Chapter'.

mklement0
J Lee
  • What unexpected value do you get? What value do you expect instead, and why? – user4003407 Mar 14 '19 at 15:10
  • This works for me just fine: `(Invoke-WebRequest -URI 'https://wanderinginn.com/2018/03/20/4-20-e/').Content.indexof('Previous Chapter')`, which gets me 87859. What is wrong with that? Are you expecting a line number as opposed to a character number? – Matt Mar 14 '19 at 15:19
  • `IndexOf()` simply returns the integer index of the requested string. You need to use that information to cut out what you need. – Matt Mar 14 '19 at 15:28
  • The `.SubString()` method takes `StartIndex` alone OR `StartIndex`, `Length`. You are giving it two start index numbers. You need to set the 2nd number to the _difference between the two index values_. [*grin*] – Lee_Dailey Mar 14 '19 at 15:40
  • Oh man, I'm a putz. Thank you! I think I made it harder for myself because Notepad++'s find always showed different character counts vs. PowerShell. – J Lee Mar 14 '19 at 15:44

1 Answer

tl;dr

  • You had a misconception about how the String.Substring() method works: the second argument must be the length of the substring to extract, not the end index (character position) - see below.

  • As an alternative, you can use a more concise (albeit more complex) regex operation with -replace to extract the substring of interest in a single operation - see below.

  • Overall, it's better to use an HTML parser to extract the desired information, because string processing is brittle (HTML allows variations in whitespace, quoting style, ...).


As Lee_Dailey points out, you had a misconception about how the String.Substring() method works: its arguments are:

  • a starting index (0-based character position),
  • from which a substring of a given length should be returned.

Instead, you tried to pass another index as the length argument.

To fix this, you must subtract the lower index from the higher one to obtain the length of the substring you want to extract.

A simplified example:

# Sample input from which to extract the substring 
#   '>>this up to here' 
# or, better,
#   'this up to here'.
$webpage = 'Return from >>this up to here<<'


# WRONG (your attempt): 
# *index* of 2nd substring is mistakenly used as the *length* of the
# substring to extract, which in this case even *breaks*, because a length
# that exceeds the bounds of the string is specified.
$webpage.Substring(
  $webpage.IndexOf('>>'),
  $webpage.IndexOf('<<')
)

# OK, extracts '>>this up to here'
# The difference between the two indices is the correct length
# of the substring to extract.
$webpage.Substring(
  ($firstIndex = $webpage.IndexOf('>>')),
  $webpage.IndexOf('<<') - $firstIndex
)

# BETTER, extracts 'this up to here'
$startDelimiter = '>>'
$endDelimiter = '<<'
$webpage.Substring(
  ($firstIndex = $webpage.IndexOf($startDelimiter) + $startDelimiter.Length),
  $webpage.IndexOf($endDelimiter) - $firstIndex
)
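
Applied to the delimiters from the question, the same pattern might look like the sketch below. The sample string is a made-up stand-in for the downloaded page, and the -1 checks and the ordinal comparison are additions of this sketch, not part of the original script. Searching for the end delimiter only from the end of the start delimiter onward avoids matching an earlier occurrence.

```powershell
# Made-up stand-in for the page content (assumption: the real page has the same two markers).
$webpage = "<p>intro</p><div class=`"entry-content`">Chapter text.<a>Previous Chapter</a></div>"

$startDelimiter = '<div class="entry-content">'
$endDelimiter   = 'Previous Chapter'
$ordinal        = [StringComparison]::Ordinal   # exact, culture-independent matching

$body = $null
$startIndex = $webpage.IndexOf($startDelimiter, $ordinal)
if ($startIndex -ge 0) {
  # Skip past the start delimiter itself.
  $contentStart = $startIndex + $startDelimiter.Length
  # Look for the end delimiter only after the start delimiter.
  $endIndex = $webpage.IndexOf($endDelimiter, $contentStart, $ordinal)
  if ($endIndex -ge 0) {
    # Second argument is a *length*: end index minus start index.
    $body = $webpage.Substring($contentStart, $endIndex - $contentStart)
  }
}

$body   # 'Chapter text.<a>', or $null if either delimiter is missing
```

IndexOf() returns -1 when the search string isn't found, so checking for that before calling .Substring() avoids passing a negative index or length.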

General caveats re .Substring():

In the following cases this .NET method throws an exception, which PowerShell surfaces as a statement-terminating error; that is, by default the statement itself is terminated, but execution continues:

  • If you specify an index that is outside the bounds of the string (a 0-based character position less than 0 or greater than the length of the string; an index equal to the length is valid and yields the empty string):

      'abc'.Substring(4) # ERROR "startIndex cannot be larger than length of string"
    
  • If you specify a length whose endpoint would fall outside the bounds of the string (if the index plus the length yields an index that is greater than the length of the string).

      'abc'.Substring(1, 3) # ERROR "Index and length must refer to a location within the string"
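
If you would rather not risk these errors, one option is to validate the arguments first. The following is only a sketch: Get-SafeSubstring is a hypothetical helper name, not a built-in command, and returning $null or clamping the length is a design choice you may not want.

```powershell
# Hypothetical helper (not part of .NET or PowerShell): returns $null instead of
# throwing when the arguments are out of bounds, and clamps an overlong length.
function Get-SafeSubstring {
  param(
    [string] $Text,
    [int] $StartIndex,
    [int] $Length
  )
  if ($StartIndex -lt 0 -or $StartIndex -gt $Text.Length -or $Length -lt 0) {
    return $null
  }
  # Number of characters available from the start index to the end of the string.
  $available = $Text.Length - $StartIndex
  $Text.Substring($StartIndex, [Math]::Min($Length, $available))
}

$inBounds   = Get-SafeSubstring -Text 'abc' -StartIndex 1 -Length 3   # length clamped: 'bc'
$outOfRange = Get-SafeSubstring -Text 'abc' -StartIndex 4 -Length 1   # $null
```

Silently clamping can hide bugs such as the one in the question, so throwing may still be the better behavior while debugging.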
    

That said, you could use a single regex (regular expression) to extract the substring of interest, via the -replace operator:

$webpage = 'Return from >>this up to here<<'

# Outputs 'this up to here'
$webpage -replace '^.*?>>(.*?)<<.*', '$1'

The key is to have the regex match the entire string and extract the substring of interest via a capture group ((...)) whose value ($1) can then be used as the replacement string, effectively returning just that.

For more information about -replace, see this answer.
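
One thing to watch for: when the regex doesn't match at all, -replace returns the input string unchanged, so a missing delimiter can go unnoticed. As a hedged alternative sketch, the -match operator reports whether there was a match and exposes the capture group via the automatic $Matches variable:

```powershell
$webpage = 'Return from >>this up to here<<'

# No match: -replace hands back the input as-is.
$unchanged = 'no delimiters here' -replace '^.*?>>(.*?)<<.*', '$1'

# -match returns $true or $false; on success, $Matches[1] holds the first capture group.
$extracted = if ($webpage -match '>>(.*?)<<') { $Matches[1] } else { $null }

$extracted   # 'this up to here'
```

Because -match doesn't need to consume the whole string, the leading ^.*? and trailing .* aren't required here.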

Note: In your specific case an additional tweak is needed, because you're dealing with a multiline string:

$webpage -replace '(?s).*?<div class="entry-content">(.*?)Previous Chapter.*', '$1'

  • The inline option s ((?s)) ensures that the metacharacter . also matches newline characters (so that .* matches across lines), which it doesn't do by default.

  • Note that you may have to escape the search strings you embed in the regex, if they happen to contain regex metacharacters (characters with special meaning in the context of a regex):

    • With embedded literal strings, \-escape characters as needed; e.g., escape .txt as \.txt

    • If a string to embed comes from a variable, apply [regex]::Escape() to its value first; e.g.:

          $var = '.txt'
          # [regex]::Escape() yields '\.txt', which ensures 
          # that '.txt' doesn't also match '_txt'
          'a_txt a.txt' -replace ('a' + [regex]::Escape($var)), 'a.csv'
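
Putting the two points together for the page in the question, a sketch could build the regex from escaped delimiter variables. The multiline sample string below is invented for illustration, and the variable names are assumptions:

```powershell
# Invented multiline stand-in for the page content.
$webpage = "<html><body>`n<div class=`"entry-content`">`nLine one.`nLine two.`nPrevious Chapter</div>`n</body></html>"

$startDelimiter = '<div class="entry-content">'
$endDelimiter   = 'Previous Chapter'

# (?s) lets . match newlines; [regex]::Escape() makes the delimiters match literally.
$regex = '(?s).*?' + [regex]::Escape($startDelimiter) + '(.*?)' + [regex]::Escape($endDelimiter) + '.*'

$body = $webpage -replace $regex, '$1'
$body   # the two text lines, including the surrounding newlines
```

The captured text keeps the newlines that directly follow the start delimiter and precede the end delimiter; call .Trim() on the result if you don't want them.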
      
mklement0