tl;dr
You had a misconception about how the String.Substring() method works: the second argument must be the length of the substring to extract, not the end index (character position) - see below.
As an alternative, you can use a more concise (albeit more complex) regex operation with -replace to extract the substring of interest in a single operation - see below.
Overall, it's better to use an HTML parser to extract the desired information, because string processing is brittle (HTML allows variations in whitespace, quoting style, ...).
As Lee_Dailey points out, you had a misconception about how the String.Substring() method works: its arguments are:
- a starting index (0-based character position),
- a length, i.e. the number of characters to return, starting at that index.
Instead, you tried to pass another index as the length argument.
To fix this, subtract the lower index from the higher one to obtain the length of the substring you want to extract. A simplified example:
# Sample input from which to extract the substring
# '>>this up to here'
# or, better,
# 'this up to here'.
$webpage = 'Return from >>this up to here<<'
# WRONG (your attempt):
# The *index* of the 2nd substring is mistakenly used as the *length* of the
# substring to extract, which in this case even *breaks*, because a length
# that exceeds the bounds of the string is specified.
$webpage.Substring(
$webpage.IndexOf('>>'),
$webpage.IndexOf('<<')
)
# OK, extracts '>>this up to here'
# The difference between the two indices is the correct length
# of the substring to extract.
$webpage.Substring(
($firstIndex = $webpage.IndexOf('>>')),
$webpage.IndexOf('<<') - $firstIndex
)
# BETTER, extracts 'this up to here'
$startDelimiter = '>>'
$endDelimiter = '<<'
$webpage.Substring(
($firstIndex = $webpage.IndexOf($startDelimiter) + $startDelimiter.Length),
$webpage.IndexOf($endDelimiter) - $firstIndex
)
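Note that .IndexOf() returns -1 if a delimiter isn't found, in which case the calls above either throw or return the wrong text. A minimal sketch of a guarded version follows; the function name Get-TextBetween and its parameter names are hypothetical, not built-ins:

```powershell
# Hypothetical helper: Get-TextBetween is an illustrative name, not a built-in command.
function Get-TextBetween {
  param(
    [Parameter(Mandatory)] [string] $Text,
    [Parameter(Mandatory)] [string] $StartDelimiter,
    [Parameter(Mandatory)] [string] $EndDelimiter
  )
  # .IndexOf() returns -1 if the delimiter isn't found.
  $start = $Text.IndexOf($StartDelimiter)
  if ($start -lt 0) { return }
  # Skip past the start delimiter itself.
  $start += $StartDelimiter.Length
  # Look for the end delimiter only *after* the start delimiter.
  $end = $Text.IndexOf($EndDelimiter, $start)
  if ($end -lt 0) { return }
  # Second argument is a *length*: the difference between the two indices.
  $Text.Substring($start, $end - $start)
}

Get-TextBetween 'Return from >>this up to here<<' '>>' '<<'  # 'this up to here'
```

If either delimiter is missing, the function produces no output, so the caller can test the result for $null.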
General caveats re .Substring():
In the following cases this .NET method throws an exception, which PowerShell surfaces as a statement-terminating error; that is, by default the statement itself is terminated, but execution continues:
If you specify an index that is outside the bounds of the string (a 0-based character position less than 0 or greater than the length of the string):
'abc'.Substring(4) # ERROR "startIndex cannot be larger than length of string"
If you specify a length whose endpoint would fall outside the bounds of the string (that is, if the index plus the length exceeds the length of the string):
'abc'.Substring(1, 3) # ERROR "Index and length must refer to a location within the string"
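If you'd rather handle such an error than let execution continue past it, you can wrap the call in try / catch, which catches statement-terminating errors. A brief sketch, in which the fallback (returning the rest of the string) is just one possible choice:

```powershell
$str = 'abc'
$result = try {
  # Index 1 + length 3 = 4, which exceeds the string's length (3), so this throws.
  $str.Substring(1, 3)
} catch {
  # Fall back to everything from the index to the end of the string.
  $str.Substring(1)
}
$result  # 'bc'
```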
That said, you could use a single regex (regular expression) to extract the substring of interest, via the -replace operator:
$webpage = 'Return from >>this up to here<<'
# Outputs 'this up to here'
$webpage -replace '^.*?>>(.*?)<<.*', '$1'
The key is to have the regex match the entire string and extract the substring of interest via a capture group ((...)) whose value ($1) can then be used as the replacement string, effectively returning just that.
For more information about -replace, see this answer.
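One caveat with the -replace approach: if the regex doesn't match, -replace returns the input string unchanged rather than signaling failure. A sketch of an alternative that uses the -match operator, which reports success and stores capture-group values in the automatic $Matches variable:

```powershell
$webpage = 'Return from >>this up to here<<'

# No match: -replace passes the input through unchanged.
'no delimiters here' -replace '^.*?>>(.*?)<<.*', '$1'  # 'no delimiters here'

# -match returns $true or $false for a single input string
# and populates $Matches only on success.
if ($webpage -match '>>(.*?)<<') {
  $Matches[1]  # 'this up to here'
}
```

With -match, the regex no longer needs to match the entire string, because the value of the capture group is read directly.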
Note: In your specific case an additional tweak is needed, because you're dealing with a multiline string:
$webpage -replace '(?s).*?<div class="entry-content">(.*?)Previous Chapter.*', '$1'
Inline option ((?...)) s ensures that metacharacter . also matches newline characters (so that .* matches across lines), which it doesn't by default.
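To illustrate the effect of the s option, here is a small sketch with a made-up multiline string standing in for the downloaded page (the sample content is an assumption, not your actual HTML):

```powershell
# Hypothetical multiline input, standing in for the downloaded page.
$webpage = @'
<html>
<div class="entry-content">
Chapter text,
spanning lines.
Previous Chapter
</html>
'@

# Without (?s), . doesn't match newlines, so the regex fails to match
# and -replace returns the input unchanged.
$withoutS = $webpage -replace '.*?<div class="entry-content">(.*?)Previous Chapter.*', '$1'
$withoutS -eq $webpage  # $true

# With (?s), the capture group spans lines; .Trim() strips the surrounding newlines.
$content = ($webpage -replace '(?s).*?<div class="entry-content">(.*?)Previous Chapter.*', '$1').Trim()
$content  # the two lines between the <div> tag and 'Previous Chapter'
```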
Note that you may have to apply escaping to the search strings to embed in the regex, if they happen to contain regex metacharacters (characters with special meaning in the context of a regex):
With embedded literal strings, \-escape characters as needed; e.g., escape .txt as \.txt
If a string to embed comes from a variable, apply [regex]::Escape() to its value first; e.g.:
$var = '.txt'
# [regex]::Escape() yields '\.txt', which ensures
# that '.txt' doesn't also match '_txt'
'a_txt a.txt' -replace ('a' + [regex]::Escape($var)), 'a.csv'
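The same technique applies to the extraction regex itself when the delimiters are stored in variables. A sketch with made-up input and delimiters that contain a regex metacharacter:

```powershell
# Hypothetical input and delimiters, for illustration only.
$webpage = 'Price: [[9.99]] (incl. tax)'
$startDelimiter = '[['   # [ is a regex metacharacter
$endDelimiter = ']]'

# Escape each delimiter before embedding it in the regex.
$regex = '(?s)^.*?' + [regex]::Escape($startDelimiter) + '(.*?)' + [regex]::Escape($endDelimiter) + '.*'
$extracted = $webpage -replace $regex, '$1'
$extracted  # '9.99'
```

Without the escaping, the unescaped [[ would be parsed as the start of a character class rather than as literal text.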