0

I need to extract a value (dip) from an HTML file:

</span><span class="pron dpron">/<span class="ipa dipa lpr-2 lpl-1">buːm</span>/</span></span></div><div class="pos-body">

My code fails with: Microsoft JScript runtime error: Object doesn't support this property or method

@if (@CodeSection == @Batch) @then

@echo off
setlocal

curl https://dictionary.cambridge.org/de/worterbuch/englisch/boom >phoneme.html

set "htmlfile=phoneme.html"

rem // invoke JScript hybrid code and capture its output
for /f %%I in ('cscript /nologo /e:JScript "%~f0" "%htmlfile%"') do set "converted=%%I"

echo %converted%

rem // end main runtime
PAUSE
goto :EOF

@end // end batch / begin JScript chimera

var fso = WSH.CreateObject('scripting.filesystemobject'),
    DOM = WSH.CreateObject('htmlfile'),
    htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
    html = htmlfile.ReadAll();

DOM.write(html);
htmlfile.Close();

var scrape = DOM.getElementsByTagName('pron dpron').getElementsByClassName('ipa dipa lpr-2 lpl-1')[0].innerText;
WSH.Echo(scrape.match(/^.*=\s+(\S+).*$/)[0]);

I copied and pasted this and edited it slightly.

I need to get "buːm" into a variable or echoed.

Many thanks.

  • Using an HTML-parser, like [tag:xidel], would imo be much simpler: `xidel -s "https://dictionary.cambridge.org/de/worterbuch/englisch/boom" -e "(//span[@class='us dpron-i '])[1]/span/span[@class='ipa dipa lpr-2 lpl-1']"`. – Reino Mar 11 '23 at 12:58
  • @Reino: Thank you! xidel -s "https://dictionary.cambridge.org/de/worterbuch/englisch/boom" -e "(//span[@class='pron dpron']/span[@class='ipa dipa lpr-2 lpl-1'])[1]" did the trick, now I need to find a way to get the Unicode out of it. :) – Franklyn W. R. Tigier Mar 11 '23 at 16:54

3 Answers

1

You don't need JScript in order to extract such a value from the .html file; you can do it directly with a Batch file.

If the structure of the desired line is always the same:

<span class="pro">/<span class="dip">buːm</span>/</span>

... you can do it as simply as this:

for /F "tokens=3 delims=</>" %%a in ('findstr "\"dip\"" phoneme.html') do set "dip=%%a"
echo %dip%

If the line could change, first get the line with "dip" value via a findstr command, and then extract the dip value:

for /F "delims=" %%a in ('findstr "\"dip\"" phoneme.html') do set "html=%%a"
set "dip=%html:*"dip">=%"
set "dip=%dip:<=" & rem "%"
echo %dip%
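
The two `set` substitutions above can be read as "drop everything up to and including the marker, then cut at the first `<`". A minimal JavaScript sketch of that same logic, for illustration only; `extractAfterMarker` is a hypothetical name and not part of any script in this thread:

```javascript
// Hypothetical helper mirroring the two batch substitutions:
//   %html:*"dip">=%    -> drop everything up to and including the marker
//   %dip:<=" & rem "%  -> keep only the text before the first "<"
// Note: unlike the batch substitution, indexOf is case-sensitive, and a
// missing marker yields null here instead of leaving the string unchanged.
function extractAfterMarker(line, marker) {
    var start = line.indexOf(marker);
    if (start === -1) {
        return null; // marker not found
    }
    var rest = line.substring(start + marker.length);
    var end = rest.indexOf('<');
    return end === -1 ? rest : rest.substring(0, end);
}

var line = '<span class="pro">/<span class="dip">buːm</span>/</span>';
var dip = extractAfterMarker(line, '"dip">');
```

The same function should work with the longer marker `"ipa dipa lpr-2 lpl-1">`, since the marker is treated as a plain substring and spaces have no special meaning here.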

New code added

This new method is based on the OP's comments...

1- In your question you specified that you are looking for this string: "dip". However, from your comment it seems that the string you really want is this: "ipa dipa lpr-2 lpl-1". Please note that the second string is very different from the first one because it contains spaces, and most Batch commands are sensitive to spaces, so the code must be modified accordingly. BTW it is very bad "netiquette" to give us certain data, test the code we wrote with different data, and then say: "Your code not works"! Did you test our code with the data you provided?

2- In my answer I specified: "If the structure of the desired line is always the same:"

<span class="pro">/<span class="dip">buːm</span>/</span>

However, it seems that the real line is very different:

</span><span class="pron dpron">/<span class="ipa dipa lpr-2 lpl-1">buːm</span>/</span></span> <span class="us dpron-i "><span class="region dreg">us</span><span class="daud"> converted= /span span class="pron dpron" / span class="ipa dipa lpr-2 lpl-1" buːm /span / /span /span span class="us dpron-i " span class="region dreg" us /span span class="daud"

I added: "If the line could change..." use the second code.

Why did you test the first code if the real line is entirely different from the line you posted? You should use the second code instead... Helping becomes over-complicated when simple instructions are not followed...

3- In your comment you indicated that the html file is created with this line:

curl dictionary.cambridge.org/de/worterbuch/englisch/boom

When I tested such a line I got this:

<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

... but your complaint was: I just get: 'a href='https:'

I really don't know what else to say...


I prepared a test file with these contents:

Any other line...
</span><span class="pron dpron">/<span class="ipa dipa lpr-2 lpl-1">buːm</span>/</span></span> <span class="us dpron-i "><span class="region dreg">us</span><span class="daud"> converted= /span span class="pron dpron" / span class="ipa dipa lpr-2 lpl-1" buːm /span / /span /span span class="us dpron-i " span class="region dreg" us /span span class="daud"
Any other line...

This is the new code:

@echo off
setlocal EnableDelayedExpansion

REM  curl dictionary.cambridge.org/de/worterbuch/englisch/boom > phoneme.html

for /F "delims=" %%a in ('findstr /C:"\"ipa dipa lpr-2 lpl-1\"" phoneme.html') do set "html=%%a"
set "dip=%html:*"ipa dipa lpr-2 lpl-1">=%"
set "dip=%dip:<=" & rem "%"
echo %dip%

... and this is the output:

buːm

It seems that the output contains a Unicode character that, of course, cannot be properly managed by a Batch file... :(

PS - The Unicode character can be displayed properly if the chcp 65001 command is run first...
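
Why the code page matters: the downloaded page is UTF-8, and a console on a legacy code page decodes each byte as a separate character. A minimal sketch, assuming the console uses code page 850 (all names are hypothetical, and the table covers only the two bytes needed for this example), that reproduces the `bu╦Ém` reported in the comments on this page:

```javascript
// Partial CP850 table: only the two bytes needed here (assumption: the
// console is on code page 850, a common Western European default).
var CP850_SAMPLE = { 0xCB: '╦', 0x90: 'É' };

// Return the UTF-8 bytes of a string via encodeURIComponent's %XX escapes.
function utf8Bytes(text) {
    var encoded = encodeURIComponent(text);
    var bytes = [];
    for (var i = 0; i < encoded.length; i++) {
        if (encoded.charAt(i) === '%') {
            bytes.push(parseInt(encoded.substring(i + 1, i + 3), 16));
            i += 2; // skip the two hex digits just consumed
        } else {
            bytes.push(encoded.charCodeAt(i));
        }
    }
    return bytes;
}

// Decode bytes one at a time, the way a single-byte code page would.
function asCp850(bytes) {
    var out = '';
    for (var i = 0; i < bytes.length; i++) {
        var b = bytes[i];
        out += b < 0x80 ? String.fromCharCode(b) : (CP850_SAMPLE[b] || '?');
    }
    return out;
}

var garbled = asCp850(utf8Bytes('buːm'));
```

The character `ː` (U+02D0) is two bytes in UTF-8, 0xCB 0x90, which is why one character turns into two on screen. Switching the console to code page 65001 makes it interpret those bytes as UTF-8 instead.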

Aacini
  • Hmm - `set "firststring=%wholestring:uniquesubstring=" & set stringafter="%"` ... Good voodoo – Magoo Mar 08 '23 at 19:40
  • I just get: 'a href='https:' – Franklyn W. R. Tigier Mar 09 '23 at 10:30
  • '@echo off setlocal curl https://dictionary.cambridge.org/de/worterbuch/englisch/boom >factions_phoneme.html set "htmlfile=factions_phoneme.html" @ECHO OFF SETLOCAL for /F "tokens=3 delims=>" %%a in ('findstr "\"ipa dipa lpr-2 lpl-1\"" factions_phoneme.html') do set "dip=%%a" echo %dip% Pause GOTO :EOF' – Franklyn W. R. Tigier Mar 09 '23 at 10:32
  • @Magoo: Do you like this voodoo? I invite you to review [the whole story](https://www.dostips.com/forum/viewtopic.php?f=3&t=6429)... But beware! It could be dangerous... Really! **`;)`** – Aacini Mar 09 '23 at 17:54
  • @Aacini : Evidently an old trick, but I'd not seen it before. Any clue why it's not used on SO more often? – Magoo Mar 09 '23 at 18:34
  • @Aacini: First, I'm sorry. It was not my intention to break netiquette. Second, I wasn't aware that spaces have such an impact. In the code I posted in my comment, Stack Overflow stripped the https, so my link still works. With your new code I get back: '/╦êm╩îs.t├ª╩â/' - but you already said that it is not possible to handle Unicode, so do I still need JScript code? Many thanks. Edit: I edited the original code in my post to use the same link. – Franklyn W. R. Tigier Mar 09 '23 at 20:25
  • @Magoo: Mmmm... If you open my page and search for "split", you'll get _a lot_ of answers based on this method. like [this](https://stackoverflow.com/questions/23600775/split-string-with-string-as-delimiter/33131797#33131797), or [this](https://stackoverflow.com/questions/48808255/split-by-three-spaces-batchscript/48811742#48811742), etc... [This](https://stackoverflow.com/a/48696809/778560) is an answer from another author, and in [this question](https://stackoverflow.com/questions/70436066/extract-substring-with-set-command-in-cmd-script) there are 3 similar answers from different people. – Aacini Mar 13 '23 at 04:02
0
@ECHO OFF
SETLOCAL

SET "converted=<span class="pro">/<span class="dip">bu:m</span>/</span>"
SET converted
SET "converted=%converted:<= %"
SET "converted=%converted:>= %"
SET "grab="
FOR %%e IN (%converted%) DO IF DEFINED grab (
 SET "converted=%%e"
 GOTO done
 ) ELSE IF %%e=="dip" SET "grab=y"
:done
SET converted
GOTO :EOF

Replace the redirectors with spaces, then process the value as a series of space-separated tokens. When the token "dip" appears, set grab so that the next token is captured; capture it and exit the FOR loop.

Note: converted will be mangled if "dip" does not appear.
Also: what appears to be a colon between bu and m is actually a Unicode character. I replaced it with a colon for testing.
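
The token scan described above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: `tokenAfter` is a hypothetical name, and the tokenizer approximates a batch `FOR` list by keeping quoted strings whole and splitting on spaces, `=`, `,` and `;`:

```javascript
// Hypothetical sketch of the token scan: replace "<" and ">" with spaces,
// split into tokens (a quoted string stays one token), then return the
// token that follows the wanted one.
function tokenAfter(html, wanted) {
    var spaced = html.replace(/[<>]/g, ' ');
    var tokens = spaced.match(/"[^"]*"|[^\s=,;"]+/g) || [];
    for (var i = 0; i < tokens.length - 1; i++) {
        if (tokens[i] === wanted) {
            return tokens[i + 1];
        }
    }
    return null; // wanted token absent, or nothing follows it
}

var sample = '<span class="pro">/<span class="dip">bu:m</span>/</span>';
var grabbed = tokenAfter(sample, '"dip"');
```

Because a quoted string is kept as a single token, a class value containing spaces such as `"ipa dipa lpr-2 lpl-1"` can be matched the same way.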

Magoo
  • Hi, the output is following: 'converted=/bu╦Ém/ us converted= /span span class="pron dpron" / span class="ipa dipa lpr-2 lpl-1" bu╦Ém /span / /span /span span class="us dpron-i " span class="region dreg" us /span span class="daud"' – Franklyn W. R. Tigier Mar 09 '23 at 10:28
  • Since your new string does not contain `"dip"`, you need to change `"dip"` to `"ipa dipa lpr-2 lpl-1"` to match the appropriate substring. – Magoo Mar 09 '23 at 18:44
  • I changed it and it seems to work, but as Aacini said, there is still an issue with the Unicode. :( Do you have an idea how to fix this? – Franklyn W. R. Tigier Mar 09 '23 at 20:37
0

Thank you for all the tips. With @Reino's suggestion and the tip about echoing Unicode characters, I was able to get what I need.

@ECHO OFF
chcp 65001

xidel -s "https://dictionary.cambridge.org/de/worterbuch/englisch/boom" -e "(//span[@class='pron dpron']/span[@class='ipa dipa lpr-2 lpl-1'])[1]"

PAUSE
GOTO :EOF
  • I indicated to use `chcp 65001` command to generate the Unicode character in [my answer](https://stackoverflow.com/a/75677045/778560) three days ago... – Aacini Mar 13 '23 at 04:15