Hi, I want to extract all the text from the URL patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&d=PALL&s1=6700867.PN.

The text on this page comes after "br /" tags.

But when I try to extract the text using the tag name br, it returns empty strings.

Here is part of the code that I tried:

Set HTMLbrs = HTMLDoc.getElementsByTagName("br")

For Each HTMLbr In HTMLbrs
        Debug.Print htmlbr.innertext
Next HTMLbr

The final aim of the code is to check whether a paragraph present in Excel belongs to this website or not. The paragraphs could be from any section, and their exact location cannot be known. The code aims to verify that the text was taken from this website.

The response text that I receive has all the paragraphs, but they have line breaks and `<br>` tags in the middle, due to which InStr cannot be used.

When I tried to remove line feeds with Replace(responseText, vbLf, " "), the whole response text got divided into paragraphs of 1023 characters each, because of which I again could not use InStr.

I have used Application.WorksheetFunction.Clean(responseText), Replace(responseText, vbCr, " ") and Replace(responseText, vbCrLf, " "), but none of them gave the desired result.

user1987
  • Your provided link doesn't produce any result. – SIM Jun 11 '20 at 06:19
  • Somehow, when I paste the link here, it does not open the page. Could you please copy the link and paste it into your browser? That seems to work: patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&d=PALL&s1=6700867.PN. – user1987 Jun 11 '20 at 07:11
  • The `br` tags you're trying to scrape are not closed due to bad design of the website. If you take a look at the HTML source code you'll see the opening tag `<br>` but the closing tag `</br>` is nowhere to be found. Which **specific** part of the webpage do you need?
    – Stavros Jon Jun 11 '20 at 10:30
  • I have a set of paragraphs in an Excel sheet, and I need to check whether the paragraphs belong to the given webpage (part of reviewing a report). The paragraphs could be from the "claims" section or the description section. – user1987 Jun 11 '20 at 11:15
  • If it is ensured that the paragraphs to be compared can only occur on a certain page, then you only need to check with `inStr()` whether the paragraph you are looking for is part of the text of the whole web page. – Zwenn Jun 11 '20 at 11:31
  • @zwenn The issue is the line breaks, because of which I cannot use InStr(). I need to remove the line breaks, and I haven't been successful at that. – user1987 Jun 11 '20 at 11:49
  • If you get the whole text from the body tag with `innertext` all br tags are `vbCrLf` after that. Like I show in my answer below. – Zwenn Jun 11 '20 at 11:55

2 Answers

When scraping the web, you may consider using Puppeteer: https://github.com/puppeteer/puppeteer

That aside, "br" is used for line breaks, so it is normal to get empty strings, as there is nothing inside the tag.
https://www.w3schools.com/tags/tag_br.asp

I think the best answers to the following two posts may help you:
Reading HTML file in VBA Excel
VBA to fetch html URL from webpage

The way I would do it is to get the whole HTML and save it in a variable or text file (depending on the size). Then I would use string manipulation in VBA to get the parts situated between "br" tags.
https://www.excel-easy.com/vba/string-manipulation.html
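
A minimal sketch of that string-manipulation idea, assuming the raw HTML is already in a string. The function name `SplitOnBrTags` and the `Chr$(1)` marker are made up for illustration, only the three plain spellings of the tag are handled, and any other tags stay in the returned pieces:

```vba
'Hypothetical helper: split raw HTML into the pieces separated by br tags
Function SplitOnBrTags(ByVal html As String) As String()
  Dim marked As String
  Dim marker As String

  'Assumption: this control character never occurs in the page itself
  marker = Chr$(1)

  'Replace the common spellings of the tag with the marker, ignoring case
  marked = Replace(html, "<br />", marker, 1, -1, vbTextCompare)
  marked = Replace(marked, "<br/>", marker, 1, -1, vbTextCompare)
  marked = Replace(marked, "<br>", marker, 1, -1, vbTextCompare)

  'Each array element is the text found between two br tags
  SplitOnBrTags = Split(marked, marker)
End Function
```

It could be called as `parts = SplitOnBrTags(http.responseText)`, where `parts` is declared as `Dim parts() As String`, and then looped over from `0` to `UBound(parts)`.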

Sriks

Edit

Here is an example that scrapes the whole page text as one block without any HTML tags:

Sub PatentScrapeWholeText()

Dim url As String
Dim http As Object
Dim htmlDoc As Object
Dim pageText As String

  'Initialize variables
  url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&d=PALL&s1=6700867.PN."
  Set htmlDoc = CreateObject("htmlfile")
  Set http = CreateObject("MSXML2.XMLHTTP.6.0")

  'Load page
  http.Open "GET", url, False
  http.send

  'Check if page loading was successful
  If http.Status = 200 Then
    'Build html document for DOM operations
    htmlDoc.body.innerHTML = http.responseText

    'Get page text without any html tags
    pageText = htmlDoc.getElementsByTagName("body")(0).innertext

    'Here you can see the first part of the page text
    'You can delete this, it's only to show you the text is plain
    MsgBox pageText

    '************************************************************
    'Compare your paragraphs here with pageText
    '************************************************************
  Else
      'Page not loaded
      MsgBox "Error with website address"
  End If
End Sub
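
For the comparison step itself, the line breaks in the page text may still get in the way of `InStr`. One possible approach, only a sketch, is to collapse all whitespace in both the page text and the Excel paragraph before comparing. The helper names `NormalizeWhitespace` and `ParagraphOnPage` are made up for this example, and the case-insensitive comparison is an assumption about what counts as a match:

```vba
'Hypothetical helper: collapse every run of whitespace into one space
Function NormalizeWhitespace(ByVal source As String) As String
  Dim result As String

  'Turn line breaks, tabs and non-breaking spaces into plain spaces
  result = Replace(source, vbCrLf, " ")
  result = Replace(result, vbCr, " ")
  result = Replace(result, vbLf, " ")
  result = Replace(result, vbTab, " ")
  result = Replace(result, ChrW(160), " ")

  'Collapse repeated spaces until none are left
  Do While InStr(result, "  ") > 0
    result = Replace(result, "  ", " ")
  Loop

  NormalizeWhitespace = Trim$(result)
End Function

'Hypothetical helper: True if the paragraph occurs in the page text
Function ParagraphOnPage(ByVal paragraph As String, ByVal pageText As String) As Boolean
  Dim needle As String

  needle = NormalizeWhitespace(paragraph)

  'An empty paragraph is reported as not found
  If Len(needle) = 0 Then Exit Function

  'Assumption: upper and lower case differences do not matter
  ParagraphOnPage = InStr(1, NormalizeWhitespace(pageText), needle, vbTextCompare) > 0
End Function
```

In the sub above, something like `ParagraphOnPage(ActiveSheet.Cells(2, 1).Value, pageText)` could go where the comparison comment is. The cell address is only a placeholder.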

Original posting

Uiuiui, a page from the deepest 90s. Badly structured. Here is an example of how to get the text you want. For text from other areas you have to find your own solution. I identified this table by its width attribute: it is the only one with a width of 90%.

Sub PatentScrape()

Dim url As String
Dim http As Object
Dim htmlDoc As Object
Dim nodeAllTables As Object
Dim nodeOneTable As Object
Dim splitArray() As String
Dim paragraph As Long
Dim currentRow As Long

  'Initialize variables
  url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&d=PALL&s1=6700867.PN."
  currentRow = 2
  Set htmlDoc = CreateObject("htmlfile")
  Set http = CreateObject("MSXML2.XMLHTTP.6.0")

  'Load page
  http.Open "GET", url, False
  http.send

  'Check if page loading was successful
  If http.Status = 200 Then
    'Build html document for DOM operations
    htmlDoc.body.innerHTML = http.responseText

    'Create node collection from all tables of the page
    Set nodeAllTables = htmlDoc.getElementsByTagName("table")

    'Search for the table with 90% width attribute
    For Each nodeOneTable In nodeAllTables
      If nodeOneTable.getAttribute("width") = "90%" Then
        'Found the right table
        Exit For
      End If
    Next nodeOneTable

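    'Stop if no table with 90% width was found (nodeOneTable is Nothing then)
    If nodeOneTable Is Nothing Then
      MsgBox "Table with 90% width not found"
      Exit Sub
    End If

    'innerText returns the br tags as normal line breaks (vbCrLf)
    'We use this to split the text into its paragraphs
    splitArray = Split(nodeOneTable.innertext, vbCrLf)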

    'Write all paragraphs to the active Excel sheet
    For paragraph = 0 To UBound(splitArray)
      ActiveSheet.Cells(currentRow, 1).Value = splitArray(paragraph)
      currentRow = currentRow + 1
    Next paragraph
  Else
      'Page not loaded
      MsgBox "Error with website address"
  End If
End Sub
Zwenn