
I want to write a function that takes a URL (in our case https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL&guccounter=2) and uses jsoup with pattern matching to extract the numbers nested inside a div and a span tag.

Goal: I am looking to get the values 153,982,000, 125,481,000, 105,392,000 and 105,718,000 from the Current Liabilities row, for the columns 9/30/2022, 9/30/2021, 9/30/2020 and 9/30/2019.

In the browser inspector, these values sit under the following tags:

Current Liabilities 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>153,982,000</span></div>

Current Liabilities 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>125,481,000</span></div>

Current Liabilities 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>105,392,000</span></div>

Current Liabilities 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>105,718,000</span></div>
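To illustrate the pattern-matching idea on the markup quoted above, here is a minimal sketch that uses only java.util.regex. The class name FinColExtractor and the method extractValues are made up for the example, and it assumes the markup keeps exactly this shape (a div with data-test="fin-col" wrapping a single span). A regex like this is brittle on real pages, so treat it as a sketch rather than a replacement for a proper selector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or of the original code.
class FinColExtractor {
    // Matches a div carrying data-test="fin-col" whose only child is a span,
    // and captures the span text. Assumes the markup shape quoted above.
    private static final Pattern FIN_COL = Pattern.compile(
            "<div[^>]*\\bdata-test=\"fin-col\"[^>]*>\\s*<span>([^<]*)</span>\\s*</div>");

    static List<String> extractValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = FIN_COL.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened class attributes; the real ones are much longer.
        String html =
                "<div class=\"Ta(c) D(tbc)\" data-test=\"fin-col\"><span>153,982,000</span></div>"
              + "<div class=\"Ta(c) Bgc($lv1BgColor) D(tbc)\" data-test=\"fin-col\"><span>125,481,000</span></div>";
        System.out.println(extractValues(html));
    }
}
```

Note that this matches every fin-col cell in the HTML it is given, so it would need to be applied to the HTML of a single row to get only the Current Liabilities values.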

Please fix my current code shown below, and also return the row and column data in a structured form instead of a flat list of strings:

private static List<String> balancefetch(String url) throws IOException {
    String userAgent1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
    Document doc = Jsoup.connect(url).userAgent(userAgent1).get();
    List<String> balanceValues = new ArrayList<>();

    Element totalAssets = doc.select("[title=Total Assets]").first();
    Elements totalAssetsData = totalAssets.parent().siblingElements();
    for (Element e : totalAssetsData) {
        Log.d("totalAssetsData", e.text());
        balanceValues.add(e.text());
    } //It works so far.

The lines below don't work because there is no element with title="Current Liabilities" on the page.

    Element currentLiabilities = doc.select("[title=Current Liabilities]").first();
    Elements currentLiabilitiesData = currentLiabilities.parent().siblingElements();
    for (Element e : currentLiabilitiesData) {
        Log.d("currentLiabilitiesData", e.text());
        balanceValues.add(e.text());
    }
    return balanceValues;
}

Please help me get the output in the following format:

  • dates = {9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019}

  • CurrentLiabilities = {153,982,000, 125,481,000, 105,392,000, 105,718,000 }
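One possible way to hold the output in this format, once the strings have been scraped, is a map from column date to numeric value. The sketch below is only an illustration: the class BalanceRow and the methods parseAmount and toRow are hypothetical names, and it assumes every cell is a plain comma-grouped integer (a cell holding "-" or an empty string would make Long.parseLong throw).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for one balance-sheet row.
class BalanceRow {
    // Converts "153,982,000" to 153982000L by dropping the thousands separators.
    static long parseAmount(String text) {
        return Long.parseLong(text.replace(",", "").trim());
    }

    // Pairs each column date with the value at the same position in the row.
    static Map<String, Long> toRow(List<String> dates, List<String> values) {
        if (dates.size() != values.size()) {
            throw new IllegalArgumentException("dates and values must have the same length");
        }
        Map<String, Long> row = new LinkedHashMap<>(); // preserves column order
        for (int i = 0; i < dates.size(); i++) {
            row.put(dates.get(i), parseAmount(values.get(i)));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList("9/30/2022", "9/30/2021", "9/30/2020", "9/30/2019");
        List<String> values = Arrays.asList("153,982,000", "125,481,000", "105,392,000", "105,718,000");
        System.out.println("CurrentLiabilities = " + toRow(dates, values));
    }
}
```

A LinkedHashMap is used so that iterating the row returns the columns in the same order as on the page.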

EDIT:

Please also fix my code to extract the date values, which are under:

Date1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(ib) Fw(b)"><span>9/30/2022</span></div>

Date2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(ib) Fw(b) Bgc($lv1BgColor)"><span>9/30/2021</span></div>

Date3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(ib) Fw(b)"><span>9/30/2020</span></div>

Date4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(ib) Fw(b) Bgc($lv1BgColor)"><span>9/30/2019</span></div>
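If the header dates are needed as real dates rather than strings, they could be converted with java.time after extraction. This is a rough sketch under the assumption that each header span contains nothing but a date in month/day/year form, as in the four snippets above; HeaderDates and extractDates are names invented for the example, and on Android java.time needs API level 26 or desugaring.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the header row of the table.
class HeaderDates {
    // Captures span text that looks like a US-style date, e.g. 9/30/2022.
    private static final Pattern DATE_SPAN =
            Pattern.compile("<span>\\s*(\\d{1,2}/\\d{1,2}/\\d{4})\\s*</span>");
    private static final DateTimeFormatter US_DATE = DateTimeFormatter.ofPattern("M/d/yyyy");

    static List<LocalDate> extractDates(String html) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher m = DATE_SPAN.matcher(html);
        while (m.find()) {
            // Throws DateTimeParseException if the digits are not a valid date.
            dates.add(LocalDate.parse(m.group(1), US_DATE));
        }
        return dates;
    }

    public static void main(String[] args) {
        String html =
                "<div class=\"D(ib) Fw(b)\"><span>9/30/2022</span></div>"
              + "<div class=\"D(ib) Fw(b) Bgc($lv1BgColor)\"><span>9/30/2021</span></div>";
        System.out.println("dates = " + extractDates(html));
    }
}
```

Spans that hold amounts such as 153,982,000 do not match the date pattern, so they are skipped.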

Zac1

1 Answer

  • For the "Current Liabilities" extraction, the issue is that the data only appears in the page after you click the arrow next to the "Total Liabilities Net Minority Interest" text (an empty div element is then populated with the data). This means you cannot get the data with jsoup alone. It seems possible to trigger the button click with HtmlUnit (https://htmlunit.sourceforge.io/); you then get a string of the generated HTML containing the Current Liabilities, which you can parse with jsoup's parse() method. Source, on Stack Overflow: "Can Jsoup simulate a button press?"

  • Here is the code to access and display the Dates:

      List<String> dateValues = new ArrayList<>();
      Element breakdown = doc.select("span:contains(Breakdown)").first();
      Elements datesData = breakdown.parent().siblingElements();
      for (Element e : datesData) {
        System.out.println("date- "+ e.text()); // for test purpose
        dateValues.add(e.text());
      }
      System.out.println("dates = "+dateValues);
    

Result in the console: dates = [9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019]

Hope this helps.

EDIT: here is an HtmlUnit code example, though it does not manage to retrieve the Current Liabilities data:

public static void main(String[] args) throws IOException {
    String url = "https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL&guccounter=2";

    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        HtmlPage page = webClient.getPage(url); // load the page
        // Wait for the page's scripts; calling this before getPage() returns at once because nothing is scheduled yet
        webClient.waitForBackgroundJavaScript(5000 * 2);
        // Get the arrow button that expands the liabilities rows
        List<DomElement> totalLiabilityButton = page.getByXPath("//button[@aria-label='Total Liabilities Net Minority Interest']");
        // Click the arrow button to trigger the JavaScript click event and populate the div element with the Current Liabilities data.
        // Normally the new content should then be available in the returned page, but it doesn't work; I don't understand why.
        HtmlPage page2 = totalLiabilityButton.get(0).click();
        totalLiabilityButton = page2.getByXPath("//button[@aria-label='Total Liabilities Net Minority Interest']"); // reload the button
        // Here you should have all 3 Current Liabilities data
        DomNodeList<DomNode> children = totalLiabilityButton.get(0).getParentNode().getParentNode().getParentNode().getParentNode().getNextSibling().getChildNodes();
    }
}
Sakuragi
  • There is no need for Jsoup if you are using HtmlUnit - see e.g. https://stackoverflow.com/questions/75252331/page-content-couldnt-be-seen-by-jsoup-and-httpclient/75255515#75255515 – RBRi Feb 03 '23 at 17:18
  • @Sakuragi Do you have an example of how to use HtmlUnit to extract the expandable row's values? – Zac1 Feb 06 '23 at 23:40
  • @Zac1 I've tried to extract the row values with HtmlUnit but cannot get them. Seems HtmlUnit cannot retrieve the new div content even by triggering the click event on the arrow "button" at "Total Liabilities Net Minority Interest". I don't know why. – Sakuragi Feb 27 '23 at 17:50
  • I've added the HtmlUnit code example in my answer, in case it can help – Sakuragi Feb 27 '23 at 18:00