
I am trying to extract the text from a web page with bs4 in Python (3.7). However, I do not get all of the paragraph text. Here is what I have tried:

import bs4
import requests

url = 'https://www.londonstockexchange.com/news-article/SMT/transaction-in-own-shares/14885136'
page = requests.get(url)
#soup = bs4.BeautifulSoup(page.text, 'html.parser')   <-- adding the html.parser does not work either
soup = bs4.BeautifulSoup(page.content)
soup.find_all('p')  # Only returns the text at the bottom of the document
soup.find_all('p', {'class': 'q'})  # returns nothing
soup.getText()  # Does not return the document text either

There is a shadow-root before the document text. I do not know what it does or whether it influences the results.

When I save page.content to a text file, the document text looks quite different from ordinary HTML tags, e.g.:

;p class=\&q;q\&q;&g;On 2 March 2021 the Company purchased&a;#160;1,000,000 ordinary shares at a price of &a;#160;1,193.07 p The shares purchased will be held in Treasury.&l;/p&g;\r\n&l;p class=\&q;v\&q;&g;Following the transaction the Company holds&a;#160;48,356,074 shares in Treasury.&l;/p&g

My question: What is causing this behavior, and how do I extract the text from the document?

RVA92
    It looks like you forgot to use 'html.parser' with bs4: ```soup = bs4.BeautifulSoup(page.text, 'html.parser')``` should solve your problem. – Panda50 May 07 '21 at 07:37
    Adding the html.parser does not solve the problem, unfortunately – RVA92 May 07 '21 at 07:40

2 Answers


The text is encoded using Angular's custom encoder and can be found in a script tag. After undoing that encoding, you can load the contents of the tag with json.loads(). Then find the article text in the resulting dictionary and parse its HTML with BeautifulSoup again, e.g.:

import json
from bs4 import BeautifulSoup
import requests

url = 'https://www.londonstockexchange.com/news-article/SMT/transaction-in-own-shares/14885136'
page = requests.get(url)
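soup = BeautifulSoup(page.content, 'html.parser')

# The page state is stored as escaped JSON inside the #ng-lseg-state script tag
data = soup.select_one('#ng-lseg-state').string
# Undo the custom escaping so that the string becomes valid JSON
for escaped, char in (('&q;', '"'), ('&l;', '<'), ('&g;', '>'), ('&a;', '&'), ('&s;', "'")):
    data = data.replace(escaped, char)
data = json.loads(data)

# The article body is an HTML string nested inside the page state
key = 'G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14885136&path=news-article'
article_html = data[key]['body']['components'][1]['content']['newsArticle']['value']
text = BeautifulSoup(article_html, 'html.parser')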

print(text.find('body').get_text(strip=True, separator='\n'))

This will output:

RNS Number : 9260Q
Scottish Mortgage Inv Tst PLC
02 March 2021
Scottish Mortgage Investment Trust PLC
Legal Entity Identifier:
213800G37DCS3Q9IJM38
Purchase of Own Securities
On 2 March 2021 the Company purchased 1,000,000 ordinary shares at a price of  1,193.07 p The shares purchased will be held in Treasury.
Following the transaction the Company holds 48,356,074 shares in Treasury.
The shares in issue less the total number of shares in Treasury are 1,436,424,806
The above figure ( 1,436,424,806 ) may be used by shareholders as the denominator for the calculations by which they will determine if they are required to notify their interest in, or a change to their interest in, Scottish Mortgage Investment Trust PLC under the FCA's Disclosure and Transparency Rules.
Baillie Gifford & Co Limited
Company Secretaries
2 March 2021
Regulated Information Classification:
Acquisition or disposal of the
issuer's own shares
This information is provided by RNS, the news service of the London Stock Exchange. RNS is approved by the Financial Conduct Authority to act as a Primary Information Provider in the United Kingdom. Terms and conditions relating to the use and distribution of this information may apply. For further information, please contact
rns@lseg.com
or visit
www.rns.com
.
RNS may use your IP address to confirm compliance with the terms and conditions, to analyse how you engage with the information contained in this communication, and to share such analysis on an anonymised basis with others as part of our commercial services. For further information about how RNS and the London Stock Exchange use the personal data you provide us, please see our
Privacy Policy
.
END
POSDKBBDOBKDONK
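
The two steps above (undoing the escaping and locating the article in the dictionary) can also be sketched with the standard library only. This is a hedged variant, not the method used above: it decodes the five escape sequences in a single pass, so that an escaped literal such as `&a;s;` decodes to `&s;` rather than to a quote, and it searches the nested data for the `newsArticle` key instead of relying on the exact path and list index. The names `unescape_state` and `find_first` and the `sample` string are hypothetical; the sample is a heavily shortened stand-in for the real script tag contents.

```python
import json
import re

# The five escape sequences handled by the replace() calls above.
UNESCAPE = {'&a;': '&', '&q;': '"', '&s;': "'", '&l;': '<', '&g;': '>'}
PATTERN = re.compile('|'.join(re.escape(seq) for seq in UNESCAPE))


def unescape_state(raw):
    """Decode all escape sequences in one left-to-right pass."""
    return PATTERN.sub(lambda match: UNESCAPE[match.group(0)], raw)


def find_first(node, key):
    """Depth-first search for `key` in nested dicts and lists."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None


# Hypothetical, heavily shortened stand-in for the script tag's contents.
sample = ('{&q;body&q;:{&q;components&q;:[{},{&q;content&q;:{&q;newsArticle&q;:'
          '{&q;value&q;:&q;&l;p class=\\&q;q\\&q;&g;Baillie Gifford &a;amp; Co&l;/p&g;&q;}}}]}}')

state = json.loads(unescape_state(sample))
article_html = find_first(state, 'newsArticle')['value']
print(article_html)
```

The resulting `article_html` string can then be passed to BeautifulSoup with 'html.parser', as in the code above.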
RJ Adriaansen

I'll add an answer so that I can post pictures.

I think I understand your problem. You are using requests to scrape content that takes a moment to appear; I think the website queries a database before it displays what you want.

Using this:

import bs4
import requests

url = 'https://www.londonstockexchange.com/news-article/SMT/transaction-in-own-shares/14885136'
page = requests.get(url)
soup = bs4.BeautifulSoup(page.text, 'html.parser')
soup.find_all('p')
soup.find_all('p', {'class': 'q'})
print(soup.text)

I get, as you probably already know, this result:


Transaction in Own Shares - 17:18:09 02 Mar 2021 - SMT News article | London Stock Exchange













Discover Discover Start your journey hereDiscover the world’s international exchangeOur regions Our regions United KingdomEuropeRussia, CIS and Central AsiaMiddle EastIsraelAfricaNorth and South AmericaAsia PacificChinaShanghai London Stock ConnectNews and insights News and insights LatestWelcome storiesInsights and reportsPress releasesEvents Events Welcome storiesLondon Stock Exchange Market CeremoniesLSEG LSEG LSEG products and servicesOur historyPrices and markets search Our regions News Shanghai London Stock ConnectNews and Prices News and Prices Start your journey hereNews and PricesFTSE indices FTSE indices FTSE 100FTSE 250FTSE 350FTSE All-SharePrices and Markets Prices and Markets Search by MarketBond searchRetail bonds searchETFs searchAdvanced bond searchNews News Today's newsRegulatory news (RNS)Reports Reports Issuers and instrumentsPrimary marketsSecondary marketsGroup statisticsNew issues New issues Upcoming issuesRecent issuesFTSE index values Tools and Services Risers and Fallers Personal InvestingRaise finance Raise finance Start your journey hereRaise financeEquity Equity AIMMain MarketCompare markets for listing equityGreen Economy MarkCalculating feesAssess your funding optionsHow to list equityUseful assetsDebt Debt Main MarketInternational Securities MarketSustainable Bond MarketCompare our debt marketsIssuer Services FlowOur productsCalculating feesHow to list debtUseful assetsETPs ETPs ETFsETCs and ETNsCalculating feesHow to list ETPsUseful assetsFunds Funds Compare markets for listing fundsSpecialist Fund SegmentListed Real Estate HubSPACsHow to list fundsUseful assetsFocus Focus Issuer ServicesSustainable FinanceAdmissions Self Service PortalNominated Advisers Issuer Services Prices and markets search Raise finance FAQsTrade Trade Start your journey hereTradeEquity trading Equity trading UK and European SecuritiesInternational Order BookGlobal Equity SegmentOff-book trade reportingUseful assetsDebt trading Debt trading Order Book for 
Retail BondsOrder Book for Fixed Income SecuritiesUseful assetsETP trading ETP trading ETPs - ICSD Settlement Trading ServiceRequest for QuoteETP Market Makers directoryUseful assetsMembership Membership Member firm directoryMember firm information sheetsUseful assetsUseful links Useful links Trading accessSpecial conditionsStamp Duty ExemptionTechnical libraryTrading access CurveGlobal Markets Turquoise Technical 
libraryPersonal investing Personal investing Start your journey herePersonal investing hubTools Tools My accountPrices and markets searchEmail alertsVirtual Portfolio and WatchlistHistorical Price ServiceFind a BrokerDirect Market AccessBroker directory Broker directory Execution onlyAdvisoryDiscretionaryAll articles All articles Why investing mattersWhat should you consider before investingWhat are stocks & sharesFTSE indices Prices and markets search FAQs Exchanging ideas: Impact investingResources Resources Start your journey hereResourcesRaise finance Raise finance Main MarketAIMDebtExchange 
Traded ProductsFundsTrade Trade Rules and regulationsTechnical libraryEquityDebtExchange Traded ProductsMembershipLondon Stock Exchange notices London Stock Exchange notices 2021202020192018201720162015Service Announcements Service Announcements 20212020201920182017FAQs FAQs Website FAQsMain MarketAIMDebtExchange Traded ProductsFundsAIM Notices Trading access Reports Prices and Markets Go to News Explorer RNSTransaction in Own Shares Share this article SCOTTISH MORTGAGE INVESTMENT TRUST PLC Transaction in Own Shares London Stock ExchangeSCOTTISH MORTGAGE INVESTMENT TRUST PLC Released 17:18:09 02 
March 202102 March 2021London Stock Exchange plc is not responsible for and does not check content on this Website. Website users are responsible for checking content. Any news item (including any prospectus) which is addressed solely to the persons and countries specified therein should not be relied upon other than by such persons and/or outside the specified countries. Terms and conditions, including restrictions on use and distribution apply.
 © 2021 London Stock Exchange plc. All rights reserved.

This corresponds to the text at the bottom of the document. That is because, when you load the page, this is displayed first:

[screenshot]

To avoid this problem, I would recommend scraping the data with Selenium instead of requests, which will make things easier.

Here is an example of how to do it: How can I parse a website using Selenium and Beautifulsoup in python?
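
A minimal sketch of that approach, assuming Selenium 4 and a local Chrome installation. The function name `fetch_rendered_html`, the `p.q` selector (taken from the class name in the question) and the 10-second timeout are assumptions for illustration. Whether the article paragraphs are reachable this way, for example if they sit inside a shadow root, has not been verified.

```python
def fetch_rendered_html(url, css_selector='p.q', timeout=10):
    """Return the page source once an element matching css_selector exists."""
    # Imported here so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until the element is present in the DOM, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned HTML can be passed to bs4.BeautifulSoup(html, 'html.parser') and queried with find_all('p') as in the code above.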

Panda50