
Hi, I have a column containing different PDF links, and I wrote a lambda function that is supposed to take each link from that column, extract the text of the PDF, and return it in another column called Content. This is the line of code I used:

result = result.assign(Content = lambda x: ( urltotext(x['Source']) ))

Here's my whole code:

import requests
import pandas as pd
from datetime import datetime
from datetime import date
import json
import urllib.request
import PyPDF2
import fitz
import multiprocessing
import requests_cache 


def urltotext(link):
    # Download the PDF at `link` to a local file, then extract the text of
    # every page with PyMuPDF (fitz); returns "none" if anything fails.
    try:
        req = urllib.request.urlopen(link)
        file = open("DailyCA.pdf", 'wb')
        file.write(req.read())
        file.close()
        doc = fitz.open("DailyCA.pdf")
        text = []
        for page in doc:
            temptext = page.get_text('text')
            text.append(temptext)
        # join the per-page strings with a marker so pages can be split later
        text = 'shodhpage'.join(map(str, text))
        doc.close()
        return text
    except Exception as e:
        text = "none"
        return text


def all():
    print("Started Pulling")
    currentd = date.today()
    s = requests_cache.CachedSession('demo_cache', backend='sqlite')
    headers = {
        'Host': 'www.nseindia.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
    }
    # warm-up request so the session picks up NSE cookies before hitting the API
    url = 'https://www.nseindia.com/'
    step = s.get(url, headers=headers)
    today = datetime.now().strftime('%d-%m-%Y')
    # date range is hard-coded here (the `today` string above is unused)
    api_url = 'https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=01-01-2022&to_date=18-08-2022'
    resp = s.get(api_url,headers=headers).json()
    print("API Read")
    result = pd.DataFrame(resp)
    result.drop(['difference', 'dt','exchdisstime','csvName','old_new','orgid','seq_id','bflag','symbol','sort_date'], axis = 1, inplace = True)
    result.rename(columns = {'an_dt':'DateandTime', 'attchmntFile':'Source','attchmntText':'Topic','desc':'Type','smIndustry':'Sector','sm_name':'Company Name','sm_isin':'ISIN'}, inplace = True)
    result[['Date','Time']] = result.DateandTime.str.split(expand=True)
    result = result[result['Type'].str.contains("Loss of Share Certificates|Copy of Newspaper Publication") == False]
    result['Type'] = result['Type'].astype(str)
    result['Type'].replace("Certificate under SEBI (Depositories and Participants) Regulations, 2018",'Junk' , inplace = True)
    result = result[result['Type'].str.contains("Junk") == False]
    result = result[result["Type"].str.contains("Trading Window") == False]
    result = result[result["Type"].str.contains("Loss of share certificate") == False]
    result = result[result["Type"].str.contains("Loss of share certificates") == False]
    result = result[result["Type"].str.contains("Disclosure under SEBI Takeover Regulations") == False]
    result = result[result["Type"].str.contains("Newspaper Advertisements") == False]
    result = result[result["Type"].str.contains("-") == False]
    result.drop_duplicates(subset='Source', keep = 'first', inplace = True)
    result['Temporary']=pd.to_datetime(result['Date']+' '+result['Time'])
    result['Date']=result['Temporary'].dt.strftime('%b %d, %Y')
    result['Time']=result['Temporary'].dt.strftime('%R %p')
    result['DateTime'] = result['Temporary'].dt.strftime('%m/%d/%Y %I:%M %p')
    result.drop(['DateandTime', 'Temporary'], axis = 1, inplace = True)
    result = result.assign(Content = lambda x: ( urltotext(x['Source']) ))
    result.to_csv("2018-Test.csv")

all()

And here's what sample rows and columns of the dataset I am using look like:

,Type,Source,Company Name,ISIN,Sector,Topic,Date,Time,DateTime,Equity,NSE
0,Updates,https://archives.nseindia.com/corporate/MOIL_01012019221502_Letter_SE_01012019_Change_Price_MnOre_160.pdf,MOIL Limited,INE490G01020,Metals,MOIL Limited has informed the Exchange regarding 'Fixation of prices of different grades Manganese Ore for 4th Quarter 2018-19 (January-March 2019) effective from 01.01.2019'.,"Jan 01, 2019",22:16 PM,01/01/2019 10:16 PM,yes,yes
1,Appointment,https://archives.nseindia.com/corporate/ICICIPRULI_01012019210925_SE_Intimation_appointment_of_director_01_01_2019_159.pdf,ICICI Prudential Life Insurance Company Limited,INE726G01019,,"ICICI Prudential Life Insurance Company Limited has informed the Exchange regarding Appointment of Ms Vibha Paul Rishi as Non- Executive Independent Director of the company w.e.f. January 01, 2019.","Jan 01, 2019",21:27 PM,01/01/2019 09:27 PM,yes,yes
2,Cessation,https://archives.nseindia.com/corporate/SUULD_01012019203402_IntimationLetter_158.pdf,Suumaya Industries Limited,INE591Q01016,,"Suumaya Lifestyle Limited has informed the Exchange regarding Cessation of Ms Priya Gandhi as Company Secretary & Compliance Officer of the company w.e.f. November 16, 2018.","Jan 01, 2019",20:35 PM,01/01/2019 08:35 PM,yes,yes
3,Updates,https://archives.nseindia.com/corporate/COALINDIA_01012019194358_01012019193112_156.pdf,Coal India Limited,INE522F01014,Mining,Coal India Limited has informed the Exchange regarding 'Provisional Production and offtake performance of CIL and its Subsidiary Companies for the month of Dec 18 and for the period Apr18 to Dec 18'.,"Jan 01, 2019",19:44 PM,01/01/2019 07:44 PM,yes,yes

1 Answer


I believe your problem is that urllib.request.urlopen expects a string (or a Request object) and you are providing a pd.Series object: inside assign, the lambda receives the whole DataFrame, so x['Source'] is the entire column, not a single link. You can either modify the call to use the values of the series, like

result = result.assign(Content = lambda x: ( urltotext(x['Source'].values) ))

or better, for readability, use the map function directly:

result["Content"] = result["Source"].map(urltotext)
Patrick H.
  • Will try this solution and let you know if it works or not. – Jay shankarpure Aug 18 '22 at 12:11
  • I have a question, Patrick: is it this method that is taking a lot of time to execute, or is it my code that is slow at text extraction? Is there any way to code it so that if a row takes more than 3 seconds to convert its PDF to text, it moves on to the next row? Thanks – Jay shankarpure Aug 18 '22 at 14:18
  • You need a separate thread for this. Check out this question [here](https://stackoverflow.com/questions/34562473/most-pythonic-way-to-kill-a-thread-after-some-period-of-time). I don't know how fast your internet connection is, but given that you are downloading PDFs from the same source, I would guess that there is some kind of rate limit for web scraping set up by the page provider. How many PDFs are you downloading? Maybe you can split the process to first download all of them and then process the PDFs? (A sketch of the timeout idea follows these comments.) – Patrick H. Aug 18 '22 at 14:46
  • Well, the dataset I am working with has around 50,000 rows, so I am downloading 50,000 PDFs and converting them to text. – Jay shankarpure Aug 18 '22 at 14:57
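
Regarding the 3-second limit discussed in the comments, here is a minimal sketch of the timeout idea. It is an assumption-laden outline, not a drop-in fix: it reuses the urltotext and result names from the question, the helper name urltotext_with_timeout and the pool size are made up, and concurrent.futures is used instead of a raw thread:

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# one shared pool; a timed-out download is NOT killed, it keeps running
# in the background and occupies a worker until it finishes on its own
pool = ThreadPoolExecutor(max_workers=8)

def urltotext_with_timeout(link, seconds=3):
    future = pool.submit(urltotext, link)
    try:
        # wait at most `seconds` for this row, then give up on waiting
        return future.result(timeout=seconds)
    except FuturesTimeout:
        return "none"  # same sentinel the question's except branch uses

result["Content"] = result["Source"].map(urltotext_with_timeout)
pool.shutdown(wait=False)

Note that because Python threads cannot be force-killed, and urltotext writes every download to the same DailyCA.pdf file, lingering background workers could clobber one another. Writing to a per-link temporary file, or simply passing timeout=3 to urllib.request.urlopen inside urltotext, would be a safer variant.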