
I am new to Python and want to use Luigi in some of my data processing scripts. I have two tasks: the first scrapes some data from the web and creates a CSV file. The second (which depends on the first task's CSV file) runs a SQL Server stored procedure to load the CSV data into the database. When I run these tasks individually they work fine, but when I add a requires() method I get the error shown in the title.

Please can you let me know what I am doing wrong?

The Luigi full error is as follows:

Runtime error:
Traceback (most recent call last):
  File "C:\Users\somepath\Luigi\venv\python\lib\site-packages\luigi\worker.py", line 182, in run
    raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependency at run time: Generate_TV_WebScraping_File_X__DataMining_Lu_8213e479cf

Apologies for the indentation in the code sample below; the formatting changed when I pasted it.

My current code is below:

import requests
from bs4 import BeautifulSoup
import re
import pyodbc
import luigi
import arrow

class Generate_TV_WebScraping_File(luigi.ExternalTask):

    input_path = luigi.Parameter('X:/somefilepath/Televised_Football_Staging.csv')

    def ouptut(self):
        return luigi.LocalTarget(self.input_path)

    def run(self):
        ################################ GET DATA FROM WEBSITE ###############################################

        ## set url
        page_link = 'https://www.somewebsite.html'

        ## request access with timeout of 5 seconds
        page_response = requests.get(page_link, timeout=5)

        ## BS to parse the html
        page_content = BeautifulSoup(page_response.content, "html.parser")

        ## find all content related to match fixtures div class
        div_team = page_content.findAll('div', attrs={"class":"span4 matchfixture"})

        clean_date = ''

        ## set path and file for data export
        f = open("X:\somefilepath\Televised_Football_Staging.csv", "w")

        ## for all the content in div class 'row-fluid'
        for rows in page_content.findAll('div', attrs={"class":"row-fluid"}):
        ## if the content div class is match date
            if rows.findAll('div', attrs={"class": "span12 matchdate"}):
        ## save it to the variable 'date_row'
              date_row = rows.findAll('div', attrs={"class": "span12 matchdate"})
        ## clean it by removing html tags and comma separate
              concat_rows = ",".join(str(x) for x in date_row)
              clean_date = re.sub("<.*?>", " ", concat_rows)
        ## when it is not a match date in the div class 'row-fluid' and it is the match fixture content
            elif rows.findAll('div', attrs={"class": "span4 matchfixture"}):
        ## clean it by removing html tags and comma separate
                concat_rows = ",".join(str(x) for x in rows)
                clean_rows = re.sub("<.*?>", " ", concat_rows)
        ## print the content and concatenate with date
                f.write('%s\n' % (clean_rows + "," + clean_date))

        ## Close csv
        f.close()

        #######################################################################################################


class Insert_TV_WebScraping_To_Db(luigi.Task):

    def requires(self):
        return Generate_TV_WebScraping_File(self.input_path)

    def ouptut(self):
        sys_date = arrow.now().format('YYYYMMDD')
        return luigi.LocalTarget('X:/somefilepath/tv_webscrape_log_' + sys_date + '.txt')

    def run(self):
        ############################### INSERT DATA INTO DATABASE ###################################################

        ## set sql connection string to DataMiningDev
        cnxn = pyodbc.connect(driver="{SQL Server}", server="someserver", database="somedatabase", autocommit=True)

        ## run sql query
        cursor = cnxn.cursor()
        cursor.execute('EXEC somedatabase.someschema.somedbproc')

        ## being kind
        cnxn.close()

        #############################################################################################################


# Run Luigi Tasks #
#luigi.run(main_task_cls=Generate_TV_WebScraping_File)
luigi.run(main_task_cls=Insert_TV_WebScraping_To_Db)
DB-93
  • Please format your code better and paste the entire Luigi log separately. Also use a `with` statement to read and write files. You should not be referring to the param `input_path` in both tasks: the output of the first task becomes the input of the second. You are also using too many literal constants; define the default as a global variable and reuse it. It is also a good idea to grab the path using `self.input().path` and `self.output().path` where appropriate. Please make the posted example simpler: fake or cut out the DB operations, just to see how one task connects to another. – Leonid Mar 19 '19 at 15:39
  • Hello, thanks for your response and criticism, it'll help me a lot! I've updated the code formatting so it's easier to read, and I've pasted the full Luigi console error output. I will tidy up the literal constants once the code is fully working, but thanks for spotting them and for your suggestion. You can ignore the def run(self): operation in the 2nd task; all it does is execute a SQL Server procedure. But the task itself is not being executed because of the Luigi error/dependency on the 1st task, and this is what I cannot figure out! Any ideas? – DB-93 Mar 19 '19 at 16:26
  • There is still a problem: other people cannot run this code without modifying it. For example, `return Generate_TV_WebScraping_File(self.input_path)` refers to `self.input_path`, which does not exist. You posted code that is neither self-contained nor the final version that you are working with. The file path is tied to Windows, and not all of us use it, nor should we have to. It would be better to use local paths in the code sample that you posted. – Leonid Mar 19 '19 at 16:36
  • You may find the detailed steps in this answer helpful. https://stackoverflow.com/a/43326656/1317713 One important thing to look for is - is the file that you expect to be created actually created? If not, then Luigi will think that the dependent task did not succeed. – Leonid Mar 19 '19 at 17:15
  • @DB-93 Did you solve it? In case you haven't, try replacing `f =open("X:\somefilepath\Televised_Football_Staging.csv", "w")` with a call to the output function, to make sure that file gets generated. – KansaiRobot Sep 18 '20 at 09:20
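
One thing worth checking, following the comments above: both classes in the posted code define `ouptut` rather than `output`. As far as I know, Luigi's base `Task.output()` returns an empty list and `Task.complete()` returns False for a task with no outputs, so a task whose `output` is misspelled would still look incomplete after its `run()` finishes, which may be what "Unfulfilled dependency at run time" is reporting. The sketch below is a simplified pure-Python model of that check, not Luigi itself; `Task`, `MisspelledTask`, `CorrectTask` and `run_and_check` are hypothetical names made up for illustration.

```python
import os
import tempfile


class Task:
    """Simplified stand-in for luigi.Task (hypothetical, not the real class)."""

    def output(self):
        # Default when a subclass does not override output(): no targets.
        return []

    def complete(self):
        # A task with no outputs never counts as complete;
        # otherwise every declared output file must exist.
        paths = self.output()
        return bool(paths) and all(os.path.exists(p) for p in paths)


class MisspelledTask(Task):
    def __init__(self, path):
        self.path = path

    def ouptut(self):  # typo: this does NOT override output()
        return [self.path]

    def run(self):
        with open(self.path, "w") as f:
            f.write("fixture,date\n")


class CorrectTask(Task):
    def __init__(self, path):
        self.path = path

    def output(self):
        return [self.path]

    def run(self):
        # Write to the declared output path rather than a second literal.
        with open(self.output()[0], "w") as f:
            f.write("fixture,date\n")


def run_and_check(task):
    """Run the task, then report whether it counts as complete."""
    task.run()
    return task.complete()


tmp_dir = tempfile.mkdtemp()
misspelled_ok = run_and_check(MisspelledTask(os.path.join(tmp_dir, "a.csv")))
correct_ok = run_and_check(CorrectTask(os.path.join(tmp_dir, "b.csv")))
print(misspelled_ok, correct_ok)  # prints: False True
```

In this model the misspelled task writes its file but still reports as incomplete, because the completeness check never sees the path. If that is the cause here, renaming `ouptut` to `output` in both tasks and writing the CSV to `self.output().path` would be the first things to try. Separately, `luigi.ExternalTask` is meant for outputs produced outside Luigi, so a task with its own `run()` would normally subclass `luigi.Task`.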

0 Answers