I am trying to deploy a web scraper (Python Selenium in a Docker container) to DigitalOcean, so that a different website can send it a URL with `requests.post`. When I run the code from the console on DigitalOcean it works fine, so I believe the issue is in how I am posting the URL to the scraper or how the scraper receives it. Currently the request returns `<Response [422]>`.

Here is the code. I pass `extract_text_via_scraper_service` a URL as a string, e.g. "https://google.com", and the Docker app should return the page title in a dictionary:

import os

import requests

SCRAPER_API_TOKEN_HEADER = os.environ.get("SCRAPER_API_TOKEN_HEADER")
SCRAPER_API_ENDPOINT = os.environ.get("SCRAPER_API_ENDPOINT")

def extract_text_via_scraper_service(website): # website = url
    answer = {}
    if SCRAPER_API_ENDPOINT is None:
        return answer
    if SCRAPER_API_TOKEN_HEADER is None:
        return answer
    if website is None:
        return answer

    # send url through HTTP POST
    # return dict {}
    headers={
        "Authorization": f"Bearer {SCRAPER_API_TOKEN_HEADER}"
    }

    r = requests.post(SCRAPER_API_ENDPOINT, data=website, headers=headers)
    print(r)
    if r.status_code in range(200, 300):
        if r.headers.get("content-type") == 'application/json':
            answer = r.json()

    return answer

The app running in the Docker container:

import pathlib
import os
import io
from functools import lru_cache
from fastapi import (
    FastAPI,
    Header,
    HTTPException,
    Depends,
    Request,
    )
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException, NoSuchElementException, StaleElementReferenceException, TimeoutException, ElementClickInterceptedException
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates
from pydantic import BaseSettings

class Settings(BaseSettings):
    app_auth_token: str
    debug: bool = False
    echo_active: bool = False
    app_auth_token_prod: str = None
    skip_auth: bool = False

    class Config:
        env_file = ".env"

@lru_cache
def get_settings():
    return Settings()

settings = get_settings()

DEBUG=settings.debug

BASE_DIR = pathlib.Path(__file__).parent
UPLOAD_DIR = BASE_DIR / "uploads"


app = FastAPI()
templates = Jinja2Templates(directory=str(BASE_DIR / "templates"))
# REST API

@app.get("/", response_class=HTMLResponse) # http GET -> JSON
def home_view(request: Request, settings:Settings = Depends(get_settings)):
    return templates.TemplateResponse("home.html", {"request": request, "abc": 123})

def verify_auth(authorization = Header(None), settings:Settings = Depends(get_settings)):
    if settings.debug and settings.skip_auth:
        return
    if authorization is None:
        raise HTTPException(detail="Invalid endpoint", status_code=401)
    label, token = authorization.split()
    if token != settings.app_auth_token:
        raise HTTPException(detail="Invalid endpoint", status_code=401)

@app.post("/") # http POST
async def prediction_view(website, authorization = Header(None), settings:Settings = Depends(get_settings)):
    verify_auth(authorization, settings)

    options = webdriver.ChromeOptions()
    options.headless = True
    options.add_argument("--headless")
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-gpu')

    driver = webdriver.Chrome("/usr/local/bin/chromedriver", options=options)

    wait = WebDriverWait(driver, 10)
    driver.get(website)

    
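    # Fallback value returned if the title cannot be read
    title = "Sorry, we failed to get the correct name"

    try:
        title = driver.find_element(By.XPATH, "//title")
        title = title.get_attribute("innerText")
    except Exception:
        pass
    finally:
        driver.quit()  # release the browser process after each request

    print(title)

    return {"results": title, "original": title}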

Any help appreciated.

Nick
  • https://stackoverflow.com/questions/16133923/400-vs-422-response-to-post-of-data does this help? – grumpyp Nov 05 '21 at 08:16
  • It confirms that it is not an issue with authorisation. So I suspect the issue relates to how I am posting / receiving the url, it seems to not understand the datatype, but I just tried changing the docker parameter to website:str and it still returns the same issue unfortunately. – Nick Nov 05 '21 at 08:33
  • what does `r.text` return? – grumpyp Nov 05 '21 at 08:59
  • r.text returns `{"detail":[{"loc":["body"],"msg":"value is not a valid dict","type":"type_error.dict"}]}`. I am thinking I need to encode the url somehow first. – Nick Nov 05 '21 at 09:05
  • https://github.com/tiangolo/fastapi/issues/3373 there we go :) – grumpyp Nov 05 '21 at 09:17
  • @grumpyp - I have updated the receiving app to declare `website: str`, and the `requests.post` call now uses `json={'website': website}`, but I am now getting the following from `print(r.text)`: `{"detail":[{"loc":["query","website"],"msg":"field required","type":"value_error.missing"}]}` and from `print(r.json)` `>` – Nick Nov 05 '21 at 11:29
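
When reading 422 responses like the two quoted in the comments above, it can help to flatten FastAPI's `detail` list into one line per problem. The following is a minimal sketch using only the standard library; the helper name `summarize_validation_errors` is made up for illustration, and the two sample strings are the `r.text` values copied from the comments.

```python
import json


def summarize_validation_errors(response_text):
    """Turn a FastAPI error body into a list of 'location: message' strings."""
    detail = json.loads(response_text).get("detail", [])
    if not isinstance(detail, list):
        # HTTPException bodies carry a plain string in "detail"
        return [str(detail)]
    summaries = []
    for error in detail:
        # "loc" lists where the value was looked for, e.g. ["query", "website"]
        location = ".".join(str(part) for part in error.get("loc", []))
        summaries.append(f"{location}: {error.get('msg', '')}")
    return summaries


first = '{"detail":[{"loc":["body"],"msg":"value is not a valid dict","type":"type_error.dict"}]}'
second = '{"detail":[{"loc":["query","website"],"msg":"field required","type":"value_error.missing"}]}'

print(summarize_validation_errors(first))
print(summarize_validation_errors(second))
```

The second summary reads `query.website: field required`, which shows that the server looked for `website` in the query string rather than in the JSON body that was posted.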

1 Answer

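The issue was in the signature of the POST handler. A bare `website` parameter is read from the query string, not from the request body, so the posted data never reached it.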

old:

@app.post("/") # http POST
async def prediction_view(website, authorization = Header(None), settings:Settings = Depends(get_settings)):
    verify_auth(authorization, settings)

new:

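from pydantic import BaseModel

class Item(BaseModel):
    website: str

@app.post("/") # http POST
async def prediction_view(requested_url: Item, authorization = Header(None), settings:Settings = Depends(get_settings)):
    verify_auth(authorization, settings)
    # the URL now comes from the JSON body, e.g. driver.get(requested_url.website)
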

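Make sure the client posts the data as JSON, for example with `json={'website': website}` in the `requests.post` call, so that the key matches the `website` field of the model. The FastAPI documentation on request bodies notes that function parameters are recognized as follows: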

  • If the parameter is also declared in the path, it will be used as a path parameter.

  • If the parameter is of a singular type (like int, float, str, bool, etc) it will be interpreted as a query parameter.

  • If the parameter is declared to be of the type of a Pydantic model,
    it will be interpreted as a request body.
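
On the client side, the matching change is to send the URL as a JSON object instead of a raw string. Below is a minimal sketch, assuming the endpoint and token are passed in as arguments rather than read from the environment. The `post` parameter, `FakeResponse`, and `fake_post` are hypothetical names added so the example runs without network access; in real use you would pass `requests.post`.

```python
def extract_text_via_scraper_service(website, endpoint, token, post):
    """POST {"website": <url>} as JSON and return the decoded reply, or {}.

    `post` is any callable with the same keyword interface as requests.post.
    """
    if not (website and endpoint and token):
        return {}
    headers = {"Authorization": f"Bearer {token}"}
    # json= (not data=) makes requests serialize the dict and set
    # Content-Type: application/json, which a Pydantic body model expects.
    response = post(endpoint, json={"website": website}, headers=headers)
    if not 200 <= response.status_code < 300:
        return {}
    # The header may carry a charset suffix, so compare by prefix.
    content_type = response.headers.get("content-type", "")
    if content_type.startswith("application/json"):
        return response.json()
    return {}


class FakeResponse:
    """Stand-in for requests.Response, for demonstration only."""
    status_code = 200
    headers = {"content-type": "application/json"}

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def fake_post(url, json=None, headers=None):
    # Echo back the JSON body so the caller can see what was sent.
    return FakeResponse({"received": json})


print(extract_text_via_scraper_service(
    "https://google.com", "http://localhost:8000/", "secret", fake_post))
```

With the real library the call would look like `extract_text_via_scraper_service(url, SCRAPER_API_ENDPOINT, SCRAPER_API_TOKEN_HEADER, requests.post)`.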

Nick