
I have a function that goes to a URL and crawls its content (including subpages). I then want to load the text content into langchain's VectorstoreIndexCreator(). How can I do this via a loader? I could not find any suitable loader in langchain.document_loaders. Should I use BaseLoader for it? How?

My code:

import requests
from bs4 import BeautifulSoup

import openai
from langchain.docstore.document import Document
from langchain.indexes import VectorstoreIndexCreator


def get_company_info_from_web(company_url: str, max_crawl_pages: int = 10, questions=None):

    # goes to the URL and collects links
    links = get_links_from_page(company_url)

    documents = []
    # get_text_content_from_page goes to each URL and yields (text, url) tuples
    for text, url in get_text_content_from_page(links[:max_crawl_pages]):
        # add text content (string) to the index
        # loader???? how do I do this via a loader instead?
        documents.append(Document(page_content=text, metadata={"source": url}))

    index = VectorstoreIndexCreator().from_documents(documents)

    # Finally, query the vector database:
    DEFAULT_QUERY = "What does the company do? Who are the key people in this company? Can you tell me contact information?"
    query = questions or DEFAULT_QUERY
    logger.info(f"Query: {query}")
    result = index.query_with_sources(query)

    logger.info(f"Result:\n {result['answer']}")
    logger.info(f"Sources:\n {result['sources']}")

    return result['answer'], result['sources']


1 Answer


Yes, you can use the WebBaseLoader, which uses BeautifulSoup behind the scenes to parse the page content.

See the sample below:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(your_url)
scrape_data = loader.load()

You can also load multiple web pages by passing a list of URLs:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.load()

To load multiple web pages concurrently, use the aload() method:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload() # <-------- here

You may run into issues when loading concurrently if you already have a running asyncio event loop: the call will fail with a nested-event-loop error such as "RuntimeError: This event loop is already running". You can resolve this with the nest_asyncio library, which patches asyncio to allow nested event loops. See the sample below:

import nest_asyncio
from langchain.document_loaders import WebBaseLoader

nest_asyncio.apply()

loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload()
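For reference, the nested-event-loop error comes from plain asyncio itself, not from langchain. This standalone snippet (no langchain or nest_asyncio required) reproduces the error that nest_asyncio's patch works around:

```python
import asyncio


async def inner():
    return 42


async def outer():
    # Trying to drive the event loop from inside a coroutine that is
    # already running on that loop is exactly the "nested loop" case.
    loop = asyncio.get_running_loop()
    coro = inner()
    try:
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


msg = asyncio.run(outer())
print(msg)  # → This event loop is already running
```

Applying nest_asyncio.apply() before such a call is what makes the nested run succeed instead of raising.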
  • Thanks. I tried to use it but I am getting `ValueError: Expected metadata value to be a str, int, or float, got None` when my function body is: `links = get_links_from_page(valid_url)`; `loader = WebBaseLoader(links)`; `index = VectorstoreIndexCreator().from_loaders([loader])` (the error occurs here); then `DEFAULT_QUERY = f"What does {company_name} do?"`; `query = questions or DEFAULT_QUERY`; `result = index.query(query)`. (Not possible to format this better in comments :/) – PetrSevcik Jun 18 '23 at 20:39
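Regarding the ValueError in the comment above: it suggests that some scraped pages produce metadata entries whose value is None (e.g. a missing page title or description), while the underlying vector store only accepts str, int, or float metadata values. One workaround is to sanitize each document's metadata before indexing. `clean_metadata` below is a hypothetical helper sketch, not part of langchain:

```python
def clean_metadata(metadata: dict) -> dict:
    """Drop metadata entries whose values the vector store cannot store.

    Keeps only str, int, and float values; None (and any other type)
    is filtered out.
    """
    return {
        key: value
        for key, value in metadata.items()
        if isinstance(value, (str, int, float))
    }


# Example: metadata from a scraped page with no <title> tag
raw = {"source": "https://example.com", "title": None, "length": 1234}
cleaned = clean_metadata(raw)
print(cleaned)  # → {'source': 'https://example.com', 'length': 1234}
```

You could apply this to each Document returned by `loader.load()` (mutating `doc.metadata`) and then build the index with `from_documents(...)` instead of `from_loaders(...)`.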