URL

This page covers how to load HTML documents from a list of URLs into a Document format that can be used downstream.

from langchain.document_loaders import UnstructuredURLLoader

API Reference: UnstructuredURLLoader from langchain.document_loaders

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

If loading fails with an SSL verification error, pass ssl_verify=False together with headers=headers when constructing the loader.
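As a minimal sketch of that call: the page never defines `headers`, so the User-Agent dictionary below is an assumption, and `ssl_workaround_kwargs` and `build_loader` are hypothetical helper names. The LangChain import is deferred so the helpers can be defined without LangChain installed.

```python
from typing import Any, Dict, List


def ssl_workaround_kwargs(user_agent: str = "Mozilla/5.0") -> Dict[str, Any]:
    """Keyword arguments that disable SSL verification and send custom headers.

    The User-Agent value is an illustrative assumption; substitute whatever
    headers the target site expects.
    """
    return {"ssl_verify": False, "headers": {"User-Agent": user_agent}}


def build_loader(urls: List[str]):
    """Construct an UnstructuredURLLoader with the SSL workaround applied."""
    # Imported lazily so that defining this helper does not require LangChain.
    from langchain.document_loaders import UnstructuredURLLoader

    return UnstructuredURLLoader(urls=urls, **ssl_workaround_kwargs())
```

Disabling verification removes the protection that certificate checks provide, so this is best limited to sites you already trust.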

loader = UnstructuredURLLoader(urls=urls)

data = loader.load()
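`loader.load()` returns a list of LangChain `Document` objects, each carrying a `page_content` string and a `metadata` dictionary. The sketch below uses a stand-in dataclass with the same two fields, so it runs without LangChain or network access, to show one way of inspecting such a list. The `preview` helper and the sample documents are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Stand-in with the same two fields as LangChain's Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def preview(docs: List[Document], width: int = 40) -> List[str]:
    """Return one summary line per document: source, length, leading text."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "<unknown>")
        # Collapse runs of whitespace so the snippet fits on one line.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"{source} ({len(doc.page_content)} chars): {snippet}")
    return lines


# Placeholder data standing in for the output of loader.load().
data = [
    Document("First page text.\n\nMore text.", {"source": "https://example.com/a"}),
    Document("Second page text.", {"source": "https://example.com/b"}),
]

for line in preview(data):
    print(line)
```

Because `preview` only reads `page_content` and `metadata`, it should work unchanged on the real `data` list produced by the loader.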

Selenium URL Loader

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.

Using Selenium lets us load pages that require JavaScript to render.
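To illustrate why a browser is needed, the sketch below runs a purely static text extraction (Python's standard `html.parser`, not Selenium or any LangChain code) over a hypothetical page whose only content is injected by a script. The page markup and the `static_text` helper are assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gathers text outside <script>/<style>, as a static scraper would."""

    def __init__(self) -> None:
        super().__init__()
        self.in_code = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_code = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_code = False

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style source.
        if not self.in_code and data.strip():
            self.text.append(data.strip())


def static_text(html: str) -> str:
    """Extract visible text without executing any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text)


# Hypothetical page whose content is filled in by a script at runtime.
JS_PAGE = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "Loaded by JS";</script>
</body></html>
"""

print(repr(static_text(JS_PAGE)))  # prints ''
```

The static pass finds no visible text because the `<div>` is only populated once a browser executes the script, which is the step a browser-driven loader performs before extracting text.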

Setup

To use the SeleniumURLLoader, you will need to install selenium and unstructured.

from langchain.document_loaders import SeleniumURLLoader

API Reference: SeleniumURLLoader from langchain.document_loaders

urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]

loader = SeleniumURLLoader(urls=urls)

data = loader.load()

Playwright URL Loader

This covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader.

As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.

Setup

To use the PlaywrightURLLoader, you will need to install playwright and unstructured. You will also need to install the Playwright browser binaries, which include the Chromium browser:

# Install the Python packages, then download the Playwright browser binaries
pip install "playwright"
pip install "unstructured"
playwright install

from langchain.document_loaders import PlaywrightURLLoader

API Reference: PlaywrightURLLoader from langchain.document_loaders

urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]

loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])

data = loader.load()
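The `remove_selectors` argument above drops page elements such as the header and footer before text is extracted. The sketch below imitates that effect on a static HTML string with Python's standard `html.parser`. It is not how PlaywrightURLLoader is implemented, and it handles only bare tag names rather than full CSS selectors. `text_without` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class TagStrippingExtractor(HTMLParser):
    """Collects text, ignoring anything nested inside the given tags."""

    def __init__(self, remove_tags: Iterable[str]) -> None:
        super().__init__()
        self._remove = {t.lower() for t in remove_tags}
        self._depth = 0  # how many removed elements we are currently inside
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._remove:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._remove and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._depth == 0:
            self.chunks.append(text)


def text_without(html: str, remove_tags: Iterable[str]) -> str:
    """Return the page text with the named elements left out."""
    parser = TagStrippingExtractor(remove_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical markup with navigation and legal text around the main content.
SAMPLE = (
    "<body><header><nav>Home | About</nav></header>"
    "<main><p>Main article text.</p></main>"
    "<footer>Copyright notice</footer></body>"
)

print(text_without(SAMPLE, ["header", "footer"]))  # prints: Main article text.
```

Removing such elements keeps navigation links and copyright notices out of `page_content`, so downstream steps see only the main body text.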
