Inputs
Page object classes, in their __init__ method,
must define input parameters with type hints pointing to input classes.
Those input classes may be:
Other page object classes.
Item classes, when using a framework that can provide item classes.
Any other class that subclasses Injectable or is registered or decorated with Injectable.register.
Based on the target URL and parameter type hints, frameworks automatically build the required objects at run time, and pass
them to the __init__ method of the corresponding page object class.
For example, if a page object class has an __init__ parameter of type
HttpResponse, and the target URL is
https://example.com, your framework would send an HTTP request to
https://example.com, download the response, build an
HttpResponse object with the response data,
and pass it to the __init__ method of the page object class being used.
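The mechanism described above can be illustrated with a toy sketch that uses only the standard library. It is not how any particular framework is implemented: FakeHttpResponse, TitlePage, fake_download, PROVIDERS and build_page are all hypothetical names introduced here, and the "download" returns canned data instead of sending a request.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class FakeHttpResponse:
    """Hypothetical stand-in for an input class such as HttpResponse."""

    url: str
    body: bytes


class TitlePage:
    """Hypothetical page object that declares its input through a type hint."""

    def __init__(self, response: FakeHttpResponse):
        self.response = response


def fake_download(url: str) -> FakeHttpResponse:
    # A real framework would send an HTTP request here; this returns canned data.
    return FakeHttpResponse(url=url, body=b"<h1>Title</h1>")


# Maps each supported input class to a function that builds it from the target URL.
PROVIDERS = {FakeHttpResponse: fake_download}


def build_page(page_cls, url: str):
    """Read the __init__ type hints, build each input, and instantiate the page."""
    hints = get_type_hints(page_cls.__init__)
    hints.pop("return", None)  # ignore a return annotation, if there is one
    kwargs = {name: PROVIDERS[input_cls](url) for name, input_cls in hints.items()}
    return page_cls(**kwargs)


page = build_page(TitlePage, "https://example.com")
print(page.response.url)
```

The page object class never builds its own inputs. It only names their types, which is what lets the same class run under different frameworks.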
Built-in input classes
Warning
Not all frameworks support all web-poet built-in input classes.
The web_poet.page_inputs module defines multiple classes that you can
define as inputs for a page object class, including:
HttpResponse, a complete HTTP response, including URL, headers, and body. This is the most common input for a page object class. See Working with HttpResponse.
HttpClient, to send additional requests.
RequestUrl, the target URL before following redirects. Useful, for example, to skip the target URL download, and instead use HttpClient to send a custom request based on parts of the target URL.
PageParams, to receive data from the crawling code.
Stats, to write key-value data pairs during parsing that you can inspect later, e.g. for debugging purposes.
BrowserResponse, which includes URL, status code and BrowserHtml of a rendered web page.
Tip
You can use BrowserPage instead of ItemPage to have BrowserResponse as input and get convenient shortcuts for working with it.
AnyResponse, which either holds BrowserResponse or HttpResponse as the .response instance, depending on which one is available or is more appropriate.
Working with HttpResponse
HttpResponse has many attributes and methods.
Tip
You can use WebPage instead of
ItemPage to have
HttpResponse as input and get
convenient shortcuts for working with it.
To get the entire response body, you can use body for
the raw bytes, text for the str
(decoded with the detected encoding), or json() to load a JSON response as a Python data structure:
>>> response.body
b'{"foo": "bar"}'
>>> response.text
'{"foo": "bar"}'
>>> response.json()
{'foo': 'bar'}
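The relationship between the three views of the body can be approximated with the standard library alone. This is only a rough sketch: it assumes the body is UTF-8, whereas HttpResponse detects the encoding, and it does not reproduce web-poet's own implementation.

```python
import json

body = b'{"foo": "bar"}'     # raw bytes, comparable to response.body
text = body.decode("utf-8")  # assumption: UTF-8; web-poet detects the encoding
data = json.loads(text)      # comparable to what response.json() returns

print(text)
print(data)
```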
There are also methods to select content from responses: jmespath() for JSON and css() and
xpath() for HTML and XML:
>>> response.jmespath("foo")
[<Selector query='foo' data='bar'>]
>>> response.css("h1::text")
[<Selector query='descendant-or-self::h1/text()' data='Title'>]
>>> response.xpath("//h1/text()")
[<Selector query='//h1/text()' data='Title'>]
Working with BrowserResponse
BrowserResponse is similar to HttpResponse, but for
browser-rendered pages. In addition to the text
attribute, it has an html attribute containing
the rendered HTML (as a str) after JavaScript execution.
Like HttpResponse, it provides css()
and xpath() methods to select content from
the rendered page:
>>> response.html
'<html><head>...</head><body><h1>Title</h1>...</body></html>'
>>> response.css("h1::text")
[<Selector query='descendant-or-self::h1/text()' data='Title'>]
>>> response.xpath("//h1/text()")
[<Selector query='//h1/text()' data='Title'>]
Custom input classes
You may define your own input classes if you are using a framework that supports it.
However, note that custom input classes may make your page object classes less portable across frameworks.
Input annotations
A type hint that points to an input class can be annotated with
Annotated. For example:
from typing import Annotated
from web_poet.page_inputs.http import HttpResponse
from web_poet.pages import WebPage
class MyPage(WebPage):
    def __init__(self, response: Annotated[HttpResponse, "my-metadata"]): ...
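To see what such an annotation carries, the metadata can be read back with the standard typing helpers. The sketch below uses a hypothetical FakeResponse class and a plain MyPage class in place of the web-poet ones so that it runs without any third-party package. How a given framework interprets the metadata is not shown here.

```python
from typing import Annotated, get_args, get_type_hints


class FakeResponse:
    """Hypothetical stand-in for an input class such as HttpResponse."""


class MyPage:
    def __init__(self, response: Annotated[FakeResponse, "my-metadata"]): ...


# include_extras=True keeps the Annotated wrapper instead of stripping it.
hint = get_type_hints(MyPage.__init__, include_extras=True)["response"]

# The first argument is the input class; the remaining ones are the metadata.
input_cls, *metadata = get_args(hint)
print(input_cls.__name__, metadata)
```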
web-poet requires annotations to be JSON-serializable, for fixture
support. Because Annotated requires annotations to
be hashable, web-poet provides annotation_encode() to support
list and dict structures in annotations. For example:
from typing import Annotated
from web_poet import annotation_encode
from web_poet.page_inputs.http import HttpResponse
from web_poet.pages import WebPage
class MyPage(WebPage):
    def __init__(
        self, response: Annotated[HttpResponse, annotation_encode({"foo": ["bar"]})]
    ): ...
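The general idea behind encoding list and dict structures into hashable values can be sketched as follows. to_hashable is a hypothetical helper written for this illustration. It is not the encoding that annotation_encode() actually produces, and it assumes dict keys that can be sorted, such as strings.

```python
from typing import Annotated, get_args


def to_hashable(value):
    """Hypothetical helper: recursively turn dicts and lists into tuples."""
    if isinstance(value, dict):
        # Sort the items so that equal dicts always encode to the same tuple.
        return tuple(sorted((key, to_hashable(item)) for key, item in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(item) for item in value)
    return value


raw = {"foo": ["bar"]}
encoded = to_hashable(raw)

# A dict cannot be hashed, but its tuple encoding can.
try:
    hash(raw)
    raw_is_hashable = True
except TypeError:
    raw_is_hashable = False

hint = Annotated[int, encoded]
print(raw_is_hashable, encoded, hash(hint) == hash(Annotated[int, encoded]))
```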