1/17/2024

Feedsearch Crawler is a Python library for searching websites for RSS, Atom, and JSON feeds.

It is a continuation of my work on Feedsearch, which is itself a continuation of the work done by Dan Foreman-Mackey on Feedfinder2, which in turn is based on feedfinder, originally written by Mark Pilgrim and subsequently maintained by others.

Feedsearch Crawler differs from all of the above in that it is built as an asynchronous web crawler for Python 3.7 and above, using asyncio and aiohttp, to allow much more rapid scanning of possible feed URLs. An implementation using this library to provide a public Feed Search API is also available. Pull requests and suggestions are welcome.

The library is available on PyPI:

    pip install feedsearch-crawler

Feedsearch Crawler is called with the single function search:

    >>> from feedsearch_crawler import search
    >>> feeds = search('example.com')  # the site to search for feeds

search accepts the following arguments:

url: Union[str, List[str]]: The initial URL or list of URLs at which to search for feeds.
crawl_hosts: bool: (default True): An optional argument to add the site host origin URL to the list of initial crawl URLs. If False, site metadata and favicon data may not be found.
try_urls: Union[List[str], bool]: (default False): An optional list of URL paths to query for feeds. Takes the origins of the url parameter and appends the provided paths. If no list is provided, but try_urls is True, then a list of common feed locations will be used.
concurrency: int: (default 10): An optional argument to specify the maximum number of concurrent HTTP requests.
total_timeout: float: (default 30.0): An optional argument to specify the time this function may run before timing out.
request_timeout: float: (default 3.0): An optional argument that controls how long before each individual HTTP request times out.
user_agent: str: An optional argument to override the default User-Agent header.
max_content_length: int: (default 10MB): An optional argument to specify the maximum size in bytes of each HTTP response.
max_depth: int: (default 10): An optional argument to limit the maximum depth of requests while following URLs.
headers: dict: An optional dictionary of headers to pass to each HTTP request.
favicon_data_uri: bool: (default True): Optionally control whether to fetch found favicons and return them as a Data URI.
delay: float: (default 0.0): An optional argument to delay each HTTP request by the specified time in seconds. Used in conjunction with the concurrency setting to avoid overloading sites.

In addition to the url, FeedInfo objects may have the following values:

bozo: int: Set to 1 when feed data is not well formed or may not be a feed.
content_length: int: Current length of the feed in bytes.
content_type: str: Content-Type value of the returned feed.
favicon: URL: URL of the feed or site favicon.
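The try_urls behaviour described above (taking the origin of each input URL and appending candidate paths) can be sketched in plain Python. This is an illustration, not the library's actual code, and the specific feed paths listed here are an assumption:

```python
# Illustrative sketch (not feedsearch-crawler's implementation) of how
# try_urls could derive candidate feed URLs from a page's origin.
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of common feed locations, for illustration only.
COMMON_FEED_PATHS = ["/feed", "/rss", "/atom.xml", "/rss.xml", "/feed.xml"]

def candidate_feed_urls(url: str, paths=COMMON_FEED_PATHS) -> list:
    """Return origin-based candidate URLs to query for feeds."""
    parts = urlsplit(url)
    # Keep only scheme and host: the "origin" of the input URL.
    origin = urlunsplit((parts.scheme, parts.netloc, "", "", ""))
    return [origin + path for path in paths]

print(candidate_feed_urls("https://example.com/blog/post-1"))
# First candidate is https://example.com/feed
```

Passing a custom list as try_urls would simply replace the default paths in a scheme like this one.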
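The interaction of the concurrency and delay settings can also be sketched with stdlib asyncio: a semaphore caps the number of in-flight requests, and each request optionally sleeps first to avoid overloading a site. This is a minimal sketch under those assumptions, not how the library implements it; `fetch` here is a stand-in for a real aiohttp request:

```python
# Minimal sketch of concurrency-limited, optionally delayed crawling.
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real HTTP request; just returns the URL.
    await asyncio.sleep(0)
    return url

async def crawl(urls, concurrency: int = 10, delay: float = 0.0):
    semaphore = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async def bounded_fetch(url):
        async with semaphore:
            if delay:
                await asyncio.sleep(delay)  # spread out requests to one site
            return await fetch(url)

    # gather preserves the order of the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))

results = asyncio.run(
    crawl(["https://example.com/a", "https://example.com/b"], concurrency=2)
)
print(results)
```

With delay greater than zero, each request waits before firing even when the semaphore would allow it, which is why the two settings are documented as working together.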
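Finally, a max_content_length limit is typically enforced by reading a response body in chunks and aborting once the limit is exceeded, rather than buffering an arbitrarily large body in memory. A hypothetical sketch (the function name and return convention are inventions for illustration):

```python
# Hypothetical sketch of enforcing a maximum response size while streaming.
MAX_CONTENT_LENGTH = 10 * 1024 * 1024  # 10MB, matching the documented default

def read_limited(chunks, max_content_length=MAX_CONTENT_LENGTH):
    """Accumulate byte chunks, returning None once the limit is exceeded."""
    body = bytearray()
    for chunk in chunks:
        body.extend(chunk)
        if len(body) > max_content_length:
            return None  # response too large; a crawler would skip this URL
    return bytes(body)

print(read_limited([b"a" * 4, b"b" * 4], max_content_length=16))
```

Streaming with an early abort means an oversized response costs at most max_content_length bytes of memory before it is discarded.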