Metadata-Version: 2.1
Name: Protego
Version: 0.2.1
Summary: Pure-Python robots.txt parser with support for modern conventions
Home-page: https://github.com/scrapy/protego
Author: Anubhav Patel
Author-email: anubhavp28@gmail.com
License: BSD
Description: =======
        Protego
        =======
        
        .. image:: https://img.shields.io/pypi/pyversions/protego.svg
           :target: https://pypi.python.org/pypi/protego
           :alt: Supported Python Versions
        
        .. image:: https://img.shields.io/travis/scrapy/protego/master.svg
           :target: https://travis-ci.org/scrapy/protego
           :alt: Build Status
        
        Protego is a pure-Python ``robots.txt`` parser with support for modern
        conventions.
        
        
        Install
        =======
        
        To install Protego, simply use pip:
        
        .. code-block:: none
        
            pip install protego
        
        
        Usage
        =====
        
        >>> from protego import Protego
        >>> robotstxt = """
        ... User-agent: *
        ... Disallow: /
        ... Allow: /about
        ... Allow: /account
        ... Disallow: /account/contact$
        ... Disallow: /account/*/profile
        ... Crawl-delay: 4
        ... Request-rate: 10/1m                 # 10 requests every 1 minute
        ... 
        ... Sitemap: http://example.com/sitemap-index.xml
        ... Host: http://example.co.in
        ... """
        >>> rp = Protego.parse(robotstxt)
        >>> rp.can_fetch("http://example.com/profiles", "mybot")
        False
        >>> rp.can_fetch("http://example.com/about", "mybot")
        True
        >>> rp.can_fetch("http://example.com/account", "mybot")
        True
        >>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
        False
        >>> rp.can_fetch("http://example.com/account/contact", "mybot")
        False
        >>> rp.crawl_delay("mybot")
        4.0
        >>> rp.request_rate("mybot")
        RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
        >>> list(rp.sitemaps)
        ['http://example.com/sitemap-index.xml']
        >>> rp.preferred_host
        'http://example.co.in'
        
        Using Protego with Requests_:
        
        >>> from protego import Protego
        >>> import requests
        >>> r = requests.get("https://google.com/robots.txt")
        >>> rp = Protego.parse(r.text)
        >>> rp.can_fetch("https://google.com/search", "mybot")
        False
        >>> rp.can_fetch("https://google.com/search/about", "mybot")
        True
        >>> list(rp.sitemaps)
        ['https://www.google.com/sitemap.xml']
        
        .. _Requests: https://3.python-requests.org/
        
        
        Comparison
        ==========
        
        The following table compares Protego to the most popular ``robots.txt`` parsers
        implemented in Python or featuring Python bindings:
        
        +----------------------------+---------+-----------------+--------+---------------------------+
        |                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
        +============================+=========+=================+========+===========================+
        | Implementation language    | Python  | Python          | C++    | Python                    |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | `Length-based precedence`_ | ✓       |                 | ✓      |                           |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | Performance_               |         | +40%            | +1300% | -25%                      |
        +----------------------------+---------+-----------------+--------+---------------------------+
        
        .. _Google: https://developers.google.com/search/reference/robots_txt
        .. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
        .. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
        .. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
        .. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values
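
        The wildcard and length-based precedence rules can be illustrated with a
        short, self-contained sketch (a simplified model of the rules from Google's
        specification, not Protego's actual implementation): ``*`` matches any
        sequence of characters, ``$`` anchors the end of the URL, and among all
        ``Allow``/``Disallow`` patterns that match a path, the longest pattern wins,
        with ``Allow`` winning ties.

        ```python
        import re

        def pattern_to_regex(pattern):
            # Translate a robots.txt path pattern into a regex:
            # '*' matches any sequence of characters, '$' anchors the end of the URL.
            parts = []
            for ch in pattern:
                if ch == "*":
                    parts.append(".*")
                elif ch == "$":
                    parts.append("$")
                else:
                    parts.append(re.escape(ch))
            return re.compile("".join(parts))

        def can_fetch(path, rules):
            # rules is a list of (allow, pattern) pairs. The matching pattern with
            # the greatest length wins; on a tie, Allow beats Disallow.
            best = None
            for allow, pattern in rules:
                if pattern_to_regex(pattern).match(path):
                    key = (len(pattern), allow)
                    if best is None or key > best:
                        best = key
            return True if best is None else best[1]

        # The rules from the Usage example above.
        rules = [
            (False, "/"),
            (True, "/about"),
            (True, "/account"),
            (False, "/account/contact$"),
            (False, "/account/*/profile"),
        ]
        print(can_fetch("/about", rules))                   # True
        print(can_fetch("/account/myuser/profile", rules))  # False
        ```

        This toy matcher reproduces the outcomes of the Usage doctest above, but it
        ignores user-agent groups, percent-encoding, and the other details a real
        parser must handle.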
        
        
        API Reference
        =============
        
        Class ``protego.Protego``:
        
        Properties
        ----------
        
        *   ``sitemaps`` {``list_iterator``} An iterator over the sitemap URLs
            specified in ``robots.txt``.
        
        *   ``preferred_host`` {string} Preferred host specified in ``robots.txt``.
        
        
        Methods
        -------
        
        *   ``parse(robotstxt_body)`` Parse ``robots.txt`` and return a new instance of
            ``protego.Protego``.
        
        *   ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can
            fetch the URL, otherwise return ``False``.
        
        *   ``crawl_delay(user_agent)`` Return the crawl delay specified for the user
            agent as a float. If nothing is specified, return ``None``.
        
        *   ``request_rate(user_agent)`` Return the request rate specified for the user
            agent as a named tuple ``RequestRate(requests, seconds, start_time,
            end_time)``. If nothing is specified, return ``None``.
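
        As an illustration of how a crawler might consume these return values, the
        sketch below converts a ``RequestRate`` into a minimum delay between
        consecutive requests. ``RequestRate`` is reconstructed here as a plain named
        tuple with the documented fields, since only its shape matters for the
        example; ``delay_between_requests`` is a hypothetical helper, not part of
        Protego's API.

        ```python
        from collections import namedtuple

        # Stand-in with the documented fields; the real class lives inside protego.
        RequestRate = namedtuple("RequestRate", "requests seconds start_time end_time")

        def delay_between_requests(rate, default=0.0):
            # A crawler honouring ``Request-rate: 10/1m`` should wait at least
            # seconds/requests between consecutive fetches (6 seconds here).
            if rate is None:
                return default
            return rate.seconds / rate.requests

        rate = RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
        print(delay_between_requests(rate))  # 6.0
        ```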
        
Keywords: robots.txt,parser,robots,rep
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*
Description-Content-Type: text/x-rst
