Welcome to parslepy’s documentation!

parslepy lets you extract content from HTML and XML documents. Extraction rules are defined with a JSON object (or the equivalent Python dict): keys are the names you want to assign to target document sections, elements or attributes, and values are CSS3 selectors or XPath 1.0 expressions matching those document parts.

By default,

  • selectors matching elements output the matching element(s)’ textual content (content of children elements is also included);
  • selectors matching element attributes output the attribute’s value.

You can nest objects, generate lists of objects, and mix CSS and XPath – although not within the same selector.

Parslepy understands what lxml and cssselect understand, which is roughly CSS3 Selectors and XPath 1.0.
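
For example, here is a minimal sketch of both default behaviors, using a made-up HTML snippet and the "@attribute" selector form shown later in this documentation (exact output formatting may vary):

>>> import parslepy
>>> html = '<html><body><a class="link" href="/home">Home page</a></body></html>'
>>> rules = {"text": "a.link", "href": "a.link @href"}
>>> parslepy.Parselet(rules).parse_fromstring(html)
{'text': u'Home page', 'href': '/home'}

The first rule matches an element and outputs its text content; the second matches an attribute and outputs its value.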

Each rule should have the following format:

output key (mandatory)
    |
  optionality operator (optional)
    |   |
    |   |  scope, always within brackets (optional)
    |   |      |
    v   v      v
"somekey?(someselector)":   "someCSSSelector"

or         //           :   "someXPathExpression"

or         //           :   ["someXPathOrCSSExpression"]

or         //           :    { ...some other rules... }

or         //           :    [{ ...some other rules... }]
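
For example, the optionality operator marks keys that are allowed to match nothing: in strict mode, a required key with no match raises an exception (see NonMatchingNonOptionalKey below), while an optional key should simply be left out of the output. A hypothetical rule set using it:

{
    "title": "h1",
    "subtitle?": "h2.sub"
}

Here, a document without any <h2 class="sub"> element can still be parsed; only "title" is enforced.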

And a collection of extraction rules – also called a parselet, or Parsley script – looks like this:

{
    "somekey": "#someID .someclass",                        # using a CSS selector
    "anotherkey": "//sometag[@someattribute='somevalue']",  # using an XPath expression
    "nestedkey(.somelistclass)": [{                         # CSS selector for multiple elements (scope selector)
        "somenestedkey": "somenestedtag/@someattribute"     # XPath expression for an attribbute
   }]
}

And the output would be something like:

{
    "somekey": "some value inside the first element matching the CSS selector",
    "anotherkey": "some value inside the first element matching the XPath expression",
    "nestedkey: [
        {"somenestedkey": "attribute of 1st nested element"},
        {"somenestedkey": "attribute of 2nd nested element"},
        ...
        {"somenestedkey": "attribute of last nested element"}
    ]
}

parslepy is a Python implementation – using lxml and cssselect – of the Parsley extraction language defined by Kyle Maxwell and Andrew Cantino (see the parsley project and its wiki for details and the original C implementation).

Quickstart

Install

  • using pip

$ [sudo] pip install parslepy

  • using the source code

$ git clone https://github.com/redapple/parslepy.git
$ [sudo] python setup.py install

Usage

Extract the main heading of the Python.org homepage, and the first paragraph below that:

>>> import parslepy
>>> rules = {"heading": "#content h1.pageheading", "summary": "#intro > p > strong"}
>>> parslepy.Parselet(rules).parse("http://www.python.org")
{'heading': u'Python Programming Language \u2013 Official Website', 'summary': u'Python is a programming language that lets you work more quickly and integrate your systems more effectively. You can learn to use Python and see almost immediate gains in productivity and lower maintenance costs.'}
>>>

Extract a page heading and a list of item links from an HTML page given as a string:

>>> import lxml.etree
>>> import parslepy
>>> import pprint
>>> html = """
... <!DOCTYPE html>
... <html>
... <head>
...     <title>Sample document to test parslepy</title>
...     <meta http-equiv="content-type" content="text/html;charset=utf-8" />
... </head>
... <body>
... <h1 id="main">What&rsquo;s new</h1>
... <ul>
...     <li class="newsitem"><a href="/article-001.html">This is the first article</a></li>
...     <li class="newsitem"><a href="/article-002.html">A second report on something</a></li>
...     <li class="newsitem"><a href="/article-003.html">Python is great!</a> <span class="fresh">New!</span></li>
... </ul>
... </body>
... </html>"""
>>> rules = {
...      "heading": "h1#main",
...      "news(li.newsitem)": [{
...          "title": ".",
...          "url": "a/@href",
...          "fresh": ".fresh"
...      }],
... }
>>> p = parslepy.Parselet(rules)
>>> extracted = p.parse_fromstring(html)
>>> pprint.pprint(extracted)
{'heading': u'What\u2019s new',
 'news': [{'title': u'This is the first article', 'url': '/article-001.html'},
          {'title': u'A second report on something',
           'url': '/article-002.html'},
          {'fresh': u'New!',
           'title': u'Python is great! New!',
           'url': '/article-003.html'}]}

API

Parselet is the main class for extracting content from documents with parslepy.

Instantiate it with a Parsley script containing a mapping of name keys and selectors (CSS or XPath) to apply on documents or document parts.

Then, run the extraction rules by passing an HTML or XML document to extract() or parse().

The output will be a dict containing the same keys as in your Parsley script, and, depending on your selectors, values will be:

  • text serialization of matching elements
  • element attributes
  • nested lists of extracted content

class parslepy.base.Parselet(parselet, selector_handler=None, strict=False, debug=False)

Take a parselet and optional selector_handler and build an abstract representation of the Parsley extraction logic.

Two helper class methods can be used to instantiate a Parselet from JSON rules: from_jsonstring(), from_jsonfile().

Parameters:
  • parselet (dict) – Parsley script as a Python dict object
  • strict (boolean) – Set to True if you want to enforce that missing required keys raise an Exception; default is False (i.e. lenient/non-strict mode)
  • selector_handler – an optional selectors.SelectorHandler instance; defaults to an instance of selectors.DefaultSelectorHandler
Raises: InvalidKeySyntax

Example:

>>> import parslepy
>>> rules = {
...     "heading": "h1#main",
...     "news(li.newsitem)": [{
...         "title": ".",
...         "url": "a/@href"
...     }],
... }
>>> parslepy.Parselet(rules)
<parslepy.base.Parselet object at 0x164cfd0>

Use extract() or parse() to get extracted content from documents.

extract(document, context=None)

Extract values as a dict object following the structure of the Parsley script (recursive)

Parameters:
  • document – lxml-parsed document
  • context – user-supplied context that will be passed to custom XPath extensions (as first argument)
Return type: Python dict object with mapped extracted content

Raises: NonMatchingNonOptionalKey

>>> import lxml.etree
>>> import parslepy
>>> html = '''
... <!DOCTYPE html>
... <html>
... <head>
...     <title>Sample document to test parslepy</title>
...     <meta http-equiv="content-type" content="text/html;charset=utf-8" />
... </head>
... <body>
... <h1 id="main">What&rsquo;s new</h1>
... <ul>
...     <li class="newsitem"><a href="/article-001.html">This is the first article</a></li>
...     <li class="newsitem"><a href="/article-002.html">A second report on something</a></li>
...     <li class="newsitem"><a href="/article-003.html">Python is great!</a> <span class="fresh">New!</span></li>
... </ul>
... </body>
... </html>
... '''
>>> html_parser = lxml.etree.HTMLParser()
>>> doc = lxml.etree.fromstring(html, parser=html_parser)
>>> doc
<Element html at 0x7f5fb1fce9b0>
>>> rules = {
...     "headingcss": "#main",
...     "headingxpath": "//h1[@id='main']"
... }
>>> p = parslepy.Parselet(rules)
>>> p.extract(doc)
{'headingcss': u'What’s new', 'headingxpath': u'What’s new'}

classmethod from_jsonfile(fp, selector_handler=None, strict=False, debug=False)

Create a Parselet instance from a file containing the Parsley script as a JSON object

>>> import parslepy
>>> with open('parselet.json') as fp:
...     parslepy.Parselet.from_jsonfile(fp)
...
<parslepy.base.Parselet object at 0x2014e50>
Parameters: fp (file) – an open file-like pointer containing the Parsley script
Return type: Parselet

Other arguments: same as for the Parselet constructor
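
For reference, a hypothetical parselet.json for the quickstart example above could contain:

{
    "heading": "h1#main",
    "news(li.newsitem)": [{
        "title": ".",
        "url": "a/@href"
    }]
}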

classmethod from_jsonstring(s, selector_handler=None, strict=False, debug=False)

Create a Parselet instance from s (str) containing the Parsley script as JSON

>>> import parslepy
>>> parsley_string = '{ "title": "h1", "link": "a @href"}'
>>> parslepy.Parselet.from_jsonstring(parsley_string)
<parslepy.base.Parselet object at 0x183a050>
>>>
Parameters: s (string) – a Parsley script as a JSON string
Return type: Parselet

Other arguments: same as for the Parselet constructor

parse(fp, parser=None, context=None)

Parse an HTML or XML document and return the extracted object following the Parsley rules given at instantiation.

Parameters:
  • fp – file-like object containing an HTML or XML document, or URL or filename
  • parser – lxml.etree._FeedParser instance (optional); defaults to lxml.etree.HTMLParser()
  • context – user-supplied context that will be passed to custom XPath extensions (as first argument)
Return type: Python dict object with mapped extracted content

Raises: NonMatchingNonOptionalKey

To parse from a string, use the parse_fromstring() method instead.

Note that the fp parameter is passed directly to lxml.etree.parse, so you can also give it a URL and lxml will download the document for you. (Also see http://lxml.de/tutorial.html#the-parse-function.)
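
For instance, a quick sketch of both call styles (page.html is a hypothetical local file):

>>> import parslepy
>>> rules = {"title": "title"}
>>> with open('page.html') as fp:       # local file object
...     parslepy.Parselet(rules).parse(fp)
...
>>> parslepy.Parselet(rules).parse('http://www.python.org')  # URL, fetched by lxml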

parse_fromstring(s, parser=None, context=None)

Parse an HTML or XML document given as a string and return the extracted object following the Parsley rules given at instantiation.

Parameters:
  • s (string) – an HTML or XML document as a string
  • parser – lxml.etree._FeedParser instance (optional); defaults to lxml.etree.HTMLParser()
  • context – user-supplied context that will be passed to custom XPath extensions (as first argument)
Return type: Python dict object with mapped extracted content

Raises: NonMatchingNonOptionalKey

Customizing

You can use a Parselet directly with its default configuration, which should work fine for HTML documents when the content you want to extract can be accessed through regular CSS3 selectors or XPath 1.0 expressions.

But you can also customize how selectors are interpreted by sub-classing SelectorHandler and passing an instance of your selector handler to the Parselet constructor.

class parslepy.selectors.Selector(selector)

Class of objects returned by the make() method of SelectorHandler instances (and subclasses).

class parslepy.selectors.SelectorHandler(debug=False)

Called when building abstract Parsley trees and when extracting object values during the actual parsing of documents.

This should be subclassed to implement the selector processing logic you need for your Parsley handling.

All 3 methods, make(), select() and extract(), MUST be overridden.

extract(document, selector)

Apply the selector on the document and return a value for the matching elements (text content or element attributes)

Parameters:
  • document – lxml-parsed document
  • selector – input Selector to apply on the document
Return type: depends on the selector (string, boolean value, ...)

Return value can be single- or multi-valued.

make(selection_string)

Interpret a selection_string as a selector for elements or element attributes in a (semi-)structured document. In case of XPath selectors, this can also be a function call.

Parameters: selection_string – a string representing a selector
Return type: Selector

select(document, selector)

Apply the selector on the document

Parameters:
  • document – lxml-parsed document
  • selector – input Selector to apply on the document
Return type: lxml.etree.Element list
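
To make this contract concrete, here is a minimal, hypothetical handler that treats every selection string as an XPath expression; the attribute holding the compiled expression on Selector objects is an assumption here:

>>> import lxml.etree
>>> from parslepy.selectors import Selector, SelectorHandler
>>> class XPathOnlyHandler(SelectorHandler):
...     # hypothetical: compile every selection string as XPath
...     def make(self, selection_string):
...         return Selector(lxml.etree.XPath(selection_string))
...     def select(self, document, selector):
...         # assumes the compiled expression is stored as selector.selector
...         return selector.selector(document)
...     def extract(self, document, selector):
...         selected = self.select(document, selector)
...         if selected:
...             return selected[0].text
...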

class parslepy.selectors.XPathSelectorHandler(namespaces=None, extensions=None, context=None, debug=False)

This selector handler only accepts XPath selectors.

It understands what lxml.etree.XPath understands, that is, XPath 1.0 expressions.

Parameters:
  • namespaces – namespace mapping as dict
  • extensions – extension dict
  • context – user-context passed to XPath extension functions

namespaces and extensions dicts should have the same format as for lxml: see http://lxml.de/xpathxslt.html#namespaces-and-prefixes and http://lxml.de/extensions.html#xpath-extension-functions

Extension functions have a slightly different signature than pure-lxml extension functions: they must expect a user-context as first argument; all other arguments are the same as for lxml extensions.

context will be passed as first argument to extension functions registered through extensions. Alternative: user-context can also be passed to parslepy.base.Parselet.parse()

class parslepy.selectors.DefaultSelectorHandler(namespaces=None, extensions=None, context=None, debug=False)

Default selector logic, loosely based on the original Parsley implementation.

This handler understands what cssselect and lxml.etree.XPath understand, that is (roughly) XPath 1.0 and CSS3, for things that don’t need browser context.

Parameters:
  • namespaces – namespace mapping as dict
  • extensions – extension dict
  • context – user-context passed to XPath extension functions

namespaces and extensions dicts should have the same format as for lxml: see http://lxml.de/xpathxslt.html#namespaces-and-prefixes and http://lxml.de/extensions.html#xpath-extension-functions

Extension functions have a slightly different signature than pure-lxml extension functions: they must expect a user-context as first argument; all other arguments are the same as for lxml extensions.

context will be passed as first argument to extension functions registered through extensions. Alternative: user-context can also be passed to parslepy.base.Parselet.parse()

Example with iTunes RSS feed:

>>> import lxml.etree
>>> import parslepy
>>> xml_parser = lxml.etree.XMLParser()
>>> url = 'http://itunes.apple.com/us/rss/topalbums/limit=10/explicit=true/xml'
>>>
>>> # register Atom and iTunes namespaces with prefixes "atom" and "im"
... # with a custom SelectorHandler
... xsh = parslepy.XPathSelectorHandler(
...     namespaces={
...         'atom': 'http://www.w3.org/2005/Atom',
...         'im': 'http://itunes.apple.com/rss'
...     })
>>>
>>> # use prefixes to target elements in the XML document
>>> rules = {
...     "entries(//atom:feed/atom:entry)": [
...         {
...             "title": "atom:title",
...             "name": "im:name",
...             "id": "atom:id/@im:id",
...             "artist(im:artist)": {
...                 "name": ".",
...                 "href": "@href"
...             },
...             "images(im:image)": [{
...                 "height": "@height",
...                 "url": "."
...             }],
...             "releasedate": "im:releaseDate"
...         }
...     ]
... }
>>> parselet = parslepy.Parselet(rules, selector_handler=xsh)
>>> parselet.parse(url, parser=xml_parser)
{'entries': [{'name': u'Born Sinner (Deluxe Version)', ...

Exceptions

exception parslepy.base.InvalidKeySyntax

Raised when the input Parsley script’s syntax is invalid

>>> import parslepy
>>> p = parslepy.Parselet({"heading@": "#main"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parslepy/base.py", line 325, in __init__
    self.compile()
  File "parslepy/base.py", line 393, in compile
    self.parselet_tree = self._compile(self.parselet)
  File "parslepy/base.py", line 432, in _compile
    raise InvalidKeySyntax("Key %s is not valid" % k)
InvalidKeySyntax: Key heading@ is not valid

exception parslepy.base.NonMatchingNonOptionalKey

Raised by a Parselet instance while extracting content in strict mode, when a required key does not yield any content.

>>> import parslepy
>>> html = '''
... <!DOCTYPE html>
... <html>
... <head>
...     <title>Sample document to test parslepy</title>
...     <meta http-equiv="content-type" content="text/html;charset=utf-8" />
... </head>
... <body>
... <h1 id="main">What&rsquo;s new</h1>
... <ul>
...     <li class="newsitem"><a href="/article-001.html">This is the first article</a></li>
...     <li class="newsitem"><a href="/article-002.html">A second report on something</a></li>
...     <li class="newsitem"><a href="/article-003.html">Python is great!</a> <span class="fresh">New!</span></li>
... </ul>
... </body>
... </html>
... '''
>>> rules = {
...     "heading1": "h1#main",
...     "heading2": "h2#main",
... }
>>> p = parslepy.Parselet(rules, strict=True)
>>> p.parse_fromstring(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parslepy/base.py", line 501, in extract
    return self._extract(self.parselet_tree, document)
  File "parslepy/base.py", line 582, in _extract
    document.getroottree().getpath(document),v
NonMatchingNonOptionalKey: key "heading2" is required but yield nothing
Current path: /html/(<Selector: inner=<CSSSelector 20a2758 for 'h2#main'>>)

Extension functions

parslepy extends XPath 1.0 with extension functions, registered through lxml‘s XPath extension mechanism. See http://lxml.de/extensions.html for details.

Built-in extensions

parslepy comes with a few XPath extension functions. These functions are available by default when you use XPathSelectorHandler or DefaultSelectorHandler.

  • parslepy:text(xpath_expression[, false()]): returns the text content of elements matching xpath_expression. The optional boolean second parameter indicates whether tail content should be included; it defaults to true(). Use the true() and false() XPath functions, not the bare tokens true or false (or 1 or 0). (Internally, this calls lxml.etree.tostring(..., method="text", encoding=unicode).)

    >>> import parslepy
    >>> doc = """<!DOCTYPE html>
    ... <html>
    ... <head>
    ...     <title>Some page title</title>
    ... </head>
    ...
    ... <body>
    ...     <h1>Some heading</h1>
    ...
    ...     Some text
    ...
    ...     <p>
    ...     Some paragraph
    ...     </p>
    ... </body>
    ...
    ... </html>"""
    >>> rules = {"heading": "h1"}
    >>>
    >>> # default text extraction includes tail text
    ... parslepy.Parselet(rules).parse_fromstring(doc)
    {'heading': u'Some heading Some text'}
    >>>
    >>> # 2nd argument to false means without tail text
    ... rules = {"heading": "parslepy:text(//h1, false())"}
    >>> parslepy.Parselet(rules).parse_fromstring(doc)
    {'heading': 'Some heading'}
    >>>
    >>> # 2nd argument to true is equivalent to default text extraction
    >>> rules = {"heading": "parslepy:text(//h1, true())"}
    >>> parslepy.Parselet(rules).parse_fromstring(doc)
    {'heading': 'Some heading Some text'}
    >>>
    

    See http://lxml.de/tutorial.html#elements-contain-text for details on how lxml handles text and tail element properties

  • parslepy:textnl(xpath_expression): similar to parslepy:text() but appends \n characters to HTML block elements such as <br>, <hr>, <div>

    >>> import parslepy
    >>> doc = """<!DOCTYPE html>
    ... <html>
    ... <head>
    ...     <title>Some page title</title>
    ... </head>
    ... <body>
    ... <h1>Some heading</h1><p>Some paragraph<div>with some span inside</div>ending now.</p>
    ... </body>
    ... </html>
    ... """
    >>> parslepy.Parselet({"heading": "parslepy:text(//body)"}).parse_fromstring(doc)
    {'heading': 'Some headingSome paragraphwith some span insideending now.'}
    >>>
    >>> parslepy.Parselet({"heading": "parslepy:textnl(//body)"}).parse_fromstring(doc)
    {'heading': 'Some heading\nSome paragraph\nwith some span inside\nending now.'}
    >>>
    
  • parslepy:html(xpath_expression) returns the HTML content for elements matching xpath_expression. Internally, this calls lxml.html.tostring(element).

    >>> import parslepy
    >>> doc = """<!DOCTYPE html>
    ... <html>
    ... <head>
    ...     <title>Some page title</title>
    ... </head>
    ... <body>
    ... <h1>(Some heading)</h1>
    ... <h2>[some sub-heading]</h2>
    ... </body>
    ... </html>
    ... """
    >>> parslepy.Parselet({"heading": "parslepy:html(//h1)"}).parse_fromstring(doc)
    {'heading': '<h1>(Some heading)</h1>'}
    >>> parslepy.Parselet({"heading": "parslepy:html(//body)"}).parse_fromstring(doc)
    {'heading': '<body>\n<h1>(Some heading)</h1>\n<h2>[some sub-heading]</h2>\n</body>'}
    >>>
    
  • parslepy:xml(xpath_expression) returns the XML content for elements matching xpath_expression. Internally, this calls lxml.etree.tostring(element).
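
    For instance, a minimal sketch on a small XML document (hypothetical input; exact output formatting may differ):

    >>> import lxml.etree
    >>> import parslepy
    >>> xml = '<items><item id="1">first</item></items>'
    >>> rules = {"raw": "parslepy:xml(//item)"}
    >>> parslepy.Parselet(rules).parse_fromstring(xml, parser=lxml.etree.XMLParser())
    {'raw': '<item id="1">first</item>'}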

  • parslepy:strip(xpath_expression[, chars]) behaves like Python’s strip() string method, but applied to the text content of elements matching xpath_expression. See http://docs.python.org/2/library/string.html#string.strip

    >>> import parslepy
    >>> doc = """<!DOCTYPE html>
    ... <html>
    ... <head>
    ...     <title>Some page title</title>
    ... </head>
    ... <body>
    ... <h1>(Some heading)</h1>
    ... <h2>[some sub-heading]</h2>
    ... </body>
    ... </html>
    ... """
    >>> parslepy.Parselet({"heading": "parslepy:strip(//h2, '[')"}).parse_fromstring(doc)
    {'heading': 'some sub-heading]'}
    >>> parslepy.Parselet({"heading": "parslepy:strip(//h2, ']')"}).parse_fromstring(doc)
    {'heading': '[some sub-heading'}
    >>> parslepy.Parselet({"heading": "parslepy:strip(//h2, '[]')"}).parse_fromstring(doc)
    {'heading': 'some sub-heading'}
    >>> parslepy.Parselet({"heading": "parslepy:strip(//h1, '()')"}).parse_fromstring(doc)
    {'heading': 'Some heading'}
    >>>
    
  • parslepy:attrname(xpath_expression_matching_attribute) returns the name of an attribute. This works with the catch-all-attributes @* expression or with a specific attribute expression like @class. It may sound like a useless extension, but it can be handy when combined with the simple @* selector, as in the example below:

    >>> import parslepy
    >>> img_attributes = {
    ...     "images(img)": [{
    ...         "attr_names": ["parslepy:attrname(@*)"],
    ...         "attr_vals": ["@*"],
    ...     }]
    ... }
    >>> extracted = parslepy.Parselet(img_attributes).parse('http://www.python.org')
    >>> for r in extracted["images"]:
    ...     print dict(zip(r.get("attr_names"), r.get("attr_vals")))
    ...
    {'src': '/images/python-logo.gif', 'alt': 'homepage', 'border': '0', 'id': 'logo'}
    {'src': '/images/trans.gif', 'alt': 'skip to navigation', 'border': '0', 'id': 'skiptonav'}
    {'src': '/images/trans.gif', 'alt': 'skip to content', 'border': '0', 'id': 'skiptocontent'}
    {'width': '116', 'alt': '', 'src': '/images/donate.png', 'title': '', 'height': '42'}
    {'width': '94', 'style': 'align:center', 'src': '/images/worldmap.jpg', 'alt': '[Python resources in languages other than English]', 'height': '46'}
    {'src': '/images/success/Carmanah.png', 'alt': 'success story photo', 'class': 'success'}
    

User-defined extensions

parslepy also lets you define your own XPath extensions, just like lxml does, except that the function you register must accept a user-supplied context object as its first argument. Subsequent arguments to your extension function are the same as for lxml extensions, i.e. an XPath context, followed by matching elements and whatever additional parameters your XPath call passes.

The user-supplied context should be passed to parslepy.base.Parselet.parse(), or globally, to an XPathSelectorHandler subclass instance used to instantiate the Parselet.

Let’s illustrate this with a custom extension to make <img> @src attributes “absolute”.

Suppose we already have an extraction rule that outputs the @src attributes from <img> tags on the Python.org homepage:

>>> import parslepy
>>> import pprint
>>> parselet = parslepy.Parselet({"img_abslinks": ["//img/@src"]})
>>> pprint.pprint(parselet.parse('http://www.python.org'))
{'img_abslinks': ['/images/python-logo.gif',
                  '/images/trans.gif',
                  '/images/trans.gif',
                  '/images/donate.png',
                  '/images/worldmap.jpg',
                  '/images/success/afnic.fr.png']}

We now want to generate full URLs for these images, not relative to http://www.python.org.

First we need to define our extension function as a Python function: parslepy’s extension functions must accept a user-context as first argument, then should expect an XPath context, followed by elements or strings matching the XPath expression, and finally whatever other parameters are passed to the function call in extraction rules.

In our example, we expect @src attribute values as input from XPath, and combine them with a base URL (via urlparse.urljoin()), the URL from which the HTML document was fetched. The base URL will be passed as user-context, and we will receive it as first argument. So the Python extension function may look like this:

>>> import urlparse
>>> def absurl(ctx, xpctx, attributes, *args):
...         # user-context "ctx" will be the URL of the page
...         return [urlparse.urljoin(ctx, u) for u in attributes]
...

Then, we need to register this function with parslepy through a custom selector handler, with a custom namespace and its prefix:

>>> # choose a prefix and namespace, e.g. "myext" and "local-extensions"
... mynamespaces = {
...         "myext": "local-extensions"
...     }
>>> myextensions = {
...         ("local-extensions", "absurl"): absurl,
...     }
>>>
>>> import parslepy
>>> sh = parslepy.DefaultSelectorHandler(
...         namespaces=mynamespaces,
...         extensions=myextensions)
>>>

Now we can use this absurl() XPath extension within parslepy rules, with the “myext” prefix (do not forget to pass your selector handler to your Parselet instance):

>>> rules = {"img_abslinks": ["myext:absurl(//img/@src)"]}
>>> parselet = parslepy.Parselet(rules, selector_handler=sh)

And finally, run the extraction rules on Python.org’s homepage again, with a context argument set to the URL

>>> import pprint
>>> pprint.pprint(parselet.parse('http://www.python.org',
...         context='http://www.python.org'))
{'img_abslinks': ['http://www.python.org/images/python-logo.gif',
                  'http://www.python.org/images/trans.gif',
                  'http://www.python.org/images/trans.gif',
                  'http://www.python.org/images/donate.png',
                  'http://www.python.org/images/worldmap.jpg',
                  'http://www.python.org/images/success/afnic.fr.png']}
>>>

In this case, it may feel odd to have to pass the URL twice, but parse(URL) does not store the URL anywhere; it only processes the HTML stream from the page.
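
If you prefer to define the URL only once, you can fetch the document yourself and call parse_fromstring() with the URL as context; here is a sketch reusing rules, sh and pprint from the session above (urllib2 is Python 2’s standard fetcher):

>>> import urllib2
>>> url = 'http://www.python.org'
>>> html = urllib2.urlopen(url).read()
>>> parselet = parslepy.Parselet(rules, selector_handler=sh)
>>> pprint.pprint(parselet.parse_fromstring(html, context=url))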

Changelog

Version 0.2.0 - August 5, 2013

  • Improvements:
    • Support XPath namespace prefixes (namespace:element) and CSS namespace prefixes (namespace|element) in DefaultSelectorHandler
    • new built-in extension function parslepy:strip() mapped to Python’s strip() for strings
    • new built-in extension function parslepy:attrname() that takes an attribute selector and returns the attribute’s name
    • support for user-defined extension functions, which take an additional context parameter when called (context is passed either at selector handler instantiation or when calling Parselet.parse())
    • use smart_strings=False for XPath compiled expressions, except for user-defined extensions and some built-in extensions (see http://lxml.de/xpathxslt.html#xpath-return-values)
  • Bug fixes:
    • #2: XPath namespace prefixes raise cssselect.xpath.ExpressionError with DefaultSelectorHandler
    • #3: Docs suggest using *.js files when they are JSON documents
    • #4: The example usage should not have both url_css and url_xpath
    • #5: In example usage, skip lines between “configuration” and “execution”
    • #6: add underscore to _version__
    • #7: Empty result set on boolean or numerical selectors

Version 0.1.2 - July 9, 2013

  • Bug fixes:
    • #1: headingxpath rule does not seem to work as expected

Version 0.1.1 - July 3, 2013

  • Docstrings added to main classes and methods.
  • Added parse_fromstring() method to Parselet
  • Added tests for Parselet.parse() and Parselet.parse_fromstring()

Version 0.1 - June 30, 2013

Initial release