Projects

At Scrapinghub we maintain and contribute to a wide variety of open source projects. See below for a list of projects we can mentor this year.

Scrapy

Widely used web crawling and scraping framework for Python, used to write spiders that crawl websites and extract data from them.

Ideas Contribute

Splash

Headless-browser framework for web crawling and scraping, specifically designed to act as an accessory for Scrapy crawlers, though it can be used as a stand-alone tool as well.

Contribute

Scrapy-Splash

Scrapy plugin for transparent integration with Splash.

Contribute

ELI5

Python package that helps debug machine learning classifiers and explain their predictions. It supports scikit-learn, xgboost, LightGBM, lightning, and sklearn-crfsuite out of the box, and it can also explain classifiers outside this set by treating them as black boxes.

Contribute

Dateparser

Python library to easily parse localized dates in almost any string format commonly found on web pages.

Ideas Contribute

Parsel

Python library to extract data from HTML and XML using XPath and CSS selectors.

Ideas Contribute

Extruct

Python library for extracting embedded metadata from HTML markup.

Contribute

w3lib

Python library of web-related functions.

Contribute

Spidermon

Python quality-assurance framework for Scrapy spiders. It lets spider developers define and enforce rules for data schema and field coverage, and it can be extended to cover broader crawl-verification and data-validation needs.

Contribute

cssselect

Python library to translate CSS3 selectors into XPath 1.0 expressions.

Ideas Contribute
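
To give a feel for what translating CSS selectors into XPath involves, here is a minimal standard-library sketch. It is not cssselect's API or implementation: the function name `simple_css_to_xpath` is hypothetical, and it only handles tag names, `#id`, `.class`, and the descendant (space) combinator.

```python
import re


def simple_css_to_xpath(selector: str) -> str:
    """Translate a tiny CSS subset (tag, #id, .class, descendant) to XPath 1.0.

    Hypothetical helper for illustration only; cssselect itself covers far
    more of the CSS3 selector grammar.
    """
    steps = []
    for compound in selector.split():
        # One compound selector: optional tag name followed by #id / .class parts.
        match = re.fullmatch(r"([a-zA-Z][\w-]*|\*)?((?:[#.][\w-]+)*)", compound)
        if match is None:
            raise ValueError(f"unsupported selector part: {compound!r}")
        tag = match.group(1) or "*"
        predicates = []
        for kind, name in re.findall(r"([#.])([\w-]+)", match.group(2)):
            if kind == "#":
                predicates.append(f"[@id='{name}']")
            else:
                # Match one class token inside a space-separated class attribute.
                predicates.append(
                    "[contains(concat(' ', normalize-space(@class), ' '), "
                    f"' {name} ')]"
                )
        steps.append(tag + "".join(predicates))
    if not steps:
        raise ValueError("empty selector")
    # A space in CSS means "any descendant", which maps to the descendant axis.
    return "descendant-or-self::" + "/descendant::".join(steps)


print(simple_css_to_xpath("div#main"))
print(simple_css_to_xpath("ul li"))
```

The class predicate pads the attribute with spaces so that `.note` matches `class="big note"` but not `class="footnote"`.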

queuelib

Collection of persistent, disk-based queues for Python.

Contribute
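
As a rough illustration of what "persistent, disk-based" means for a queue, the sketch below keeps a FIFO queue in a single file so that its contents survive reopening. It uses only the standard library and is not queuelib's API or on-disk format; the class name `TinyDiskQueue` and its file layout are assumptions made for this example.

```python
import os
import struct
import tempfile


class TinyDiskQueue:
    """Hypothetical single-file FIFO queue (illustration, not queuelib)."""

    HEADER = struct.Struct(">Q")  # byte offset of the next unread record
    LENGTH = struct.Struct(">I")  # length prefix stored before each record

    def __init__(self, path: str) -> None:
        self.path = path
        if not os.path.exists(path):
            # A new file starts with a header pointing just past itself.
            with open(path, "wb") as f:
                f.write(self.HEADER.pack(self.HEADER.size))

    def push(self, data: bytes) -> None:
        # Records are appended at the end of the file.
        with open(self.path, "ab") as f:
            f.write(self.LENGTH.pack(len(data)) + data)

    def pop(self):
        """Return the oldest record, or None when the queue is empty."""
        with open(self.path, "r+b") as f:
            (head,) = self.HEADER.unpack(f.read(self.HEADER.size))
            f.seek(head)
            prefix = f.read(self.LENGTH.size)
            if len(prefix) < self.LENGTH.size:
                return None
            (size,) = self.LENGTH.unpack(prefix)
            data = f.read(size)
            # Advance the stored head so the record is not returned again.
            f.seek(0)
            f.write(self.HEADER.pack(head + self.LENGTH.size + size))
            return data


path = os.path.join(tempfile.mkdtemp(), "demo.queue")
queue = TinyDiskQueue(path)
queue.push(b"first")
queue.push(b"second")

reopened = TinyDiskQueue(path)  # simulates a restart of the process
print(reopened.pop(), reopened.pop(), reopened.pop())
```

This sketch never reclaims the space of consumed records and does no locking, both of which a production queue would need to address.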

price-parser

Python library for extracting price and currency from raw text strings.

Contribute
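
The kind of task price-parser addresses can be sketched with a regular expression. The example below is a standard-library approximation, not price-parser's API: the function `extract_price` is hypothetical, it only recognises a currency written before the amount, and it assumes `,` as the thousands separator and `.` as the decimal point.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Assumption: a short, fixed list of currency markers placed before the amount.
PRICE_RE = re.compile(
    r"(?P<currency>[$€£¥]|USD|EUR|GBP)?\s*(?P<amount>\d[\d,]*(?:\.\d+)?)"
)


def extract_price(text: str) -> Optional[Tuple[Decimal, Optional[str]]]:
    """Return (amount, currency) for the first price-like number in text."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting to an exact decimal.
    amount = Decimal(match.group("amount").replace(",", ""))
    return amount, match.group("currency")


print(extract_price("Price: $1,299.99 incl. tax"))
print(extract_price("EUR 15"))
print(extract_price("no price here"))
```

Real-world price strings vary much more than this (European separators, trailing currency codes, ranges), which is the gap a dedicated library is meant to fill.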

HTML to Text

Python library for extracting text from HTML.

Contribute
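
For a sense of what extracting text from HTML involves, here is a minimal sketch built on the standard library's `html.parser`. It is not the HTML to Text library's API; the names `TextExtractor` and `html_to_plain_text` are hypothetical, and the sketch ignores layout (paragraph breaks, lists, tables) entirely.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text chunks and collapse all runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


sample = "<h1>Title</h1><script>var x = 1;</script><p>Hello &amp; <b>world</b></p>"
print(html_to_plain_text(sample))
```

Because chunks are joined with a space, inline markup inside a word (for example `wor<b>ld</b>`) would be split; handling such cases well is part of what a dedicated library provides.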