Ideas

This is a list of ideas for student applications.

For more information about an idea, join the discussion in the corresponding GitHub issue.

Scrapy

Static Analysis Tooling

Easy
Description

Some common mistakes in code that uses Scrapy are hard to detect, for example a typo in the name of a setting.

Expected Result

Build a list of common issues in code using Scrapy that could be detected using static code analysis, and build a tool or extend an existing tool to detect those.
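As an illustration of the kind of check such a tool could perform, here is a minimal sketch that uses Python's `ast` module to flag unknown keys in a `custom_settings` dictionary. The `KNOWN_SETTINGS` subset and the `find_setting_typos` helper are assumptions made for this example, not part of Scrapy; a real tool would need the full list of settings and would likely cover many more kinds of issue.

```python
import ast
import difflib

# Hypothetical subset of setting names; a real tool would load the full list.
KNOWN_SETTINGS = ["CONCURRENT_REQUESTS", "DOWNLOAD_DELAY", "ROBOTSTXT_OBEY", "USER_AGENT"]


def find_setting_typos(source):
    """Return (line, name, suggestion) for unknown keys in custom_settings dicts."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "custom_settings" not in targets or not isinstance(node.value, ast.Dict):
            continue
        for key in node.value.keys:
            # Only literal string keys can be checked statically.
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                name = key.value
                if name not in KNOWN_SETTINGS:
                    close = difflib.get_close_matches(name, KNOWN_SETTINGS, n=1)
                    issues.append((key.lineno, name, close[0] if close else None))
    return issues


spider_source = '''
class MySpider:
    custom_settings = {"DOWNLOAD_DELY": 2, "USER_AGENT": "example"}
'''
for line, name, suggestion in find_setting_typos(spider_source):
    print(f"line {line}: unknown setting {name!r}, did you mean {suggestion!r}?")
```

Working on the syntax tree rather than on raw text keeps the check robust to formatting differences, at the cost of only seeing literal keys.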

Required Skills Regular Expressions
Mentors Julio, Adrian
GitHub Issue #4421

Feed Exports Improvements

Intermediate
Description

While useful, the current implementation of feed exports can be lacking in some aspects.

Expected Result

Extend feed exports with new features, such as batch deliveries, split deliveries or multiple exporters.
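To make "batch deliveries" concrete, the following sketch groups a stream of items into fixed-size batches, each of which could then be written to its own output file. The `split_into_batches` helper is hypothetical and independent of Scrapy's actual feed export code.

```python
from itertools import islice


def split_into_batches(items, batch_size):
    """Yield lists of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Each batch could be delivered as a separate file.
for number, batch in enumerate(split_into_batches(range(5), 2), start=1):
    print(f"batch {number}: {batch}")
```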

Required Skills File Handling, Networking
Mentors Julio, Adrian
GitHub Issue #4250

MIME Sniffing Library

Intermediate
Description

HTTP responses should include a Content-Type header that indicates the MIME type of the response body. However, responses do not always include such a header, and sometimes they include it but the specified MIME type does not really match the response body.

Expected Result

Create a Python library that implements the complete MIME Sniffing Standard.
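The core idea of sniffing can be sketched as matching the first bytes of a body against known file signatures. The sketch below is not the standard's algorithm: it covers only a handful of well-known signatures, and the `sniff_mime_type` function and its fallback behavior are assumptions made for illustration. The standard defines many more patterns and precise rules for when a declared type may be overridden.

```python
# A few well-known file signatures (magic numbers) and their MIME types.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime_type(body, declared=None):
    """Guess a MIME type from leading bytes, falling back to the declared type."""
    for signature, mime_type in SIGNATURES:
        if body.startswith(signature):
            return mime_type
    return declared or "application/octet-stream"


# A PNG body served with a misleading Content-Type header.
print(sniff_mime_type(b"\x89PNG\r\n\x1a\n\x00\x00", declared="text/html"))
```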

Stretch Goals

Integrate the resulting library into Scrapy.

Required Skills HTTP, Interface Design
Mentors Julio, Adrian
GitHub Issue #4240

General Message Queues as Storage for Requests

Intermediate
Description

Using external message queues as storage for Scrapy requests is a common request, and several implementations exist. Until recently, integrating queues that are neither disk-based nor memory-based into Scrapy required a separate scheduler, so improvements to the Scrapy scheduler were not shared across these implementations. Adding different types of queues is much easier now, and such support requires less maintenance.

Expected Result

Create a Python library that connects Scrapy with different message queues. Redis support is a must-have; others are optional.
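One possible shape for such a library is a thin queue class that serializes requests and delegates storage to a pluggable backend. Everything in the sketch below is an assumption made for illustration: requests are represented as plain dictionaries, `InMemoryBackend` stands in for a real message queue client, and the `put`/`get`/`size` backend interface is hypothetical.

```python
import json
from collections import deque


class InMemoryBackend:
    """Stand-in for an external message queue client (hypothetical interface)."""

    def __init__(self):
        self._items = deque()

    def put(self, payload):
        self._items.append(payload)

    def get(self):
        return self._items.popleft() if self._items else None

    def size(self):
        return len(self._items)


class ExternalRequestQueue:
    """Queue that serializes requests and stores them in a backend."""

    def __init__(self, backend):
        self.backend = backend

    def push(self, request):
        # Requests are plain dicts here; real requests need richer serialization.
        self.backend.put(json.dumps(request).encode("utf-8"))

    def pop(self):
        payload = self.backend.get()
        return None if payload is None else json.loads(payload.decode("utf-8"))

    def __len__(self):
        return self.backend.size()

    def close(self):
        pass  # A real backend would release its connection here.


queue = ExternalRequestQueue(InMemoryBackend())
queue.push({"url": "https://example.com/", "method": "GET"})
print(len(queue), queue.pop())
```

Keeping serialization in the queue class and storage in the backend means that supporting another message queue only requires a new backend.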

Required Skills Message Queues
Mentors Nikita, Adrian
GitHub Issue #4326

HTTP/2 Support

Advanced
Description

Implement HTTP/2 support to future-proof and possibly accelerate Scrapy.

Expected Result

An HTTP handler that can gracefully upgrade to HTTP/2 where possible, and take advantage of the compression and efficiency gains of the new protocol.
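Over TLS, HTTP/2 is commonly negotiated through the ALPN extension, with a fallback to HTTP/1.1 when the server does not offer it. The sketch below shows that negotiation with Python's standard `ssl` module only; Scrapy itself builds on Twisted, so this is an illustration of the upgrade logic under that simplification, not a proposed implementation. The `make_alpn_context` and `pick_protocol` helpers are hypothetical.

```python
import ssl


def make_alpn_context():
    """Create a client TLS context that offers HTTP/2 first, then HTTP/1.1."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(["h2", "http/1.1"])
    return context


def pick_protocol(negotiated):
    """Map the ALPN result (a string or None) to the protocol to speak."""
    return "HTTP/2" if negotiated == "h2" else "HTTP/1.1"


# After a handshake, an SSLSocket reports the result via selected_alpn_protocol(),
# which returns None when the server did not select any offered protocol.
print(pick_protocol("h2"), pick_protocol(None))
```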

Required Skills HTTP, Twisted
Mentors Andrey, Adrian
GitHub Issue #1854

Make Scrapy Jupyter-friendly

Advanced
Description

Using Scrapy in Jupyter Notebook is not the most straightforward experience. Yet working with Scrapy there could become a great experience if we took advantage of the features that Jupyter provides for user interaction.

Expected Result

Make it possible to develop Scrapy spiders interactively and visually inside Jupyter Notebook.

Required Skills JavaScript, HTML, Interface Design, Security
Mentors Andrey, Adrian
GitHub Issue #4299

Dateparser

Performance Optimizations

Intermediate
Description

We believe there is much room for improvement in the performance of Dateparser. Moreover, the current implementation is not thread-safe.

Expected Result

Profile and optimize the performance of the library.
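A first profiling pass could use the standard library's `cProfile` and `pstats` modules. In the sketch below, the `profile_call` helper is hypothetical and `slow_parse` is a stand-in workload, since the goal is only to show the measurement workflow; in practice the profiled call would be a Dateparser entry point such as `dateparser.parse`.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time to surface the most expensive call paths.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def slow_parse(text):
    """Stand-in workload; replace with the function under investigation."""
    return sorted(text.split() * 1000)


_, report = profile_call(slow_parse, "5 November 2019")
print(report)
```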

Stretch Goals

Make the library thread-safe.

Required Skills Profiling, Algorithms, Data Structures, Multithreading
Mentors Marc, Kishan
GitHub Issue #624

Better Language Detection

Intermediate
Description

Currently, language detection is rudimentary and often causes dates to be interpreted incorrectly.

Expected Result

Improve how language detection works. Plugging in an optional language detection library is an option.

Required Skills Natural Language Processing
Mentors Marc, Kishan
GitHub Issue #612

Number Parser

Intermediate
Description

Date strings found on the internet sometimes include natural language numerals, for example “Fifth of November”. We need to be able to parse such dates.

Expected Result

Create a Python library, or Python bindings for an existing library, that allow transforming natural language numerals into numbers, designed with multiple language support and performance in mind.
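To give a sense of the problem, here is a minimal English-only sketch that converts cardinal and ordinal words below one million into integers. The `parse_number` function and its vocabulary tables are assumptions made for this example; a real library would need per-language data and far more careful handling of grammar.

```python
UNITS = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {word: (index + 2) * 10 for index, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
# Ordinals that are not simply the cardinal word plus "th".
IRREGULAR_ORDINALS = {"first": "one", "second": "two", "third": "three",
                      "fifth": "five", "eighth": "eight", "ninth": "nine",
                      "twelfth": "twelve"}


def _to_cardinal(token):
    """Reduce an ordinal word to its cardinal form; leave other words as-is."""
    if token in IRREGULAR_ORDINALS:
        return IRREGULAR_ORDINALS[token]
    if token.endswith("ieth"):
        return token[:-4] + "y"  # twentieth -> twenty
    if token.endswith("th"):
        return token[:-2]  # fourth -> four, hundredth -> hundred
    return token


def parse_number(text):
    """Convert English number words (below one million) into an integer."""
    total = current = 0
    for raw in text.lower().replace("-", " ").split():
        if raw == "and":
            continue
        token = _to_cardinal(raw)
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"unknown number word: {raw!r}")
    return total + current


print(parse_number("Fifth"), parse_number("twenty-one"))
```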

Stretch Goals

Integrate the resulting library or bindings into Dateparser and price-parser.

Required Skills Natural Language Processing
Mentors Marc, Kishan
GitHub Issue #46

Parsel

HTML5 Support

Easy
Description

When you inspect a website element in a web browser, you get a DOM-based HTML tree that is different from the actual, underlying HTML tree. This makes it difficult to translate what you find in a web browser into an XPath or CSS expression that works in Parsel, especially when the underlying HTML is broken.

Expected Result

Extend Parsel to support different HTML parsers, and add support for additional HTML parsers.

Required Skills HTML, Interface Design
Mentors Andrey, Adrian
GitHub Issue #83

cssselect

CSS Selectors Level 4 Support

Advanced
Description

There is a W3C working draft for additional CSS selectors that adds many features.

Expected Result

Extend cssselect to support all CSS Selectors Level 4 that can be translated into XPath 1.0.

Required Skills CSS, XPath 1.0, Syntax Parsing
Mentors Andrey, Adrian
GitHub Issue #108