# Requirements Generator — Design

A processor that scans Python source files and produces a `requirements.txt` listing the third-party distributions the project imports. Fills the gap between the Python analyzer (which discovers local dep edges) and the pip processor (which consumes `requirements.txt`).
## Problem

Users have Python projects with import statements. They want the set of PyPI distributions their code needs, written out to `requirements.txt`. Today they maintain this file by hand, which drifts from the actual imports.
## Shape

A whole-project Generator processor named `requirements`:

- Inputs: every `.py` file in the project (same scan as the Python analyzer — `file_index.scan(&self.config.standard, true)`).
- Output: a single `requirements.txt` (path configurable).
- Discovery: one `Product` with all `.py` files as inputs, one output path. Structurally identical to the `tags` processor.
## The classification problem

Every `import X` lands in one of three buckets:

- Local — a module that resolves to a file in the project. Skip.
- Stdlib — a module shipped with Python (`os`, `sys`, `json`, …). Skip.
- Third-party — a PyPI distribution. Emit to `requirements.txt`.

The Python analyzer already resolves bucket 1 via `PythonDepAnalyzer::resolve_module`. The new processor needs buckets 2 and 3.
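The three-way decision is small enough to sketch as a pure function. `classify`, `ImportKind`, and the set arguments below are illustrative stand-ins for the real helpers, not names from the codebase:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum ImportKind {
    Local,      // bucket 1: resolves to a project file -> skip
    Stdlib,     // bucket 2: ships with Python -> skip
    ThirdParty, // bucket 3: PyPI distribution -> emit
}

// Local wins over stdlib: a project module shadowing a stdlib name
// should be treated as local, matching how Python resolves imports
// from the project root first.
fn classify(module: &str, local: &HashSet<&str>, stdlib: &HashSet<&str>) -> ImportKind {
    if local.contains(module) {
        ImportKind::Local
    } else if stdlib.contains(module) {
        ImportKind::Stdlib
    } else {
        ImportKind::ThirdParty
    }
}

fn main() {
    let local = HashSet::from(["mypkg"]);
    let stdlib = HashSet::from(["os", "sys", "json"]);
    assert_eq!(classify("mypkg", &local, &stdlib), ImportKind::Local);
    assert_eq!(classify("json", &local, &stdlib), ImportKind::Stdlib);
    assert_eq!(classify("requests", &local, &stdlib), ImportKind::ThirdParty);
}
```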
## Stdlib detection

Python 3.10+ ships `sys.stdlib_module_names` — a frozenset of every stdlib top-level module name. We bake this list into a static table (`src/processors/generators/python_stdlib.rs`) rather than probing `python3` at build time. Reasons:
- The list is stable across 3.10+ with a handful of additions per minor release.
- No tool dependency at build time — keeps the processor offline and hermetic.
- The list is ~300 names, a few KB of source.
A refresh script regenerates the table from `python3 -c 'import sys; print(sorted(sys.stdlib_module_names))'` when we bump Python support. The list lives alongside the processor, not in a user-facing config.
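A sketch of what `python_stdlib.rs` could look like, assuming the generated table is kept alphabetically sorted so the lookup can binary-search (only a few of the ~300 names shown):

```rust
// Excerpt of the generated, sorted stdlib table. The real table has
// ~300 entries produced by the refresh script.
static PYTHON_STDLIB: &[&str] = &[
    "abc", "argparse", "asyncio", "collections", "json", "os",
    // ...
    "sys", "zoneinfo",
];

// Sorted input lets us use binary search instead of a HashSet,
// keeping the table a plain const slice with no init cost.
pub fn is_stdlib(module: &str) -> bool {
    PYTHON_STDLIB.binary_search(&module).is_ok()
}

fn main() {
    assert!(is_stdlib("os"));
    assert!(is_stdlib("json"));
    assert!(!is_stdlib("requests"));
}
```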
## Import → distribution mapping

The import name is not always the PyPI distribution name:
| Import | Distribution |
|---|---|
| `cv2` | `opencv-python` |
| `yaml` | `PyYAML` |
| `PIL` | `Pillow` |
| `sklearn` | `scikit-learn` |
| `bs4` | `beautifulsoup4` |
We bake a curated table of the common ~40 mismatches into the processor and default everything else to identity (import `X` → distribution `X`). Users override via config:

```toml
[processor.requirements.mapping]
cv2 = "opencv-python"
custom_internal = "our-private-dist"
```
User entries win over the built-in table. This is lossy by design — we accept that unusual packages need a config entry — in exchange for:

- No dependency on an installed Python environment. `requirements.txt` generation works on a clean checkout (no chicken-and-egg with `pip install`).
- Deterministic output regardless of the caller's environment.
The alternative — probing `importlib.metadata.packages_distributions()` — is more accurate but requires packages to already be installed. Rejected for now; it can be added later as an opt-in `resolve = "probe"` mode if users hit the mapping ceiling.
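The fall-through order (user override, then built-in table, then identity) can be sketched as follows; the `overrides` argument is a hypothetical stand-in for the parsed `[processor.requirements.mapping]` table:

```rust
use std::collections::HashMap;

// Excerpt of the curated built-in table (the real one has ~40 entries).
static DISTRIBUTION_MAP: &[(&str, &str)] = &[
    ("PIL", "Pillow"),
    ("bs4", "beautifulsoup4"),
    ("cv2", "opencv-python"),
    ("sklearn", "scikit-learn"),
    ("yaml", "PyYAML"),
];

pub fn resolve_distribution<'a>(import: &'a str, overrides: &'a HashMap<String, String>) -> &'a str {
    // User config entries win over the built-in table.
    if let Some(dist) = overrides.get(import) {
        return dist;
    }
    // Built-in table next; unknown imports fall through to identity.
    DISTRIBUTION_MAP
        .iter()
        .find(|(name, _)| *name == import)
        .map(|(_, dist)| *dist)
        .unwrap_or(import)
}

fn main() {
    let mut overrides = HashMap::new();
    overrides.insert("cv2".to_string(), "opencv-python-headless".to_string());
    assert_eq!(resolve_distribution("yaml", &overrides), "PyYAML"); // built-in
    assert_eq!(resolve_distribution("requests", &overrides), "requests"); // identity
    assert_eq!(resolve_distribution("cv2", &overrides), "opencv-python-headless"); // override wins
}
```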
## Configuration

```toml
[processor.requirements]
output = "requirements.txt"  # Output file path
exclude = []                 # Import names to never emit (e.g. internal vendored modules)
sorted = true                # Sort output alphabetically (vs. discovery order)
header = true                # Emit a "# Generated by rsconstruct" header line

[processor.requirements.mapping]
cv2 = "opencv-python"        # User-provided import → distribution overrides
```
| Key | Type | Default | Description |
|---|---|---|---|
| `output` | string | `"requirements.txt"` | Output file path |
| `exclude` | string[] | `[]` | Import names to never emit |
| `sorted` | bool | `true` | Sort entries alphabetically |
| `header` | bool | `true` | Include a comment header line |
| `mapping` | map | `{}` | Per-project import→distribution overrides |
Pinning (`pkg==1.2.3`) is deferred. The first iteration emits bare names. Adding pinning later means probing `pip show` or parsing a lockfile — a separate concern.
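A plain-struct sketch of `RequirementsConfig` whose `Default` matches the table above; this is a sketch only — the real type would presumably derive deserialization alongside the other processor configs in `src/config/processor_configs.rs`:

```rust
use std::collections::HashMap;

// Sketch: field names follow the TOML keys, defaults follow the table.
// The real version would add a serde Deserialize derive with
// #[serde(default)] so omitted keys fall back to these values.
#[derive(Debug)]
pub struct RequirementsConfig {
    pub output: String,
    pub exclude: Vec<String>,
    pub sorted: bool,
    pub header: bool,
    pub mapping: HashMap<String, String>,
}

impl Default for RequirementsConfig {
    fn default() -> Self {
        Self {
            output: "requirements.txt".to_string(),
            exclude: Vec::new(),
            sorted: true,
            header: true,
            mapping: HashMap::new(),
        }
    }
}

fn main() {
    let cfg = RequirementsConfig::default();
    assert_eq!(cfg.output, "requirements.txt");
    assert!(cfg.sorted && cfg.header);
    assert!(cfg.exclude.is_empty() && cfg.mapping.is_empty());
}
```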
## Code organization

### Shared import scanner

Factor the regex scanning out of `src/analyzers/python.rs` into a module function shared by the analyzer and the generator:
```rust
// src/analyzers/python.rs
pub(crate) fn scan_python_imports(path: &Path) -> Result<Vec<String>> { ... }
```
Returns the raw top-level module names found in `import X` and `from X import ...` lines. The analyzer then runs this through `resolve_module` to keep local ones; the generator runs it through the stdlib table and mapping to produce the final list.
This fixes architecture-observations #6 (analyzers can’t hand data to processors) at the scope of this one feature: instead of building a cross-processor channel, we share a pure function.
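To make the extraction rule concrete (keep only the first dotted segment of each `import` / `from` line), here is a stdlib-only sketch. The real `scan_python_imports` is regex-based and takes a file path; `top_level_imports` is a hypothetical name operating on source text:

```rust
// Pulls top-level module names out of `import X` / `from X import ...`
// lines. Only the first dotted segment is kept: `import a.b.c` -> "a".
fn top_level_imports(source: &str) -> Vec<String> {
    let mut names = Vec::new();
    for line in source.lines() {
        let line = line.trim_start();
        let modules: Vec<&str> = if let Some(rest) = line.strip_prefix("import ") {
            // `import a.b as x, c` -> ["a", "c"]
            rest.split(',')
                .filter_map(|m| m.trim().split(|c: char| c == ' ' || c == '.').next())
                .collect()
        } else if let Some(rest) = line.strip_prefix("from ") {
            // `from a.b import c` -> ["a"]; relative `from . import c` yields ""
            rest.trim()
                .split(|c: char| c == ' ' || c == '.')
                .next()
                .into_iter()
                .collect()
        } else {
            continue;
        };
        for m in modules {
            if !m.is_empty() {
                names.push(m.to_string());
            }
        }
    }
    names
}

fn main() {
    let src = "import os\nimport numpy as np\nfrom collections import OrderedDict\nfrom . import sibling\n";
    assert_eq!(top_level_imports(src), vec!["os", "numpy", "collections"]);
}
```

Relative imports (`from . import x`) produce an empty first segment and are dropped, which matches the desired behavior: they are always local.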
### Files

- `src/processors/generators/requirements.rs` — the processor, ~150 lines.
- `src/processors/generators/python_stdlib.rs` — the stdlib names table (static `&[&str]`) and an `is_stdlib(module: &str) -> bool` helper.
- `src/processors/generators/distribution_map.rs` — the curated import→distribution mapping, a `resolve_distribution(import: &str) -> &str` helper that falls through to identity.
- `src/config/processor_configs.rs` — add `RequirementsConfig`.
- `src/processors/mod.rs` — add `pub const REQUIREMENTS = "requirements"` to the `names` module.
- `docs/src/processors/requirements.md` — user-facing processor doc.
## Processor structure

Mirrors `tags` (whole-project generator with one output):
```rust
pub struct RequirementsProcessor {
    base: ProcessorBase,
    config: RequirementsConfig,
}

impl Processor for RequirementsProcessor {
    fn discover(&self, graph, file_index, instance_name) -> Result<()> {
        // Scan for .py files; if none, no product.
        // Add one product: inputs=all .py files, outputs=[output_path].
    }

    fn supports_batch(&self) -> bool { false }

    fn execute(&self, _ctx, product) -> Result<()> {
        // 1. Scan each input .py for imports.
        // 2. For each top-level module name:
        //    - Skip if local (resolves to a project file).
        //    - Skip if stdlib.
        //    - Skip if in user's `exclude`.
        //    - Map import → distribution name.
        // 3. Dedupe, sort if configured, write to output.
    }
}
```
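Step 3 of `execute` (dedupe, optional sort, header, write) can be sketched as a pure rendering helper; `render_requirements` is an illustrative name, and the header text follows the config comment above:

```rust
use std::collections::{BTreeSet, HashSet};

// Renders the requirements.txt body from the mapped distribution names.
// `sort` and `header` mirror the config flags of the same names.
fn render_requirements(dists: &[&str], sort: bool, header: bool) -> String {
    let mut out = String::new();
    if header {
        out.push_str("# Generated by rsconstruct\n");
    }
    let lines: Vec<&str> = if sort {
        // BTreeSet dedupes and sorts in one pass.
        dists.iter().copied().collect::<BTreeSet<_>>().into_iter().collect()
    } else {
        // Preserve discovery order, dropping duplicates.
        let mut seen = HashSet::new();
        dists.iter().copied().filter(|d| seen.insert(*d)).collect()
    };
    for d in lines {
        out.push_str(d);
        out.push('\n');
    }
    out
}

fn main() {
    let out = render_requirements(&["requests", "numpy", "requests"], true, true);
    assert_eq!(out, "# Generated by rsconstruct\nnumpy\nrequests\n");
    let unsorted = render_requirements(&["requests", "numpy"], false, false);
    assert_eq!(unsorted, "requests\nnumpy\n");
}
```

Keeping the rendering pure makes the sorted/unsorted behavior trivially testable without touching the filesystem.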
## Cache behavior
Falls naturally out of the descriptor-based cache:
- Inputs: every `.py` file + config hash.
- Output: `requirements.txt`.
- Adding/removing an import changes file contents, triggers rebuild.
- Changing config (new mapping entry, new exclude) changes config hash, triggers rebuild.
- Code changes inside a function that don’t affect imports still trigger a rebuild, since we can’t cheaply know which lines matter. Acceptable — the regeneration is fast.
## Auto-detection

`auto_detect` returns true when the file index contains any `.py` files. Same criterion as the Python analyzer.
## Out of scope (first cut)

- Version pinning.
- Multiple output files (`requirements-dev.txt`, `requirements-test.txt`).
- Optional dependencies / extras (`pkg[extra]`).
- Reading existing `requirements.txt` to preserve comments or pins.
- `pyproject.toml` or `setup.py` output — `requirements.txt` only.
Each is a clean follow-up if users ask.