Requirements Generator — Design

A processor that scans Python source files and produces a requirements.txt listing the third-party distributions the project imports. It fills the gap between the Python analyzer (which discovers local dependency edges) and the pip processor (which consumes requirements.txt).

Problem

Users have Python projects with import statements. They want the set of PyPI distributions their code needs, written out to requirements.txt. Today they maintain this file by hand, which drifts from the actual imports.

Shape

A whole-project Generator processor named requirements:

  • Inputs: every .py file in the project (same scan as the Python analyzer — file_index.scan(&self.config.standard, true)).
  • Output: a single requirements.txt (path configurable).
  • Discovery: one Product with all .py files as inputs, one output path. Structurally identical to the tags processor.

The classification problem

Every import X lands in one of three buckets:

  1. Local — a module that resolves to a file in the project. Skip.
  2. Stdlib — a module shipped with Python (os, sys, json, …). Skip.
  3. Third-party — a PyPI distribution. Emit to requirements.txt.

The Python analyzer already resolves bucket 1 via PythonDepAnalyzer::resolve_module. The new processor needs buckets 2 and 3.
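The three-way split can be sketched as a small classification function. This is a hypothetical illustration, not the processor's actual code; the local and stdlib predicates are passed in as closures standing in for `resolve_module` and the stdlib table:

```rust
// Sketch of the three-bucket classification described above.
#[derive(Debug, PartialEq)]
enum Bucket {
    Local,      // resolves to a file in the project -> skip
    Stdlib,     // ships with Python -> skip
    ThirdParty, // PyPI distribution -> emit
}

// `is_local` stands in for PythonDepAnalyzer::resolve_module;
// `is_stdlib` for the static table lookup.
fn classify(
    module: &str,
    is_local: &dyn Fn(&str) -> bool,
    is_stdlib: &dyn Fn(&str) -> bool,
) -> Bucket {
    if is_local(module) {
        Bucket::Local
    } else if is_stdlib(module) {
        Bucket::Stdlib
    } else {
        Bucket::ThirdParty
    }
}
```

The order matters: a local module named `json` shadows the stdlib one, matching Python's own resolution behavior.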

Stdlib detection

Python 3.10+ ships sys.stdlib_module_names — a frozenset of every stdlib top-level module name. We bake this list into a static table (src/processors/generators/python_stdlib.rs) rather than probing python3 at build time. Reasons:

  • The list is stable across 3.10+ with a handful of additions per minor release.
  • No tool dependency at build time — keeps the processor offline and hermetic.
  • The list is ~300 names, a few KB of source.

A refresh script regenerates the table from python3 -c 'import sys; print(sorted(sys.stdlib_module_names))' when we bump Python support. The list lives alongside the processor, not in a user-facing config.
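The table itself is trivial. An abridged sketch of what python_stdlib.rs could look like (the real table carries all ~300 names; only the lookup helper's shape is specified above):

```rust
// Abridged: the generated table holds every name from
// sys.stdlib_module_names, kept sorted so lookup can binary-search.
static STDLIB_MODULES: &[&str] = &[
    "abc", "argparse", "asyncio", "collections", "csv", "dataclasses",
    "datetime", "functools", "io", "itertools", "json", "logging",
    "math", "os", "pathlib", "re", "subprocess", "sys", "typing",
];

pub fn is_stdlib(module: &str) -> bool {
    STDLIB_MODULES.binary_search(&module).is_ok()
}
```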

Import → distribution mapping

The import name is not always the PyPI distribution name:

  Import    Distribution
  cv2       opencv-python
  yaml      PyYAML
  PIL       Pillow
  sklearn   scikit-learn
  bs4       beautifulsoup4

We bake a curated table of the common ~40 mismatches into the processor and default everything else to identity (import X → distribution X). Users override via config:

[processor.requirements.mapping]
cv2 = "opencv-python"
custom_internal = "our-private-dist"

User entries win over the built-in table. This is lossy by design — we accept that unusual packages need a config entry — in exchange for:

  • No dependency on an installed Python environment.
  • requirements.txt generation works on a clean checkout (no chicken-and-egg with pip install).
  • Deterministic output regardless of the caller’s environment.

The alternative — probing importlib.metadata.packages_distributions() — is more accurate but requires packages to already be installed. Rejected for now; can be added later as an opt-in resolve = "probe" mode if users hit the mapping ceiling.
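The resolution order (user override, then built-in table, then identity) can be sketched as follows. The built-in entries are the ones from the table above; the function name matches the helper listed under Files:

```rust
use std::collections::HashMap;

// Built-in mismatch table (abridged), kept sorted by import name.
static BUILTIN_MAP: &[(&str, &str)] = &[
    ("PIL", "Pillow"),
    ("bs4", "beautifulsoup4"),
    ("cv2", "opencv-python"),
    ("sklearn", "scikit-learn"),
    ("yaml", "PyYAML"),
];

// User entries win over the built-in table; everything else
// falls through to identity (import X -> distribution X).
fn resolve_distribution<'a>(
    import: &'a str,
    user: &'a HashMap<String, String>,
) -> &'a str {
    if let Some(dist) = user.get(import) {
        return dist;
    }
    BUILTIN_MAP
        .iter()
        .find(|(name, _)| *name == import)
        .map(|(_, dist)| *dist)
        .unwrap_or(import)
}
```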

Configuration

[processor.requirements]
output = "requirements.txt"           # Output file path
exclude = []                          # Import names to never emit (e.g. internal vendored modules)
sorted = true                         # Sort output alphabetically (vs. discovery order)
header = true                         # Emit a "# Generated by rsconstruct" header line

[processor.requirements.mapping]
cv2 = "opencv-python"                 # User-provided import → distribution overrides

  Key       Type       Default               Description
  output    string     "requirements.txt"    Output file path
  exclude   string[]   []                    Import names to never emit
  sorted    bool       true                  Sort entries alphabetically
  header    bool       true                  Include a comment header line
  mapping   map        {}                    Per-project import→distribution overrides

Pinning (pkg==1.2.3) is deferred. The first iteration emits bare names. Adding pinning later means probing pip show or parsing a lockfile — separate concern.
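A plausible shape for RequirementsConfig, mirroring the TOML keys above (serde derives omitted; field names are assumptions based on the key names):

```rust
use std::collections::HashMap;

// Hypothetical sketch of the config struct added to
// src/config/processor_configs.rs; one field per TOML key.
pub struct RequirementsConfig {
    pub output: String,
    pub exclude: Vec<String>,
    pub sorted: bool,
    pub header: bool,
    pub mapping: HashMap<String, String>,
}

impl Default for RequirementsConfig {
    fn default() -> Self {
        Self {
            output: "requirements.txt".into(),
            exclude: Vec::new(),
            sorted: true,
            header: true,
            mapping: HashMap::new(),
        }
    }
}
```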

Code organization

Shared import scanner

Factor the regex scanning out of src/analyzers/python.rs into a module function shared by the analyzer and the generator:

// src/analyzers/python.rs
pub(crate) fn scan_python_imports(path: &Path) -> Result<Vec<String>> { ... }

Returns the raw top-level module names found in import X and from X import ... lines. The analyzer then runs this through resolve_module to keep local ones; the generator runs it through the stdlib table and mapping to produce the final list.

This fixes architecture-observations #6 (analyzers can’t hand data to processors) at the scope of this one feature: instead of building a cross-processor channel, we share a pure function.
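For illustration, here is a simplified line-oriented version of the scan, operating on source text rather than a path so it is easy to test. It approximates the analyzer's regex scanning and skips corner cases (multi-line imports, strings, comments) that the real implementation has to handle:

```rust
// Extract top-level module names from `import X` and
// `from X import ...` lines (simplified sketch).
fn scan_imports(source: &str) -> Vec<String> {
    let mut names = Vec::new();
    for line in source.lines() {
        let line = line.trim_start();
        let rest = if let Some(r) = line.strip_prefix("import ") {
            r
        } else if let Some(r) = line.strip_prefix("from ") {
            r
        } else {
            continue;
        };
        // `import a.b as x, c`: keep the top-level name of each
        // comma-separated target; `from X import ...` names one module.
        for part in rest.split(',') {
            let name = part
                .trim()
                .split_whitespace()
                .next()
                .unwrap_or("")
                .split('.')
                .next()
                .unwrap_or("");
            if !name.is_empty() {
                names.push(name.to_string());
            }
            if line.starts_with("from ") {
                break;
            }
        }
    }
    names
}
```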

Files

  • src/processors/generators/requirements.rs — the processor, ~150 lines.
  • src/processors/generators/python_stdlib.rs — the stdlib names table (static &[&str]) and a is_stdlib(module: &str) -> bool helper.
  • src/processors/generators/distribution_map.rs — the curated import→distribution mapping, a resolve_distribution(import: &str) -> &str helper that falls through to identity.
  • src/config/processor_configs.rs — add RequirementsConfig.
  • src/processors/mod.rs — add pub const REQUIREMENTS: &str = "requirements" to the names module.
  • docs/src/processors/requirements.md — user-facing processor doc.

Processor structure

Mirrors tags (whole-project generator with one output):

pub struct RequirementsProcessor {
    base: ProcessorBase,
    config: RequirementsConfig,
}

impl Processor for RequirementsProcessor {
    fn discover(&self, graph, file_index, instance_name) -> Result<()> {
        // Scan for .py files; if none, no product.
        // Add one product: inputs=all .py files, outputs=[output_path].
    }

    fn supports_batch(&self) -> bool { false }

    fn execute(&self, _ctx, product) -> Result<()> {
        // 1. Scan each input .py for imports.
        // 2. For each top-level module name:
        //    - Skip if local (resolves to a project file).
        //    - Skip if stdlib.
        //    - Skip if in user's `exclude`.
        //    - Map import → distribution name.
        // 3. Dedupe, sort if configured, write to output.
    }
}
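Step 3 of execute is mechanical enough to sketch in full. This hypothetical helper takes the already-mapped distribution names and renders the file body according to the sorted and header settings:

```rust
use std::collections::BTreeSet;

// Render the requirements.txt body: dedupe, optionally sort, and
// prepend the generated-file header when configured.
fn render(distributions: &[String], sorted: bool, header: bool) -> String {
    let mut out = String::new();
    if header {
        out.push_str("# Generated by rsconstruct\n");
    }
    if sorted {
        // BTreeSet dedupes and orders in one pass.
        for dist in distributions.iter().collect::<BTreeSet<_>>() {
            out.push_str(dist.as_str());
            out.push('\n');
        }
    } else {
        // Discovery order: dedupe, keeping the first occurrence.
        let mut seen = BTreeSet::new();
        for dist in distributions {
            if seen.insert(dist) {
                out.push_str(dist.as_str());
                out.push('\n');
            }
        }
    }
    out
}
```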

Cache behavior

Falls naturally out of the descriptor-based cache:

  • Inputs: every .py file + config hash.
  • Output: requirements.txt.
  • Adding/removing an import changes file contents, triggers rebuild.
  • Changing config (new mapping entry, new exclude) changes config hash, triggers rebuild.
  • Code changes inside a function that don’t affect imports still trigger a rebuild, since we can’t cheaply know which lines matter. Acceptable — the regeneration is fast.

Auto-detection

auto_detect returns true when the file index contains any .py files. Same criterion as the Python analyzer.

Out of scope (first cut)

  • Version pinning.
  • Multiple output files (requirements-dev.txt, requirements-test.txt).
  • Optional dependencies / extras (pkg[extra]).
  • Reading existing requirements.txt to preserve comments or pins.
  • pyproject.toml or setup.py output — requirements.txt only.

Each is a clean follow-up if users ask.