Introduction
rsdedup is a fast, Rust-based file deduplication tool. It scans directories for duplicate files and supports multiple actions: reporting, deleting, hardlinking, and symlinking duplicates.
Key Features
- Multiple actions — report, delete, hardlink, or symlink duplicates
- Smart comparison — size grouping, then partial 4KB hash, then full hash
- Multiple hash algorithms — SHA-256, xxHash, BLAKE3
- Persistent hash cache — avoids rehashing unchanged files across runs
- Parallel hashing — configurable thread count for fast scanning
- Flexible filtering — include/exclude globs, min/max file size
- Multiple output formats — human-readable text or JSON
- Dry-run mode — preview destructive operations before executing
- Shell completions — bash, zsh, fish, elvish, and powershell
Philosophy
rsdedup is designed to be:
- Safe by default — read-only operations unless you explicitly ask for changes
- Fast — multi-stage pipeline eliminates candidates early, parallel hashing
- Incremental — persistent cache means repeated scans are nearly instant
- Unix-friendly — composable with other tools via JSON output and meaningful exit codes
Installation
From crates.io
cargo install rsdedup
From source
git clone https://github.com/veltzer/rsdedup.git
cd rsdedup
cargo install --path .
Pre-built binaries
Download pre-built binaries for Linux (x86_64, aarch64), macOS (x86_64, aarch64), and Windows (x86_64) from the GitHub Releases page.
Shell completions
After installing, generate shell completions:
# Bash
rsdedup complete bash > ~/.local/share/bash-completion/completions/rsdedup
# Zsh
rsdedup complete zsh > ~/.zfunc/_rsdedup
# Fish
rsdedup complete fish > ~/.config/fish/completions/rsdedup.fish
Getting Started
Find duplicates
The simplest way to use rsdedup is to report duplicates in the current directory:
rsdedup dedup report
Or specify a path:
rsdedup dedup report /home/user/photos
Warm up the cache
For large directories, pre-populate the hash cache first. This makes subsequent operations much faster:
rsdedup cache scan /home/user/photos
Preview before acting
Always use --dry-run before destructive operations:
# See what would be deleted
rsdedup dedup delete --dry-run /home/user/photos
# See what would be hardlinked
rsdedup dedup hardlink --dry-run /home/user/photos
Delete duplicates
Delete duplicates, keeping the oldest file in each group:
rsdedup dedup delete --keep oldest /home/user/photos
Save space with hardlinks
Replace duplicates with hardlinks — all copies still appear as separate files but share disk space:
rsdedup dedup hardlink /home/user/photos
JSON output for scripting
rsdedup dedup report --output json /home/user/photos
Typical workflow
# 1. Warm cache (optional, speeds up repeated runs)
rsdedup cache scan ~/photos
# 2. See what's duplicated
rsdedup dedup report ~/photos
# 3. Preview cleanup
rsdedup dedup delete --dry-run --keep oldest ~/photos
# 4. Execute
rsdedup dedup delete --keep oldest ~/photos
Commands
rsdedup uses a two-level subcommand structure:
rsdedup <command> <subcommand> [options] [path]
Top-level commands
| Command | Description |
|---|---|
| dedup | Find and act on duplicate files |
| cache | Manage the hash cache |
| version | Show version and build information |
| complete | Generate shell completions |
Global options
These options apply to all commands that scan files. They are hidden from the short help (-h) but visible in the long help (--help).
| Flag | Description | Default |
|---|---|---|
| --compare <METHOD> | Comparison method: size-hash, hash, byte-for-byte | size-hash |
| --hash <ALGO> | Hash algorithm: sha256, xxhash, blake3 | sha256 |
| --min-size <BYTES> | Minimum file size to consider | none |
| --max-size <BYTES> | Maximum file size to consider | none |
| -r, --recursive | Recurse into subdirectories | true |
| --no-recursive | Do not recurse | false |
| --follow-symlinks | Follow symbolic links | false |
| -v, --verbose | Verbose output | false |
| --output <FORMAT> | Output format: text, json | text |
| -j, --jobs <N> | Number of parallel workers | CPU count |
| --no-cache | Disable the hash cache | false |
| --no-timing | Disable timing output | false |
| --exclude <GLOB> | Exclude files matching pattern (repeatable) | none |
| --include <GLOB> | Only include files matching pattern (repeatable) | none |
dedup
Find and act on duplicate files.
rsdedup dedup <subcommand> [options] [path]
All dedup subcommands default to the current directory if no path is given.
Subcommands
report
Find and report duplicate files. No files are modified.
rsdedup dedup report
rsdedup dedup report /home/user/photos
rsdedup dedup report --output json /data
delete
Delete duplicate files, keeping one copy per group.
rsdedup dedup delete /home/user/photos
rsdedup dedup delete --keep oldest /home/user/photos
rsdedup dedup delete --dry-run /home/user/photos
| Flag | Description | Default |
|---|---|---|
| --keep <STRATEGY> | Which file to keep: interactive, first, newest, oldest, shortest-path | interactive |
| -n, --dry-run | Show what would be done without making changes | false |
Keep strategies
| Strategy | Description |
|---|---|
| interactive | Prompt for each duplicate group, showing files sorted alphabetically |
| first | Keep the first file encountered during the directory walk |
| newest | Keep the file with the most recent modification time |
| oldest | Keep the file with the oldest modification time |
| shortest-path | Keep the file with the shortest path |
The default is interactive, which presents each duplicate group and lets you choose which file to keep. Use one of the other strategies for non-interactive (scripted) usage.
hardlink
Replace duplicate files with hardlinks to a single copy. All file paths continue to work, but they share the same disk blocks.
rsdedup dedup hardlink /data
rsdedup dedup hardlink --dry-run /data
| Flag | Description | Default |
|---|---|---|
| -n, --dry-run | Show what would be done without making changes | false |
Hardlinks cannot cross filesystem boundaries. rsdedup will report an error if duplicates span different filesystems.
symlink
Replace duplicate files with symbolic links to a single copy.
rsdedup dedup symlink /data
rsdedup dedup symlink --dry-run /data
| Flag | Description | Default |
|---|---|---|
| -n, --dry-run | Show what would be done without making changes | false |
cache
Manage the persistent hash cache stored at ~/.rsdedup/cache.db.
rsdedup cache <subcommand>
Subcommands
scan
Scan a directory and populate the hash cache with both partial (4KB) and full file hashes. No deduplication is performed.
rsdedup cache scan
rsdedup cache scan /home/user/photos
This is useful for warming up the cache before running dedup operations. On subsequent runs, unchanged files are skipped.
The scan command shows timing by default. Use --no-timing to suppress it. Example output:
cache location: /home/user/.rsdedup/cache.db
scanned 1234 files: 100 hashed, 1134 already cached
elapsed: 2.345s
clear
Delete all entries from the hash cache.
rsdedup cache clear
stats
Show detailed cache statistics.
rsdedup cache stats
Example output:
cache location: /home/user/.rsdedup/cache.db
total entries: 1234
database size: 4.50 MB (4718592 bytes)
total file size: 12.34 GB (13249974886 bytes)
with partial hash: 1234
with full hash: 1234
stale (file gone): 3
oldest entry: 5d ago
newest entry: 2m ago
hash algorithms:
sha256: 1234
prune
Remove cache entries for files that no longer exist on disk.
rsdedup cache prune
Example output:
pruned 42 stale entries
list
List all cache entries in tab-separated format, suitable for parsing with awk, cut, or other tools.
rsdedup cache list
Output columns:
| Column | Description |
|---|---|
| path | File path |
| size | File size in bytes |
| algo | Hash algorithm used |
| partial_hash | Partial hash (first 4KB), empty if not computed |
| full_hash | Full file hash, empty if not computed |
| cached_at | Unix timestamp when the entry was cached |
Example:
# List all cached files
rsdedup cache list
# Find entries for a specific directory
rsdedup cache list | awk -F'\t' '$1 ~ /photos/'
# Show only files with full hashes
rsdedup cache list | awk -F'\t' '$5 != ""'
version
Show version and build information.
rsdedup version
Example output:
rsdedup 0.1.0 by Mark Veltzer <mark.veltzer@gmail.com>
GIT_DESCRIBE: v0.1.0
GIT_SHA: abc123def456
GIT_BRANCH: master
GIT_DIRTY: false
RUSTC_SEMVER: 1.94.0
RUST_EDITION: 2024
BUILD_TIMESTAMP: 2026-03-24 01:30:35
complete
Generate shell completion scripts.
rsdedup complete <shell>
Supported shells: bash, zsh, fish, elvish, powershell.
Examples
# Bash
rsdedup complete bash > ~/.local/share/bash-completion/completions/rsdedup
# Zsh (add ~/.zfunc to your fpath)
rsdedup complete zsh > ~/.zfunc/_rsdedup
# Fish
rsdedup complete fish > ~/.config/fish/completions/rsdedup.fish
Comparison Strategies
rsdedup supports three strategies for determining whether files are duplicates. Choose with --compare <METHOD>.
size-hash (default)
The default strategy uses a multi-stage pipeline for best performance:
- Size grouping — files with unique sizes are immediately excluded (they can’t be duplicates)
- Partial hash — hash only the first 4KB of each file; files with unique partial hashes are excluded
- Full hash — hash the entire file for remaining candidates
This avoids reading entire files when a quick check can rule out matches. For most workloads, the vast majority of files are eliminated in stages 1 and 2.
rsdedup dedup report --compare size-hash # default, same as omitting
hash
Skip the partial hash stage and compute the full hash for all files in each size group.
This is simpler but slower for large files where the first 4KB would have been enough to distinguish them.
rsdedup dedup report --compare hash
byte-for-byte
Compare files byte-by-byte without hashing. This guarantees zero false positives (no hash collisions possible) but is slower because every pair of candidate files must be read and compared.
rsdedup dedup report --compare byte-for-byte
Which should I use?
| Strategy | Speed | False positives | Best for |
|---|---|---|---|
| size-hash | Fastest | Theoretically possible (hash collision) | General use |
| hash | Fast | Theoretically possible | When many files share the same first 4KB, making the partial-hash stage pure overhead |
| byte-for-byte | Slowest | Zero | When absolute certainty is required |
For virtually all practical use cases, size-hash is the right choice.
Hash Algorithms
rsdedup supports three hash algorithms. Choose with --hash <ALGO>.
SHA-256 (default)
A widely-used cryptographic hash function producing 256-bit digests. Very low collision probability.
rsdedup dedup report --hash sha256
xxHash (xxh3-128)
A non-cryptographic hash optimized for speed. Produces 128-bit digests. Significantly faster than SHA-256 for large files.
rsdedup dedup report --hash xxhash
Use xxHash when you’re scanning large datasets and trust that the files are not adversarially crafted.
BLAKE3
A modern cryptographic hash that’s both fast and secure. Often faster than SHA-256 while providing equivalent security.
rsdedup dedup report --hash blake3
Comparison
| Algorithm | Type | Output | Speed | Security |
|---|---|---|---|---|
| SHA-256 | Cryptographic | 256-bit | Moderate | High |
| xxHash | Non-cryptographic | 128-bit | Very fast | None |
| BLAKE3 | Cryptographic | 256-bit | Fast | High |
For most users, the default SHA-256 is fine. If performance matters more than cryptographic guarantees, use xxHash. If you want both speed and security, use BLAKE3.
Hash Cache
rsdedup maintains a persistent hash cache at ~/.rsdedup/cache.db to avoid rehashing files that haven’t changed.
How it works
The cache is a key-value store (using sled) where:
- Key: absolute file path
- Value: cached metadata and hash values
Each cache entry stores:
- File size
- Modification time (seconds + nanoseconds)
- Inode number
- Hash algorithm used
- Partial hash (first 4KB)
- Full file hash
- Timestamp of when the entry was cached
Cache invalidation
A cached hash is considered valid only if all of the following still match the current file:
- Size
- Modification time (mtime)
- Inode number
If any of these differ, the file is rehashed and the cache entry is updated.
Cache operations
# Pre-populate the cache
rsdedup cache scan /path/to/directory
# View cache statistics
rsdedup cache stats
# Clear the cache
rsdedup cache clear
Disabling the cache
Use --no-cache to skip the cache entirely for a single run:
rsdedup dedup report --no-cache /path
This is useful for benchmarking or when you suspect cache corruption.
Cache location
The cache is stored at ~/.rsdedup/cache.db. The directory is created automatically on first use.
Incremental scanning
The cache scan command is incremental. On repeated runs, only files that have changed (or are new) are hashed. Files that haven’t changed are skipped. Both partial (4KB) and full hashes are stored for every file.
Filtering
rsdedup provides several ways to control which files are considered.
Include / Exclude globs
Use --include and --exclude to filter files by glob pattern. Both flags can be repeated.
# Only scan image files
rsdedup dedup report --include '*.jpg' --include '*.png'
# Skip log files and git directories
rsdedup dedup report --exclude '*.log' --exclude '.git/**'
Patterns are matched against both the filename and the full path.
When --include is specified, only files matching at least one include pattern are considered. When --exclude is specified, files matching any exclude pattern are skipped. If both are specified, exclude takes priority.
File size filters
# Only consider files larger than 1MB
rsdedup dedup report --min-size 1048576
# Only consider files smaller than 100MB
rsdedup dedup report --max-size 104857600
# Combine both
rsdedup dedup report --min-size 1024 --max-size 104857600
Recursion
By default, rsdedup recurses into subdirectories. Use --no-recursive to scan only the top-level directory:
rsdedup dedup report --no-recursive /data
Symbolic links
By default, symbolic links are not followed. Use --follow-symlinks to follow them:
rsdedup dedup report --follow-symlinks /data
Parallelism
rsdedup uses multi-threading to speed up the comparison phase of duplicate detection.
How it works
The comparison pipeline processes files grouped by size. Each size group is processed independently, which makes it a natural fit for parallelism. rsdedup uses rayon to distribute size groups across a thread pool — multiple size groups are compared concurrently.
Size group A (all 1 KB files) ──→ Thread 1
Size group B (all 5 KB files) ──→ Thread 2
Size group C (all 12 KB files) ──→ Thread 3
...
What is parallelized
- Comparison phase — size groups are processed in parallel using rayon's par_iter. Each thread handles hashing and comparing files within one size group.
What is not parallelized
- Directory scanning — uses walkdir, which is single-threaded and I/O-bound.
- Actions (delete, hardlink, symlink) — performed sequentially after duplicates are found.
- Within a single size group — files in the same size group are hashed and compared sequentially. This means a single large size group (many files of the same size) will not benefit from additional threads.
Controlling thread count
Use the --jobs (or -j) flag to set the number of worker threads:
# Use 4 threads
rsdedup dedup report --jobs 4 /data
# Use a single thread (no parallelism)
rsdedup dedup report --jobs 1 /data
The default is the number of CPU cores reported by std::thread::available_parallelism().
When parallelism helps most
Parallelism provides the biggest speedup when:
- There are many size groups with duplicates to compare — more groups means more work to distribute across threads.
- Files are large — hashing large files is CPU-intensive, so parallel hashing of different size groups gives a significant speedup.
- The storage is fast (SSD/NVMe) — on slow spinning disks, I/O is the bottleneck and adding threads may not help.
Parallelism helps less when:
- Most files fall into one or a few size groups — there isn’t enough independent work to distribute.
- Files are very small — hashing is fast and the overhead of thread coordination dominates.
- Using --compare byte-for-byte — byte-for-byte comparison is I/O-heavy, so additional CPU threads offer less benefit.
Output Formats
rsdedup supports two output formats, selected with --output <FORMAT>.
Text (default)
Human-readable output showing duplicate groups and a summary.
Group 1 — 3 files, 12 bytes each (hash: a948904f2f0f479b):
/home/user/photos/img001.jpg
/home/user/photos/backup/img001.jpg
/home/user/photos/old/img001.jpg
--- Summary ---
Files scanned: 150
Duplicate groups: 1
Duplicate files: 2
Wasted space: 24 bytes
Action: report
Files affected: 0
Space recovered: 0 bytes
JSON
Machine-readable JSON output for scripting and integration with other tools.
rsdedup dedup report --output json /path
The duplicate groups are output as a JSON array:
[
{
"group": 1,
"size": 12,
"hash": "a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447",
"files": [
"/home/user/photos/img001.jpg",
"/home/user/photos/backup/img001.jpg",
"/home/user/photos/old/img001.jpg"
]
}
]
Followed by a JSON summary object:
{
"files_scanned": 150,
"duplicate_groups": 1,
"duplicate_files": 2,
"wasted_bytes": 24,
"action_taken": "report",
"files_affected": 0,
"bytes_recovered": 0
}
Exit Codes
rsdedup uses meaningful exit codes for scripting:
| Code | Meaning |
|---|---|
| 0 | Success, no duplicates found |
| 1 | Success, duplicates found |
| 2 | Error |
Examples
# Check if a directory has duplicates
if rsdedup dedup report /data > /dev/null 2>&1; then
echo "No duplicates"
else
echo "Duplicates found"
fi
# Use in CI to fail if duplicates exist
rsdedup dedup report /assets && echo "Clean" || echo "Duplicates detected"
Design
Overview
rsdedup is a fast, Rust-based file deduplication tool. It scans directories for duplicate files and supports multiple actions: reporting, hardlinking/symlinking, and deleting duplicates (keeping one copy).
Goals
- Fast duplicate detection across large directory trees
- Multiple comparison strategies (hash-based, size+hash, byte-for-byte)
- Multiple actions on duplicates (report, hardlink, symlink, delete)
- Safe defaults — report-only unless explicitly told to modify files
- Parallel file hashing for performance
Pipeline Architecture
rsdedup processes files through a multi-stage pipeline where each stage reduces the candidate set:
1. Scan → Walk directories, collect file metadata
2. Group → Group files by size (unique sizes eliminated)
3. Filter → Apply min-size, max-size, include/exclude filters
4. Compare → Compare candidates using the chosen strategy
5. Act → Perform the chosen action on duplicate groups
Module Structure
src/
├── main.rs — Orchestration and command dispatch
├── cli.rs — CLI definitions (clap derive)
├── scanner.rs — Directory walking with walkdir
├── grouper.rs — Group files by size
├── compare.rs — Comparison strategies (size-hash, hash, byte-for-byte)
├── hasher.rs — Hash implementations (SHA-256, xxHash, BLAKE3)
├── cache.rs — Persistent hash cache (sled)
├── action.rs — Actions: report, delete, hardlink, symlink
├── output.rs — Output formatting (text, JSON)
├── types.rs — Shared types
└── error.rs — Exit codes
Key Types
use std::fs::Metadata;
use std::path::PathBuf;

struct FileEntry {
    path: PathBuf,
    size: u64,
    metadata: Metadata,
}

struct DuplicateGroup {
    size: u64,
    hash: String,
    files: Vec<FileEntry>,
}

enum CompareMethod {
    SizeHash,
    Hash,
    ByteForByte,
}

enum KeepStrategy {
    Interactive,
    First,
    Newest,
    Oldest,
    ShortestPath,
}
Parallelism
- Directory walking is single-threaded (I/O bound, using walkdir)
- File comparison uses a rayon thread pool — size groups are processed in parallel
- Within a single size group, files are hashed and compared sequentially
- Thread count is configurable via --jobs (defaults to CPU core count)
See the Parallelism chapter for details on controlling thread count and when parallelism helps most.
Cache Design
The hash cache uses sled, an embedded key-value store at ~/.rsdedup/cache.db. Each entry maps a file path to its metadata (size, mtime, inode) and hash values (partial and full). Cache entries are invalidated when any metadata field changes. The cache merges partial and full hashes — computing one doesn’t overwrite the other.
See the Hash Cache chapter for details.
Safety
- Default action is report-only — no files are modified unless explicitly requested
- --dry-run shows what would happen without making changes
- No cross-filesystem hardlinks — detected and reported as errors
- Symlink loops are avoided by not following symlinks by default
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success, no duplicates found |
| 1 | Success, duplicates found |
| 2 | Error |
This makes rsdedup scriptable (e.g. rsdedup dedup report && echo "clean").
Dependencies
| Crate | Purpose |
|---|---|
| clap | CLI argument parsing |
| clap_complete | Shell completion generation |
| walkdir | Recursive directory traversal |
| rayon | Parallel hashing |
| sha2 | SHA-256 |
| xxhash-rust | xxHash (xxh3-128) |
| blake3 | BLAKE3 |
| sled | Embedded key-value cache |
| bincode | Cache entry serialization |
| serde / serde_json | JSON output |
| globset | Include/exclude glob matching |
| anyhow | Error handling |