Introduction
rsdedup is a fast, Rust-based file deduplication tool. It scans directories for duplicate files and supports multiple actions: reporting, deleting, hardlinking, and symlinking duplicates.
Key Features
- Multiple actions — report, delete, hardlink, or symlink duplicates
- Smart comparison — size grouping, then partial 4KB hash, then full hash
- Multiple hash algorithms — SHA-256, xxHash, BLAKE3
- Persistent hash cache — avoids rehashing unchanged files across runs
- Parallel hashing — configurable thread count for fast scanning
- Flexible filtering — include/exclude globs, min/max file size
- Multiple output formats — human-readable text or JSON
- Dry-run mode — preview destructive operations before executing
- Shell completions — bash, zsh, fish, elvish, and powershell
Philosophy
rsdedup is designed to be:
- Safe by default — read-only operations unless you explicitly ask for changes
- Fast — multi-stage pipeline eliminates candidates early, parallel hashing
- Incremental — persistent cache means repeated scans are nearly instant
- Unix-friendly — composable with other tools via JSON output and meaningful exit codes
Installation
From crates.io
cargo install rsdedup
From source
git clone https://github.com/veltzer/rsdedup.git
cd rsdedup
cargo install --path .
Pre-built binaries
Download pre-built binaries for Linux (x86_64, aarch64), macOS (x86_64, aarch64), and Windows (x86_64) from the GitHub Releases page.
Shell completions
After installing, generate shell completions:
# Bash
rsdedup complete bash > ~/.local/share/bash-completion/completions/rsdedup
# Zsh
rsdedup complete zsh > ~/.zfunc/_rsdedup
# Fish
rsdedup complete fish > ~/.config/fish/completions/rsdedup.fish
Getting Started
Find duplicates
The simplest way to use rsdedup is to report duplicates in the current directory:
rsdedup dedup report
Or specify a path:
rsdedup dedup report /home/user/photos
Warm up the cache
For large directories, pre-populate the hash cache first. This makes subsequent operations much faster:
rsdedup cache scan /home/user/photos
Preview before acting
Always use --dry-run before destructive operations:
# See what would be deleted
rsdedup dedup delete --dry-run /home/user/photos
# See what would be hardlinked
rsdedup dedup hardlink --dry-run /home/user/photos
Delete duplicates
Delete duplicates, keeping the oldest file in each group:
rsdedup dedup delete --keep oldest /home/user/photos
Save space with hardlinks
Replace duplicates with hardlinks — all copies still appear as separate files but share disk space:
rsdedup dedup hardlink /home/user/photos
JSON output for scripting
rsdedup dedup report --output json /home/user/photos
Typical workflow
# 1. Warm cache (optional, speeds up repeated runs)
rsdedup cache scan ~/photos
# 2. See what's duplicated
rsdedup dedup report ~/photos
# 3. Preview cleanup
rsdedup dedup delete --dry-run --keep oldest ~/photos
# 4. Execute
rsdedup dedup delete --keep oldest ~/photos
Commands
rsdedup uses a two-level subcommand structure:
rsdedup <command> <subcommand> [options] [path]
Top-level commands
| Command | Description |
|---|---|
| dedup | Find and act on duplicate files |
| cache | Manage the hash cache |
| version | Show version and build information |
| complete | Generate shell completions |
Global options
These options apply to all commands that scan files. They are hidden from the short help (-h) but visible in the long help (--help).
| Flag | Description | Default |
|---|---|---|
| --compare <METHOD> | Comparison method: size-hash, hash, byte-for-byte | size-hash |
| --hash <ALGO> | Hash algorithm: sha256, xxhash, blake3 | sha256 |
| --min-size <BYTES> | Minimum file size to consider | none |
| --max-size <BYTES> | Maximum file size to consider | none |
| -r, --recursive | Recurse into subdirectories | true |
| --no-recursive | Do not recurse | false |
| --follow-symlinks | Follow symbolic links | false |
| -v, --verbose | Verbose output | false |
| --output <FORMAT> | Output format: text, json | text |
| -j, --jobs <N> | Number of parallel workers | CPU count |
| --no-cache | Disable the hash cache | false |
| --no-timing | Disable timing output | false |
| --exclude <GLOB> | Exclude files matching pattern (repeatable) | none |
| --include <GLOB> | Only include files matching pattern (repeatable) | none |
dedup
Find and act on duplicate files.
rsdedup dedup <subcommand> [options] [path]
All dedup subcommands default to the current directory if no path is given.
Subcommands
report
Find and report duplicate files. No files are modified.
rsdedup dedup report
rsdedup dedup report /home/user/photos
rsdedup dedup report --output json /data
delete
Delete duplicate files, keeping one copy per group.
rsdedup dedup delete /home/user/photos
rsdedup dedup delete --keep oldest /home/user/photos
rsdedup dedup delete --dry-run /home/user/photos
| Flag | Description | Default |
|---|---|---|
| --keep <STRATEGY> | Which file to keep: interactive, first, newest, oldest, shortest-path | interactive |
| -n, --dry-run | Show what would be done without making changes | false |
Keep strategies
| Strategy | Description |
|---|---|
| interactive | Prompt for each duplicate group, showing files sorted alphabetically |
| first | Keep the first file encountered during the directory walk |
| newest | Keep the file with the most recent modification time |
| oldest | Keep the file with the oldest modification time |
| shortest-path | Keep the file with the shortest path |
The default is interactive, which presents each duplicate group and lets you choose which file to keep. Use one of the other strategies for non-interactive (scripted) usage.
hardlink
Replace duplicate files with hardlinks to a single copy. All file paths continue to work, but they share the same disk blocks.
rsdedup dedup hardlink /data
rsdedup dedup hardlink --dry-run /data
| Flag | Description | Default |
|---|---|---|
| -n, --dry-run | Show what would be done without making changes | false |
Hardlinks cannot cross filesystem boundaries. rsdedup will report an error if duplicates span different filesystems.
symlink
Replace duplicate files with symbolic links to a single copy.
rsdedup dedup symlink /data
rsdedup dedup symlink --dry-run /data
| Flag | Description | Default |
|---|---|---|
| -n, --dry-run | Show what would be done without making changes | false |
cache
Manage the persistent hash cache stored at ~/.rsdedup/cache.db.
rsdedup cache <subcommand>
Subcommands
scan
Scan a directory and populate the hash cache with both partial (4KB) and full file hashes. No deduplication is performed.
rsdedup cache scan
rsdedup cache scan /home/user/photos
This is useful for warming up the cache before running dedup operations. On subsequent runs, unchanged files are skipped.
The scan command shows timing by default. Use --no-timing to suppress it. Example output:
cache location: /home/user/.rsdedup/cache.db
scanned 1234 files: 100 hashed, 1134 already cached
elapsed: 2.345s
clear
Delete all entries from the hash cache.
rsdedup cache clear
stats
Show detailed cache statistics.
rsdedup cache stats
Example output:
cache location: /home/user/.rsdedup/cache.db
total entries: 1234
database size: 4.50 MB (4718592 bytes)
total file size: 12.34 GB (13249974886 bytes)
with partial hash: 1234
with full hash: 1234
stale (file gone): 3
oldest entry: 5d ago
newest entry: 2m ago
hash algorithms:
sha256: 1234
prune
Remove cache entries for files that no longer exist on disk.
rsdedup cache prune
Example output:
pruned 42 stale entries
list
List all cache entries in tab-separated format, suitable for parsing with awk, cut, or other tools.
rsdedup cache list
Output columns:
| Column | Description |
|---|---|
| path | File path |
| size | File size in bytes |
| algo | Hash algorithm used |
| partial_hash | Partial hash (first 4KB), empty if not computed |
| full_hash | Full file hash, empty if not computed |
| cached_at | Unix timestamp when the entry was cached |
Example:
# List all cached files
rsdedup cache list
# Find entries for a specific directory
rsdedup cache list | awk -F'\t' '$1 ~ /photos/'
# Show only files with full hashes
rsdedup cache list | awk -F'\t' '$5 != ""'
version
Show version and build information.
rsdedup version
Example output:
rsdedup 0.1.0 by Mark Veltzer <mark.veltzer@gmail.com>
GIT_DESCRIBE: v0.1.0
GIT_SHA: abc123def456
GIT_BRANCH: master
GIT_DIRTY: false
RUSTC_SEMVER: 1.94.0
RUST_EDITION: 2024
BUILD_TIMESTAMP: 2026-03-24 01:30:35
complete
Generate shell completion scripts.
rsdedup complete <shell>
Supported shells: bash, zsh, fish, elvish, powershell.
Examples
# Bash
rsdedup complete bash > ~/.local/share/bash-completion/completions/rsdedup
# Zsh (add ~/.zfunc to your fpath)
rsdedup complete zsh > ~/.zfunc/_rsdedup
# Fish
rsdedup complete fish > ~/.config/fish/completions/rsdedup.fish
Comparison Strategies
rsdedup supports three strategies for determining whether files are duplicates. Choose with --compare <METHOD>.
size-hash (default)
The default strategy uses a multi-stage pipeline for best performance:
- Size grouping — files with unique sizes are immediately excluded (they can’t be duplicates)
- Partial hash — hash only the first 4KB of each file; files with unique partial hashes are excluded
- Full hash — hash the entire file for remaining candidates
This avoids reading entire files when a quick check can rule out matches. For most workloads, the vast majority of files are eliminated in stages 1 and 2.
rsdedup dedup report --compare size-hash # default, same as omitting
hash
Skip the partial hash stage and compute the full hash for all files in each size group.
This is simpler but slower for large files where the first 4KB would have been enough to distinguish them.
rsdedup dedup report --compare hash
byte-for-byte
Compare files byte-by-byte without hashing. This guarantees zero false positives (no hash collisions possible) but is slower because every pair of candidate files must be read and compared.
rsdedup dedup report --compare byte-for-byte
Which should I use?
| Strategy | Speed | False positives | Best for |
|---|---|---|---|
| size-hash | Fastest | Theoretically possible (hash collision) | General use |
| hash | Fast | Theoretically possible | When many files share the same first 4KB, making the partial-hash stage pure overhead |
| byte-for-byte | Slowest | Zero | When absolute certainty is required |
For virtually all practical use cases, size-hash is the right choice.
Hash Algorithms
rsdedup supports three hash algorithms. Choose with --hash <ALGO>.
SHA-256 (default)
A widely-used cryptographic hash function producing 256-bit digests. Very low collision probability.
rsdedup dedup report --hash sha256
xxHash (xxh3-128)
A non-cryptographic hash optimized for speed. Produces 128-bit digests. Significantly faster than SHA-256 for large files.
rsdedup dedup report --hash xxhash
Use xxHash when you’re scanning large datasets and trust that the files are not adversarially crafted.
BLAKE3
A modern cryptographic hash that’s both fast and secure. Often faster than SHA-256 while providing equivalent security.
rsdedup dedup report --hash blake3
Comparison
| Algorithm | Type | Output | Speed | Security |
|---|---|---|---|---|
| SHA-256 | Cryptographic | 256-bit | Moderate | High |
| xxHash | Non-cryptographic | 128-bit | Very fast | None |
| BLAKE3 | Cryptographic | 256-bit | Fast | High |
For most users, the default SHA-256 is fine. If performance matters more than cryptographic guarantees, use xxHash. If you want both speed and security, use BLAKE3.
Hash Cache
rsdedup maintains a persistent hash cache at ~/.rsdedup/cache.db to avoid rehashing files that haven’t changed.
How it works
The cache is a key-value store (using sled) where:
- Key: absolute file path
- Value: cached metadata and hash values
Each cache entry stores:
- File size
- Modification time (seconds + nanoseconds)
- Inode number
- Hash algorithm used
- Partial hash (first 4KB)
- Full file hash
- Timestamp of when the entry was cached
Cache invalidation
A cached hash is considered valid only if all of the following still match the current file:
- Size
- Modification time (mtime)
- Inode number
If any of these differ, the file is rehashed and the cache entry is updated.
Cache operations
# Pre-populate the cache
rsdedup cache scan /path/to/directory
# View cache statistics
rsdedup cache stats
# Clear the cache
rsdedup cache clear
Disabling the cache
Use --no-cache to skip the cache entirely for a single run:
rsdedup dedup report --no-cache /path
This is useful for benchmarking or when you suspect cache corruption.
Cache location
The cache is stored at ~/.rsdedup/cache.db. The directory is created automatically on first use.
Incremental scanning
The cache scan command is incremental. On repeated runs, only files that have changed (or are new) are hashed. Files that haven’t changed are skipped. Both partial (4KB) and full hashes are stored for every file.
Filtering
rsdedup provides several ways to control which files are considered.
Include / Exclude globs
Use --include and --exclude to filter files by glob pattern. Both flags can be repeated.
# Only scan image files
rsdedup dedup report --include '*.jpg' --include '*.png'
# Skip log files and git directories
rsdedup dedup report --exclude '*.log' --exclude '.git/**'
Patterns are matched against both the filename and the full path.
When --include is specified, only files matching at least one include pattern are considered. When --exclude is specified, files matching any exclude pattern are skipped. If both are specified, exclude takes priority.
File size filters
# Only consider files larger than 1MB
rsdedup dedup report --min-size 1048576
# Only consider files smaller than 100MB
rsdedup dedup report --max-size 104857600
# Combine both
rsdedup dedup report --min-size 1024 --max-size 104857600
Recursion
By default, rsdedup recurses into subdirectories. Use --no-recursive to scan only the top-level directory:
rsdedup dedup report --no-recursive /data
Symbolic links
By default, symbolic links are not followed. Use --follow-symlinks to follow them:
rsdedup dedup report --follow-symlinks /data
Parallelism
rsdedup uses multi-threading to speed up the comparison phase of duplicate detection.
How it works
The comparison pipeline processes files grouped by size. Each size group is processed independently, which makes it a natural fit for parallelism. rsdedup uses rayon to distribute size groups across a thread pool — multiple size groups are compared concurrently.
Size group A (all 1 KB files) ──→ Thread 1
Size group B (all 5 KB files) ──→ Thread 2
Size group C (all 12 KB files) ──→ Thread 3
...
What is parallelized
- Comparison phase — size groups are processed in parallel using rayon's par_iter. Each thread handles hashing and comparing files within one size group.
What is not parallelized
- Directory scanning — uses walkdir, which is single-threaded and I/O-bound.
- Actions (delete, hardlink, symlink) — performed sequentially after duplicates are found.
- Within a single size group — files in the same size group are hashed and compared sequentially. This means a single large size group (many files of the same size) will not benefit from additional threads.
Controlling thread count
Use the --jobs (or -j) flag to set the number of worker threads:
# Use 4 threads
rsdedup dedup report --jobs 4 /data
# Use a single thread (no parallelism)
rsdedup dedup report --jobs 1 /data
The default is the number of CPU cores reported by std::thread::available_parallelism().
When parallelism helps most
Parallelism provides the biggest speedup when:
- There are many size groups with duplicates to compare — more groups means more work to distribute across threads.
- Files are large — hashing large files is CPU-intensive, so parallel hashing of different size groups gives a significant speedup.
- The storage is fast (SSD/NVMe) — on slow spinning disks, I/O is the bottleneck and adding threads may not help.
Parallelism helps less when:
- Most files fall into one or a few size groups — there isn’t enough independent work to distribute.
- Files are very small — hashing is fast and the overhead of thread coordination dominates.
- Using --compare byte-for-byte — byte-for-byte comparison is I/O-heavy, so additional CPU threads offer less benefit.
Output Formats
rsdedup supports two output formats, selected with --output <FORMAT>.
Text (default)
Human-readable output showing duplicate groups and a summary.
Group 1 — 3 files, 12 bytes each (hash: a948904f2f0f479b):
/home/user/photos/img001.jpg
/home/user/photos/backup/img001.jpg
/home/user/photos/old/img001.jpg
--- Summary ---
Files scanned: 150
Duplicate groups: 1
Duplicate files: 2
Wasted space: 24 bytes
Action: report
Files affected: 0
Space recovered: 0 bytes
JSON
Machine-readable JSON output for scripting and integration with other tools.
rsdedup dedup report --output json /path
The duplicate groups are output as a JSON array:
[
{
"group": 1,
"size": 12,
"hash": "a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447",
"files": [
"/home/user/photos/img001.jpg",
"/home/user/photos/backup/img001.jpg",
"/home/user/photos/old/img001.jpg"
]
}
]
Followed by a JSON summary object:
{
"files_scanned": 150,
"duplicate_groups": 1,
"duplicate_files": 2,
"wasted_bytes": 24,
"action_taken": "report",
"files_affected": 0,
"bytes_recovered": 0
}
Exit Codes
rsdedup uses meaningful exit codes for scripting:
| Code | Meaning |
|---|---|
| 0 | Success, no duplicates found |
| 1 | Success, duplicates found |
| 2 | Error |
Examples
# Check if a directory has duplicates
if rsdedup dedup report /data > /dev/null 2>&1; then
echo "No duplicates"
else
echo "Duplicates found"
fi
# Use in CI to fail if duplicates exist
rsdedup dedup report /assets && echo "Clean" || echo "Duplicates detected"
Design
Overview
rsdedup is a fast, Rust-based file deduplication tool. It scans directories for duplicate files and supports multiple actions: reporting, hardlinking/symlinking, and deleting duplicates (keeping one copy).
Goals
- Fast duplicate detection across large directory trees
- Multiple comparison strategies (hash-based, size+hash, byte-for-byte)
- Multiple actions on duplicates (report, hardlink, symlink, delete)
- Safe defaults — report-only unless explicitly told to modify files
- Parallel file hashing for performance
Pipeline Architecture
rsdedup processes files through a multi-stage pipeline where each stage reduces the candidate set:
1. Scan → Walk directories, collect file metadata
2. Group → Group files by size (unique sizes eliminated)
3. Filter → Apply min-size, max-size, include/exclude filters
4. Compare → Compare candidates using the chosen strategy
5. Act → Perform the chosen action on duplicate groups
Module Structure
src/
├── main.rs — Orchestration and command dispatch
├── cli.rs — CLI definitions (clap derive)
├── scanner.rs — Directory walking with walkdir
├── grouper.rs — Group files by size
├── compare.rs — Comparison strategies (size-hash, hash, byte-for-byte)
├── hasher.rs — Hash implementations (SHA-256, xxHash, BLAKE3)
├── cache.rs — Persistent hash cache (sled)
├── action.rs — Actions: report, delete, hardlink, symlink
├── output.rs — Output formatting (text, JSON)
├── types.rs — Shared types
└── error.rs — Exit codes
Key Types
use std::fs::Metadata;
use std::path::PathBuf;

struct FileEntry {
    path: PathBuf,
    size: u64,
    metadata: Metadata,
}

struct DuplicateGroup {
    size: u64,
    hash: String,
    files: Vec<FileEntry>,
}

enum CompareMethod {
    SizeHash,
    Hash,
    ByteForByte,
}

enum KeepStrategy {
    Interactive,
    First,
    Newest,
    Oldest,
    ShortestPath,
}
Parallelism
- Directory walking is single-threaded (I/O bound, using walkdir)
- File comparison uses a rayon thread pool — size groups are processed in parallel
- Within a single size group, files are hashed and compared sequentially
- Thread count is configurable via --jobs (defaults to CPU core count)
See the Parallelism chapter for details on controlling thread count and when parallelism helps most.
Cache Design
The hash cache uses sled, an embedded key-value store at ~/.rsdedup/cache.db. Each entry maps a file path to its metadata (size, mtime, inode) and hash values (partial and full). Cache entries are invalidated when any metadata field changes. The cache merges partial and full hashes — computing one doesn’t overwrite the other.
See the Hash Cache chapter for details.
Safety
- Default action is report-only — no files are modified unless explicitly requested
- --dry-run shows what would happen without making changes
- No cross-filesystem hardlinks — detected and reported as errors
- Symlink loops are avoided by not following symlinks by default
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success, no duplicates found |
| 1 | Success, duplicates found |
| 2 | Error |
This makes rsdedup scriptable (e.g. rsdedup dedup report && echo "clean").
Dependencies
| Crate | Purpose |
|---|---|
| clap | CLI argument parsing |
| clap_complete | Shell completion generation |
| walkdir | Recursive directory traversal |
| rayon | Parallel hashing |
| sha2 | SHA-256 |
| xxhash-rust | xxHash (xxh3-128) |
| blake3 | BLAKE3 |
| sled | Embedded key-value cache |
| bincode | Cache entry serialization |
| serde / serde_json | JSON output |
| globset | Include/exclude glob matching |
| anyhow | Error handling |