File Deduplicator OpenClaw Skill - ClawHub
Do you want your AI agent to automate File Deduplicator workflows? This free skill from ClawHub helps with AI & LLM tasks without building custom tools from scratch.
What this skill does
Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.
Install
npx clawhub@latest install file-deduplicator
Full SKILL.md
| name | description |
|---|---|
| file-deduplicator | Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management. |
File-Deduplicator - Find and Remove Duplicates
Vernox Utility Skill - Clean up your digital hoard.
Overview
File-Deduplicator is an intelligent duplicate file finder and remover. It uses content hashing to identify identical files across directories, then offers options to remove duplicates safely.
Features
✅ Duplicate Detection
- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)
✅ Removal Options
- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)
✅ Analysis Tools
- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation
✅ Safety Features
- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (don't remove huge files by mistake)
- Whitelist important directories
- Undo functionality (log for recovery)
Installation
clawhub install file-deduplicator
Quick Start
Find Duplicates in Directory
const result = await findDuplicates({
directories: ['./documents', './downloads', './projects'],
options: {
method: 'content', // content-based comparison
includeSubdirs: true
}
});
console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);
Remove Duplicates Automatically
const result = await removeDuplicates({
directories: ['./documents', './downloads'],
options: {
method: 'content',
keep: 'newest', // keep newest, delete oldest
action: 'delete', // or 'move' to archive
autoConfirm: false // show confirmation for each
}
});
console.log(`Removed ${result.filesRemoved} duplicates`);
console.log(`Space saved: ${result.spaceSaved}`);
Dry-Run Preview
const result = await removeDuplicates({
directories: ['./documents', './downloads'],
options: {
method: 'content',
keep: 'newest',
action: 'delete',
dryRun: true // Preview without actual deletion
}
});
console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
console.log(`${i+1}. ${dup.file}`);
});
Tool Functions
findDuplicates
Find duplicate files across directories.
Parameters:
- directories (array|string, required): Directory paths to scan
- options (object, optional):
  - method (string): 'content' | 'size' | 'name' - comparison method
  - includeSubdirs (boolean): Scan recursively (default: true)
  - minSize (number): Minimum size in bytes (default: 0)
  - maxSize (number): Maximum size in bytes (default: 0, i.e. no upper limit)
  - excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
  - whitelist (array): Directories to never scan (default: [])
Returns:
- duplicates (array): Array of duplicate groups (shape sketched below)
- duplicateCount (number): Number of duplicate groups found
- totalFiles (number): Total files scanned
- scanDuration (number): Time taken to scan (ms)
- spaceWasted (number): Total bytes wasted by duplicates
- spaceSaved (number): Potential savings if duplicates are removed
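The precise shape of each duplicates entry isn't specified above. Judging from the examples later in this document (set.files, set.totalSize, set.toDelete), a group plausibly looks like the following sketch; treat these field names as assumptions until checked against a real result.
// Hypothetical duplicate-group shape, inferred from the examples
// in this document - verify against an actual findDuplicates() result.
const exampleGroup = {
  files: ['./documents/report.pdf', './downloads/report.pdf'], // identical copies
  totalSize: 2097152,  // combined bytes across the group
  toDelete: []         // candidates selected by removeDuplicates / dry runs
};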
removeDuplicates
Remove duplicate files based on findings.
Parameters:
- directories (array|string, required): Same as findDuplicates
- options (object, optional):
  - keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which copy to keep
  - action (string): 'delete' | 'move' | 'archive'
  - archivePath (string): Where to move files when action='move'
  - dryRun (boolean): Preview without taking action
  - autoConfirm (boolean): Auto-confirm deletions
  - sizeThreshold (number): Don't remove files larger than this (bytes)
Returns:
- filesRemoved (number): Number of files removed/moved
- spaceSaved (number): Bytes saved
- groupsProcessed (number): Number of duplicate groups handled
- logPath (string): Path to the action log
- errors (array): Any errors encountered
analyzeDirectory
Analyze a single directory for duplicates.
Parameters:
- directory (string, required): Path to directory
- options (object, optional): Same as findDuplicates options
Returns:
- fileCount (number): Total files in directory
- totalSize (number): Total bytes in directory
- duplicateSize (number): Bytes in duplicate files
- duplicateRatio (number): Percentage of files that are duplicates
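A usage sketch for analyzeDirectory, assuming the same object-argument calling style as findDuplicates and using only the parameters and return fields documented above:
const stats = await analyzeDirectory({
  directory: './downloads',
  options: { method: 'content' }
});
console.log(`${stats.fileCount} files, ${stats.totalSize} bytes total`);
console.log(`${stats.duplicateSize} bytes duplicated (${stats.duplicateRatio}% of files)`);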
Use Cases
Digital Hoarder Cleanup
- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders
Document Management
- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat
Project Cleanup
- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD
Backup Optimization
- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives
Configuration
Edit config.json:
{
"detection": {
"defaultMethod": "content",
"sizeTolerancePercent": 0, // exact match only
"nameSimilarity": 0.7, // 0-1, lower = more similar
"includeSubdirs": true
},
"removal": {
"defaultAction": "delete",
"defaultKeep": "newest",
"archivePath": "./archive",
"sizeThreshold": 10485760, // 10MB threshold
"autoConfirm": false,
"dryRunDefault": false
},
"exclude": {
"patterns": [".git", "node_modules", ".vscode", ".idea"],
"whitelist": ["important", "work", "projects"]
}
}
Methods
Content-Based (Recommended)
- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives
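To make the content-based approach concrete, here is a minimal standalone Node.js sketch of the idea - illustrative only, not this skill's actual implementation:
const crypto = require('crypto');
const fs = require('fs');
// Group file paths by the MD5 hash of their contents;
// paths sharing a hash are byte-identical duplicates.
function groupByContent(paths) {
  const groups = new Map();
  for (const p of paths) {
    const hash = crypto.createHash('md5').update(fs.readFileSync(p)).digest('hex');
    if (!groups.has(hash)) groups.set(hash, []);
    groups.get(hash).push(p);
  }
  // Keep only hashes seen more than once.
  return [...groups.values()].filter(g => g.length > 1);
}
In practice, deduplicators usually compare sizes first and hash only same-size candidates - which is why the size-based method below is useful as a cheap first pass.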
Size-Based
- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- Can flag near-matches when a size tolerance is configured - but equal size never guarantees identical content, so review before deleting (see the sketch below)
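A matching sketch of the size-first idea (again illustrative): bucket paths by byte size and treat same-size groups as candidates for closer inspection.
const fs = require('fs');
// Bucket paths by exact byte size. Same-size files are duplicate
// *candidates* only - contents must still be compared before deleting.
function groupBySize(paths) {
  const bySize = new Map();
  for (const p of paths) {
    const size = fs.statSync(p).size;
    if (!bySize.has(size)) bySize.set(size, []);
    bySize.get(size).push(p);
  }
  return [...bySize.values()].filter(g => g.length > 1);
}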
Name-Based
- Compares filenames
- Detects similar named files
- Good for finding version duplicates (file_v1, file_v2)
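The skill's actual similarity metric isn't documented here, but a common way to score names on the 0-1 scale used by nameSimilarity in config.json is bigram overlap (the Dice coefficient) - a hedged sketch:
// Dice coefficient over character bigrams: 1.0 means identical names.
// Illustrative only; the skill's real metric is not specified.
function nameSimilarity(a, b) {
  const bigrams = (s) => {
    const counts = new Map();
    for (let i = 0; i < s.length - 1; i++) {
      const bg = s.slice(i, i + 2).toLowerCase();
      counts.set(bg, (counts.get(bg) || 0) + 1);
    }
    return counts;
  };
  const A = bigrams(a), B = bigrams(b);
  let overlap = 0;
  for (const [bg, n] of A) overlap += Math.min(n, B.get(bg) || 0);
  return (2 * overlap) / Math.max(1, (a.length - 1) + (b.length - 1));
}
console.log(nameSimilarity('file_v1.txt', 'file_v2.txt')); // ≈ 0.8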
Examples
Find Duplicates in Documents
const result = await findDuplicates({
directories: '~/Documents',
options: {
method: 'content',
includeSubdirs: true
}
});
console.log(`Found ${result.duplicateCount} duplicate sets`);
result.duplicates.slice(0, 5).forEach((set, i) => {
console.log(`Set ${i+1}: ${set.files.length} files`);
console.log(` Total size: ${set.totalSize} bytes`);
});
Remove Duplicates, Keep Newest
const result = await removeDuplicates({
directories: '~/Documents',
options: {
keep: 'newest',
action: 'delete'
}
});
console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);
Move to Archive Instead of Delete
const result = await removeDuplicates({
directories: '~/Downloads',
options: {
keep: 'newest',
action: 'move',
archivePath: '~/Documents/Archive'
}
});
console.log(`Archived ${result.filesRemoved} files`);
console.log(`Safe in: ~/Documents/Archive`);
Dry-Run Preview Changes
const result = await removeDuplicates({
directories: '~/Documents',
options: {
dryRun: true // Just show what would happen
}
});
console.log('=== Dry Run Preview ===');
result.duplicates.forEach((set, i) => {
console.log(`Would delete: ${set.toDelete.join(', ')}`);
});
Performance
Scanning Speed
- Small directories (<1000 files): <1s
- Medium directories (1000-10000 files): 1-5s
- Large directories (10000+ files): 5-20s
Detection Accuracy
- Content-based: exact duplicates (MD5 collisions are negligible in practice)
- Size-based: Fast but may miss renamed files
- Name-based: Detects naming patterns only
Memory Usage
- Hash cache: ~1MB per 100,000 files
- Batch processing: Processes 1000 files at a time
- Peak memory: ~200MB for 1M files
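The batching strategy above might look like this in spirit (an illustrative sketch, not the skill's source): process the file list in fixed-size chunks so peak memory is bounded by the chunk size, not the total file count.
// Process paths in chunks of 1000 so memory stays bounded
// regardless of how many files are scanned.
async function processInBatches(paths, handler, batchSize = 1000) {
  for (let i = 0; i < paths.length; i += batchSize) {
    await handler(paths.slice(i, i + batchSize));
  }
}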
Safety Features
Size Thresholding
Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.
Archive Mode
Move files to archive directory instead of deleting. No data loss, full recoverability.
Action Logging
All deletions/moves are logged to file for recovery and audit.
Undo Functionality
Log file can be used to restore accidentally deleted files (limited undo window).
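The log format isn't specified here. One plausible shape is a JSON-lines file with one entry per action, which an undo pass replays in reverse; note this can only restore files that were moved or archived, since a plain delete leaves nothing to restore. A hypothetical sketch:
const fs = require('fs');
// Hypothetical undo: replay a JSON-lines action log in reverse,
// moving archived files back to their original paths.
// Assumes entries like {"action":"move","from":"...","to":"..."}.
function undoFromLog(logPath) {
  const entries = fs.readFileSync(logPath, 'utf8')
    .split('\n').filter(Boolean).map(line => JSON.parse(line));
  for (const e of entries.reverse()) {
    if (e.action === 'move' && fs.existsSync(e.to)) {
      fs.renameSync(e.to, e.from); // put the file back
    }
  }
}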
Error Handling
Permission Errors
- Clear error message
- Suggest running with sudo
- Skip files that can't be accessed
File Lock Errors
- Detect locked files
- Skip and report
- Suggest closing applications using files
Space Errors
- Check available disk space before deletion
- Warn if space is critically low
- Prevent disk-full scenarios
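In spirit, the skip-and-report behavior described above looks like this (illustrative, not the skill's source):
const fs = require('fs');
// Read each file, collecting failures instead of crashing on
// EACCES (permission denied) or EBUSY (locked by another process).
function safeRead(paths) {
  const contents = [];
  const skipped = [];
  for (const p of paths) {
    try {
      contents.push({ path: p, data: fs.readFileSync(p) });
    } catch (err) {
      if (err.code === 'EACCES' || err.code === 'EBUSY') {
        skipped.push({ path: p, reason: err.code }); // skip and report
      } else {
        throw err; // unexpected errors still surface
      }
    }
  }
  return { contents, skipped };
}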
Troubleshooting
Not Finding Expected Duplicates
- Check detection method (content vs size vs name)
- Verify exclude patterns aren't too broad
- Check if files are in whitelisted directories
- Test a single directory with includeSubdirs: false to isolate where detection fails
Deletion Not Working
- Check write permissions on directories
- If autoConfirm is false, each deletion waits for a confirmation prompt - answer it or set autoConfirm: true
- Check size threshold isn't blocking all deletions
- Check file locks (is another program using files?)
Slow Scanning
- Reduce includeSubdirs scope
- Use size-based detection (faster)
- Exclude large directories (node_modules, .git)
- Process directories individually instead of batch
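For example, a narrowed scan using only the documented options (directory names here are placeholders):
const result = await findDuplicates({
  directories: './projects',
  options: {
    method: 'size',                            // cheap first pass
    excludePatterns: ['.git', 'node_modules'],
    minSize: 1024                              // skip tiny files
  }
});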
Tips
Best Results
- Use content-based detection for documents (100% accurate)
- Run dry-run first to preview changes
- Archive instead of delete for important files
- Check logs if anything unexpected was deleted
Performance Optimization
- Process frequently used directories first
- Use size threshold to skip large media files
- Exclude hidden directories from scan
- Process directories in parallel when possible
Space Management
- Regular duplicate cleanup prevents storage bloat
- Delete temp directories regularly
- Clear download folders of installers
- Empty trash before large scans
Roadmap
- [ ] Duplicate detection by image similarity
- [ ] Near-duplicate detection (similar but not exact)
- [ ] Duplicate detection across network drives
- [ ] Cloud storage integration (S3, Google Drive)
- [ ] Automatic scheduling of scans
- [ ] Heuristic duplicate detection (ML-based)
- [ ] Recover deleted files from backup
- [ ] Duplicate detection by file content similarity (not just hash)
License
MIT
Find duplicates. Save space. Keep your system clean. 🔮