I wrote a Bash script that finds duplicate files. It compares file sizes and checksums to decide whether files are (probably) duplicates:

Program

#!/usr/bin/env bash

## Summary: find duplicate files
## Meng Lu <lumeng.dev@gmail.com>

DIR=${1:-$(pwd)} ## use the provided path if available, otherwise the current directory
cd "$DIR" || exit 1 ## search from DIR; the pipeline below starts at "."

FILENAME=$(basename "$0")

TMPFILE=$(mktemp "/tmp/${FILENAME}.XXXXXX") || exit 1
trap 'rm -f "$TMPFILE"' EXIT ## remove the temporary file when the script exits

## one-line version
#find -P . -type f -exec cksum '{}' \; | sort | tee "$TMPFILE" | cut -f 1-2 -d ' ' | uniq -d | grep -if - "$TMPFILE" | sort -nr -t' ' -k2,2 | cut -f 3- -d ' ' | while IFS= read -r line; do ls -lhta "$line"; done

## multi-line version with comments
find -P . -type f -exec cksum '{}' \; | # find regular files and compute their checksums; -P: never follow symbolic links
sort | # sort the "checksum size name" lines as text, so equal {checksum, size} pairs end up adjacent
tee "$TMPFILE" | # save a copy in a temporary file and pass the stream along
cut -f 1-2 -d ' ' | # keep only the checksum and file size
uniq -d | # keep one copy of each repeated pair; drop pairs that occur only once
grep -if - "$TMPFILE" | # select from the saved listing the lines of files that share a checksum and size; "-f -" reads the patterns from stdin
sort -nr -t' ' -k2,2 | # sort by descending file size
cut -f 3- -d ' ' | # keep only the file name
while IFS= read -r line; do ls -lhta "$line"; done # run an informative ls on each duplicate found

GitHub archive
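To see what the pipeline reports, it can be tried on a throwaway directory. The following is only a sketch: the directory, the three sample files, and the variable names (demo_dir, list_file, duplicates) are made up for illustration, and the final sort-by-size and ls stages are left out so that the output is just the file names.

```shell
#!/usr/bin/env bash
# Build a scratch directory with two identical files and one different file.
# All names here are hypothetical and exist only for this demonstration.
demo_dir=$(mktemp -d) || exit 1
list_file=$(mktemp) || exit 1

printf 'hello\n' > "$demo_dir/a.txt"
printf 'hello\n' > "$demo_dir/b.txt"    # same content as a.txt
printf 'world!\n' > "$demo_dir/c.txt"   # different content and size

# Same stages as the script, without the sort-by-size and ls steps.
duplicates=$(
  cd "$demo_dir" &&
  find -P . -type f -exec cksum '{}' \; |
  sort |
  tee "$list_file" |
  cut -f 1-2 -d ' ' |
  uniq -d |
  grep -if - "$list_file" |
  cut -f 3- -d ' '
)

printf '%s\n' "$duplicates"   # lists ./a.txt and ./b.txt, but not ./c.txt

# Clean up the scratch files.
rm -rf "$demo_dir" "$list_file"
```

Only a.txt and b.txt share both a checksum and a size, so they are the only names printed.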

Notes

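  • find -P . -type f -exec cksum '{}' \;
    • -P: never follow symbolic links;
    • -type f: match regular files rather than directories;
    • -exec cksum '{}' \;: run cksum on each file found ('{}'); cksum prints the checksum, the file size, and the file name, where the file size is the number of octets (bytes);
  • sort: sort the lines in preparation for uniq;
  • tee "$TMPFILE": save the stdout stream to a temporary file while also passing it down the pipe;
  • cut -f 1-2 -d ' ': keep only fields 1 and 2, with fields separated by spaces;
  • uniq -d: print one copy of each repeated line and drop the lines that occur only once;
  • grep -if - "$TMPFILE": -f - reads the patterns from stdin; grep then looks up, in the saved listing, the lines on which a duplicated {checksum, file size} pair occurs. Note that these lines include the file names;
  • sort -nr -t' ' -k2,2: sort the duplicates by file size in descending order;
  • cut -f 3- -d ' ': keep only the file name;
  • while IFS= read -r line; do ls -lhta "$line"; done: print detailed information for each file (IFS= and -r keep leading whitespace and backslashes in file names intact).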
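One caveat about the grep step: the patterns are plain "checksum size" strings and are not anchored, so a pattern such as "123 45" also matches a line that begins with "9123 456". A possible alternative, not part of the original script, is to compare the first two fields exactly with awk. The sketch below uses a hypothetical helper duplicate_lines and a made-up listing (the checksums are not real cksum output) to contrast the two approaches.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines of a "checksum size name" listing
# whose {checksum, size} pair occurs more than once. The file is read twice.
duplicate_lines() {
  awk 'NR == FNR { count[$1 " " $2]++; next }  # first pass: count each pair
       count[$1 " " $2] > 1                    # second pass: print repeated pairs
      ' "$1" "$1"
}

# Made-up listing in cksum output format, already sorted.
listing=$(mktemp) || exit 1
cat > "$listing" <<'EOF'
123 45 ./a.txt
123 45 ./b.txt
9123 456 ./c.txt
EOF

# Exact comparison of the first two fields.
exact=$(duplicate_lines "$listing" | cut -f 3- -d ' ')

# The unanchored grep approach, for comparison.
loose=$(cut -f 1-2 -d ' ' "$listing" | uniq -d | grep -if - "$listing" | cut -f 3- -d ' ')

rm -f "$listing"

printf 'exact:\n%s\n' "$exact"   # ./a.txt and ./b.txt
printf 'loose:\n%s\n' "$loose"   # additionally reports ./c.txt
```

With real checksums such collisions are unlikely but possible, so the exact comparison is the safer choice when the results feed into something destructive like file deletion.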

Related posts

  • find examples
  • grep examples