Fehérke's GitHub Site == glue between Fehérke's GitHub hosted projects

duplicated.sh - search for duplicated files

An e-friend wanted to search for duplicated files on his machine, but did not know how. He wrote me this : “Although I know you have a shell script for this purpose, that does not help me too much on xp.” Two mistakes : 1) no, I did not have such a script; 2) yes, shell scripts can help on the market-leading operating system too.

So I wrote duplicated.sh and ran it in Cygwin. It works perfectly. Sadly, I cannot say the same about my other two tries :

duplicated.sh had to deal with 27.67 GB of data. At the end it displays the execution time, and I am satisfied with it.

Usage

sample output - duplicated.sh --help

duplicated.sh   version 1.4   august 2008   written by Feherke
search for multiple files with the same content

Syntax :
  duplicate.sh [-s size] [-a algorithm] [directory [...]]

Parameters :
  -s size | --size=size  - file size in bytes, smallers are not checked ( 0 )
  -a algorithm | --algo=algorithm  - checksum algorithm : MD5 or SHA1 ( MD5 )
  directory  - directory with path to include in the search ( . )

All searches are recursive.

It only generates a list of duplicates; dealing with them is left to the user :

sample output - duplicated.sh

duplicated.sh   version 1.2   march 2007   written by Feherke
searches for multiple files with the same content
creating temporary directory... Ok ( tmp.oDYUuq2932 )
creating file list... Ok ( 890340, error 7707 )
searching for duplicated file sizes... Ok ( 25028 )
creating list of potential duplicated files... Ok ( 872781 )
collecting MD5 checksums... Ok ( 871992, error 789 )
searching for duplicated checksums... Ok ( 98577 )
creating list of duplicated files... Ok ( 725121 )
creating result list... Ok ( duplicate.txt )
cleaning up temporary data... Ok ( 352M )
all done in 1 hours 50 minutes 28 seconds.

The numbers reported as errors above are caused by files and directories that are not accessible to any kind of user.
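The stages named in the sample output (file list, duplicated sizes, checksums, duplicated checksums) can be illustrated with a minimal sketch. This is not the code of duplicated.sh; the function name find_dupes is hypothetical, and the sketch assumes GNU find, awk, cut, sort, uniq and md5sum (as available in Cygwin or on Linux) and file names without newlines or backslashes. Unlike the real script, it silently skips unreadable files instead of counting them.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a size-then-checksum duplicate search.
# NOT the actual duplicated.sh; assumes GNU tools and "ordinary" file names.
#
# Pipeline stages:
#   1. find   : list every regular file as "size<TAB>path"
#   2. awk    : keep only files whose size occurs more than once
#   3. cut    : drop the size column, leaving the path
#   4. md5sum : checksum only those candidate files
#   5. sort + uniq : keep lines whose first 32 characters (the MD5 hash)
#                    repeat, with a blank line between groups
find_dupes() {
  local dir=${1:-.}

  find "$dir" -type f -printf '%s\t%p\n' 2>/dev/null |
    awk -F'\t' '
      { count[$1]++; line[NR] = $0; size[NR] = $1 }
      END {
        for (i = 1; i <= NR; i++)
          if (count[size[i]] > 1) print line[i]
      }' |
    cut -f2- |
    while IFS= read -r path; do
      md5sum -- "$path" 2>/dev/null
    done |
    sort |
    uniq -w32 --all-repeated=separate
}
```

Calling find_dupes with a directory prints one "checksum  path" line per duplicated file. Comparing sizes first means that checksums are computed only for files that could possibly have a twin, which is the expensive step on large data sets.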

Configuration

None.

The only way to influence duplicated.sh’s behavior is through its command-line parameters.

Versions

Plans

Download

You can find the related files on GitHub in my Bash-script repository’s duplicated directory :