Saturday, January 13, 2007
Removing duplicated files - keeping your files tidy with filetidy
Frequently you find that files have been duplicated across your various machines. This happens when trees of files are copied from machine to machine and work begins to diverge within these trees. Manually reconciling such trees is slow and difficult, so it is often useful to find duplicated files quickly. Knowing which files are duplicated gives you a sense of which directories or folders can be deleted, or you can simply remove the duplicates automatically. Here is a script, called filetidy, that uses find, sort, cksum and awk to automate the analysis of duplicate files.
It works in the following manner: a long listing of file information is created using find, the listing is sorted by file size, and files of the same size (which could therefore be duplicates) are compared using cksum. The output is a list of diff commands, each followed by a commented-out 'rm' command. You can use the diff commands to confirm that files are indeed duplicates and then, once you have decided which files to retain, delete the duplicates.
#!/bin/sh
# 1. find regular files only and report a long listing
# 2. sort numerically on the size field
# 3. process same-size files using awk and cksum
# 4. output a script which diffs files for confirmation and
#    can delete files after editing
find . -type f -ls | sort -k7,7n | \
awk 'BEGIN{
    ncount=0
}
# return the checksum of a file, caching results in ck[]
function ckfile(filename,    cmd, ckout, array)
{
    if (!(filename in ck)){
        cmd="cksum " filename
        ckout=""                 # do not reuse a stale value if cksum fails
        cmd | getline ckout
        close(cmd)
        split(ckout, array, " ")
        ck[filename]=array[1]
    }
    return ck[filename]
}
{
    filesize = $7
    for(i=1;i<=10;i++){ # remove all fields except the filename
        $(i)="";
    }
    file = $0
    sub("^[ \t]*", "", file)      # remove leading white space
    if(match(file,"&")) next;     # skip files with ampersands
    if(match(file,"\047")) next;  # skip files with apostrophes
    gsub("\\$", "\\$", file)      # escape dollars in the filename
    gsub("\\(", "\\(", file)      # and parentheses
    gsub("\\)", "\\)", file)
    gsub(" ", "\\ ", file)        # and spaces
    ncount++
    filelistsize[ncount]=filesize
    filelistname[ncount]=file
}
END{
    i=1
    while(i<=ncount){
        j = i+1
        # files of equal size are adjacent after the sort
        while( j <= ncount && filelistsize[j] == filelistsize[i] ){
            cki = ckfile(filelistname[i])
            # an empty checksum means cksum could not read the file
            if ( cki != "" && cki == ckfile(filelistname[j]) ) {
                if ( cki != oldck ) {
                    if ( first == 1 ) print "" # blank line between groups
                    oldck = cki
                    first = 1
                }
                if( !visited[filelistname[j]] ){
                    visited[filelistname[j]]++
                    fn = filelistname[j]
                    print "diff " fn " " filelistname[i] " # " filelistsize[j]
                    print "#if [ $? -eq 0 ] ; then rm -f " fn "; fi"
                }
            }
            j++
        }
        i++
    }
}'
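The core idea, comparing checksums only for files that already have the same size, can be tried in isolation on a few scratch files. This is a minimal sketch and not part of filetidy: the helper names crc_of and maybe_dup are made up for illustration, and it assumes mktemp -d is available.

```sh
#!/bin/sh
# Sketch only: crc_of and maybe_dup are illustrative names, not part of filetidy.

# Print just the CRC field from the "CRC SIZE NAME" line that cksum writes.
crc_of() {
    cksum "$1" | awk '{ print $1 }'
}

# Print "yes" when two files have the same size and the same CRC, else "no".
maybe_dup() {
    size1=$(( $(wc -c < "$1") ))
    size2=$(( $(wc -c < "$2") ))
    if [ "$size1" -ne "$size2" ]; then
        echo no     # different sizes, so no checksum is needed
    elif [ "$(crc_of "$1")" = "$(crc_of "$2")" ]; then
        echo yes    # still worth confirming with diff before deleting
    else
        echo no
    fi
}

# Try it on three scratch files.
demo=$(mktemp -d)
printf 'hello\n' > "$demo/a.txt"
printf 'hello\n' > "$demo/b.txt"
printf 'jello\n' > "$demo/c.txt"        # same size as a.txt, different content
maybe_dup "$demo/a.txt" "$demo/b.txt"   # prints: yes
maybe_dup "$demo/a.txt" "$demo/c.txt"   # prints: no
rm -rf "$demo"
```

Checking the size first is cheap because the size comes straight from the file listing, so cksum only has to read files that could plausibly match.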
Usage is typically:
filetidy.sh | tee tmp.txt
Examine tmp.txt to confirm that the duplicates identified make sense, uncomment the 'rm' lines for the duplicates you want to delete, and then run the edited file to remove the duplicated files:
source tmp.txt
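Each pair of lines in tmp.txt runs diff and then deletes one file only if diff reported no differences. As a hedged sketch of that same check in a single step, the hypothetical function remove_if_same below (not something filetidy generates) removes its first argument only when it is identical to the second:

```sh
#!/bin/sh
# Sketch only: remove_if_same is an illustrative name, not generated by filetidy.

# Delete the candidate duplicate ($1) only if it is identical to the file
# being kept ($2). diff exits with status 0 when the files do not differ.
remove_if_same() {
    if diff "$1" "$2" > /dev/null; then
        rm -f "$1"
    fi
}

# Try it on scratch files.
demo=$(mktemp -d)
printf 'same\n'  > "$demo/keep.txt"
printf 'same\n'  > "$demo/copy.txt"
printf 'other\n' > "$demo/different.txt"
remove_if_same "$demo/copy.txt" "$demo/keep.txt"        # copy.txt is removed
remove_if_same "$demo/different.txt" "$demo/keep.txt"   # different.txt is kept
rm -rf "$demo"
```

Running diff immediately before the rm guards against a file having changed between generating tmp.txt and running it.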
Labels: bash awk cygwin
