Saturday, January 13, 2007

 

Removing duplicated files - keeping your files tidy with filetidy

Files often end up duplicated across your various machines. This happens when trees of files are copied from machine to machine and the work within those trees begins to diverge. Reconciling such trees by hand is slow and difficult, so it is often useful to find duplicated files quickly: a list of duplicates gives you a sense of which directories or folders can be deleted, or you can simply remove the duplicates automatically. Here is a script, called filetidy, that uses find, sort, cksum and awk to automate the analysis of duplicate files.

It works in the following manner: find creates a long listing of file information, the listing is sorted by size, and files of the same size (which could therefore be duplicates) are compared using cksum. The output is a list of diff commands, each followed by a commented-out 'rm' command. You can use the output to confirm that files are indeed duplicates and then, once you have decided which files to retain, to delete the duplicates.
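
As a rough illustration of the size-then-checksum idea described above, here is a minimal sketch. It is not the filetidy script itself: the helper name same_content and the demo files are made up for this example.

```shell
#!/bin/sh
# Sketch only: same_content is a hypothetical helper, not part of filetidy.
# It succeeds when two files have equal size and equal cksum CRC.
same_content() {
    # Cheap test first: files of different sizes can never be duplicates.
    [ $(wc -c < "$1") -eq $(wc -c < "$2") ] || return 1
    # Reading from stdin makes cksum print "CRC SIZE" without a file name.
    [ "$(cksum < "$1" | cut -d' ' -f1)" = "$(cksum < "$2" | cut -d' ' -f1)" ]
}

# Made-up demo files: a and b are identical, c has the same size
# but differs in one byte.
demo=$(mktemp -d)
printf 'same\n' > "$demo/a"
printf 'same\n' > "$demo/b"
printf 'sane\n' > "$demo/c"

if same_content "$demo/a" "$demo/b"; then dup_ab=yes; else dup_ab=no; fi
if same_content "$demo/a" "$demo/c"; then dup_ac=yes; else dup_ac=no; fi
echo "a vs b: $dup_ab"
echo "a vs c: $dup_ac"

rm -rf "$demo"
```

Comparing sizes first keeps the expensive step (reading whole files to compute a checksum) limited to files that could actually be duplicates. A matching checksum is strong evidence rather than proof, which is why the filetidy output still runs diff before anything is removed.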



#!/bin/sh
# 1. find regular files only and report a long listing
# 2. sort numerically on the size field
# 3. process same-size files using awk and cksum
# 4. output a script which diffs files for confirmation and
#    can delete files after editing

find . -type f -ls | sort -k7,7n | \
awk 'BEGIN{
    ncount=0
}
# return the checksum of a file, caching it in ck[] so that
# cksum runs at most once per file
function ckfile(filename,    cmd, ckout, array)
{
    if (length(ck[filename])==0){
        cmd="cksum " filename
        ckout=""   # do not reuse the previous result if cksum fails
        cmd | getline ckout
        close(cmd)
        split(ckout, array, " ")
        ck[filename]=array[1]
    }
    return ck[filename]
}
{
    filesize = $7
    # blank all fields except the filename; note that this rebuilds
    # the record, so runs of spaces inside a name collapse to one
    for(i=1;i<=10;i++){
        $(i)="";
    }
    file = $0
    gsub("\\$", "\\$", file) # escape dollars in filename
    gsub("\\(", "\\(", file) # and parentheses
    gsub("\\)", "\\)", file)
    if(match(file,"&")) next; # skip files with ampersands
    if(match(file,"\047")) next; # skip files with apostrophes
    sub("^[ \t]*", "", file) # remove leading white space
    ncount++
    filelistsize[ncount]=filesize
    filelistname[ncount]=file
}
END{
    i=1
    while(i<=ncount){
        j = i+1
        # the list is sorted by size, so same-size files are adjacent
        while( j <= ncount && filelistsize[j] == filelistsize[i] ){
            if ( ckfile(filelistname[i]) == ckfile(filelistname[j]) ) {
                # blank line between groups of duplicates
                if ( ck[filelistname[i]] != oldck ) {
                    if ( first == 1 ) print ""
                    oldck = ck[filelistname[i]]
                    first = 1
                }
                # report each duplicate only once
                if( !visited[filelistname[j]] ){
                    visited[filelistname[j]]++
                    fn = filelistname[j]
                    print "diff " filelistname[j] " " \
                        filelistname[i] " # " filelistsize[j]
                    print "#if [ $? -eq 0 ] ; then rm -f " fn "; fi"
                }
            }
            j++
        }
        i++
    }
}'

Usage is typically:


filetidy.sh | tee tmp.txt

Examine tmp.txt to confirm that the duplicates identified make sense, then uncomment the 'rm' commands for the files you want to delete (remove the leading '#'), and run the edited file to remove the duplicated files.


source tmp.txt
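
The review-and-delete step above can be sketched as follows. This is only an illustration under stated assumptions: the file names, the sample listing and the cleanup.sh name are made up, and the sed command enables every 'rm' line at once, which you would only want after checking the list. It runs the result with sh in a subshell rather than with source.

```shell
#!/bin/sh
# Sketch only: paths and file names below are made up for the demo.
work=$(mktemp -d)
mkdir "$work/copy"
printf 'notes\n' > "$work/notes.txt"
printf 'notes\n' > "$work/copy/notes.txt"

# Hand-written sample in the general shape of filetidy output.
cat > "$work/tmp.txt" <<'EOF'
diff ./copy/notes.txt ./notes.txt # 6
#if [ $? -eq 0 ] ; then rm -f ./copy/notes.txt; fi
EOF

# How many duplicate candidates were reported?
ndups=$(grep -c '^diff ' "$work/tmp.txt")
echo "candidates: $ndups"

# Enable every rm line by stripping the leading "#".
sed 's/^#if /if /' "$work/tmp.txt" > "$work/cleanup.sh"

# Run from the directory where the listing was made, so that the
# relative paths resolve.
( cd "$work" && sh cleanup.sh )

if [ -e "$work/notes.txt" ]; then kept=yes; else kept=no; fi
if [ -e "$work/copy/notes.txt" ]; then removed=no; else removed=yes; fi
echo "original kept: $kept, duplicate removed: $removed"

rm -rf "$work"
```

Because each 'rm' is guarded by the exit status of the preceding diff, a file is only removed when diff confirms that it is identical to the copy being kept.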
