Sunday, February 01, 2009
Synchronizing Directories And Files With A USB Drive
If you are interested in this script, there is an arguably better version that uses find, instead of recursion in bash, to traverse the directories. It is much faster. (You can obtain a copy here: Bash script to synchronize directories.)
Every time that I check on the price of USB drives, it seems that the amount of storage that one can buy for $10 has doubled! Moore's Law seems to be a little accelerated for USB drives...
At this rate, in two years (say 2011) 256 GB USB drives will cost $10.
So, like many people, I store more and more information on USB drives.
And, like many people, I then rapidly run into the problem of keeping directory trees synchronized. It is actually a difficult problem, because although you know from the file timestamps which files are the latest on the USB, you do not necessarily know the history of the files and directories. So if a file is deleted on the USB but still exists on the hard drive, what do you do? You either remove the file on the hard drive, or create the file on the USB, but knowing which action is the correct one is difficult. As you can create, modify, and remove files using a variety of programs, capturing the history necessary to synchronize two directory trees is difficult too. One solution might be to intercept all the OS calls to the file systems involved, but that seems to be a lot of work.
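To make the ambiguity concrete, here is a minimal sketch of how a little recorded history would resolve it. This is not something the script below does. It assumes a hypothetical manifest file, written after each synchronization, that lists the relative paths present at that time; the function name `classify_missing` is also made up for illustration.

```bash
#!/bin/bash
# classify_missing MANIFEST RELPATH  (hypothetical helper, not part of dirsync)
# MANIFEST is assumed to list, one per line, the relative paths that existed
# on both sides after the previous synchronization.
# RELPATH is a file found on the hard drive but absent from the USB drive.
classify_missing () {
    if grep -Fxq -- "$2" "$1"; then
        echo "deleted on USB"     # present at the last sync, so removed since
    else
        echo "new on hard drive"  # never synchronized, so created locally
    fi
}
```

Without such a record, the two cases look identical, which is why the script only reports files missing from one side and leaves the decision to the user.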
There are a variety of programs which set out to provide directory synchronization. Two of the best known are rsync and unison. They are both well worth a look. Rsync in particular is very effective. However, for keeping my USB drive and hard drive in synchronization, I wanted something which I could adjust a little more than rsync, and so I have been using the script below. This is still rather experimental, and comes with no guarantees whatsoever. If you try it, you need to take appropriate precautions for yourself.
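For comparison, a roughly similar preview can be obtained from rsync alone. The wrapper below is only a sketch: the function name `preview_with_rsync` is hypothetical, and the choice of options is my assumption about what comes closest to a one-way, timestamp-based update, not a drop-in replacement for the script.

```bash
#!/bin/bash
# preview_with_rsync SRC TRG  (hypothetical wrapper)
#   -a  archive mode (recursive, preserves times and permissions)
#   -n  dry run: report what would be transferred without copying anything
#   -v  list the files concerned
#   -u  skip files that are newer in the target
#   --modify-window=2  treat timestamps within 2 seconds of each other as equal
preview_with_rsync () {
    # the trailing slashes make rsync compare the contents of the two directories
    rsync -a -n -v -u --modify-window=2 "$1"/ "$2"/
}
```

Dropping `-n` would perform the copy for real.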
The script lets the user enter a 'modification window' in seconds. This allows some latitude when comparing the file timestamps that are used to decide whether to update files in the target directory. It is needed because a 'FAT' USB drive stores modification timestamps at a lower resolution (2 seconds) than typical Windows or Linux file systems. For a FAT device you will probably want to supply '-t 2' to ensure that you don't end up copying lots of files in either direction when the files are really supposed to have the same timestamp.
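The effect of the modification window can be sketched in isolation. The function below is a minimal illustration rather than the script's actual code; the name `needs_update` and its argument order are hypothetical. It treats two timestamps (in seconds since the epoch) as equal when they differ by no more than the window.

```bash
#!/bin/bash
# needs_update SRC_MTIME TRG_MTIME WINDOW  (hypothetical helper)
# Succeeds (exit status 0) only when the source is newer than the target
# by more than WINDOW seconds.
needs_update () {
    [ $(( $1 - $2 )) -gt "$3" ]
}

# A file copied to a FAT drive can appear a second or so away from the original.
if needs_update 1233446401 1233446400 2; then
    echo "copy"
else
    echo "skip"    # a 1-second difference falls inside the 2-second window
fi
```

With a window of 0, the same pair of timestamps would trigger a copy.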
As mentioned, this script is still experimental. I use it with a directory of around 1.5 GB of files, which I synchronize between two computers and a USB drive. Performance is the primary concern, although the script is certainly usable. It calls 'stat' (a lot) to obtain the modification timestamps of the files that it needs to compare. I have been considering replacing this with a single find command to obtain this information in one shot upfront (the command will be something like 'find . -printf "%p\t%T@\n"'). Perhaps this will be the subject of a future script.
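As a rough sketch of that idea, the functions below gather every timestamp with one find call per tree and then do the comparison in awk. This is an illustration under assumptions, not the future script: the names `index_tree` and `stale_in_target` are hypothetical, GNU find is required for -printf, and file names containing tabs or newlines are not handled.

```bash
#!/bin/bash
# index_tree DIR TAG  (hypothetical helper)
# Prints one line per regular file under DIR: TAG, the path relative to DIR
# (%P), and the modification time in seconds since the epoch (%T@),
# separated by tabs.
index_tree () {
    find "$1" -type f -printf "$2\t%P\t%T@\n"
}

# stale_in_target SRC TRG WINDOW  (hypothetical helper)
# Lists files that are missing from TRG, or whose copy in SRC is newer
# than the copy in TRG by more than WINDOW seconds.
stale_in_target () {
    { index_tree "$2" T; index_tree "$1" S; } |
    awk -F '\t' -v w="$3" '
        # target lines arrive first: remember each modification time
        $1 == "T" { mtime[$2] = $3; next }
        # source lines: report files that are missing or out of date
        !($2 in mtime) || $3 - mtime[$2] > w { print $2 }
    ' | sort
}
```

For example, `stale_in_target ~/work /media/usb/work 2` (placeholder paths) would print the relative paths that need copying, using two find processes in total instead of two stat calls per file.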
If you have any comments or questions, please let me know.
#!/bin/bash
# comparefiles SRCDIR TRGDIR NAME
# In compare mode (-k), reports whether the two copies of NAME differ and returns.
# Otherwise determines whether to update the target file and carries out the update.
comparefiles () {
    FILE1="$1/$3"
    FILE2="$2/$3"
    if [ "$COMPARE" = "Y" ]; then
        if [ ! -f "$FILE2" ]; then
            echo "dirsync: warning $FILE2 does not exist"
            NDIFF=$((NDIFF + 1))
        elif ! diff "$FILE1" "$FILE2" > /dev/null; then
            echo "dirsync: $FILE1 $FILE2 differ"
            NDIFF=$((NDIFF + 1))
        fi
        return
    fi
    # the file does not exist in the target yet: copy it across
    if [ ! -f "$FILE2" ]; then
        NCOPY=$((NCOPY + 1))
        if [ "$DRYRUN" = "Y" ]; then
            echo "dirsync: need to /bin/cp -a -i $FILE1 $FILE2"
        else
            echo "dirsync: copying new item $FILE1 to $FILE2"
            /bin/cp -a -i "$FILE1" "$FILE2"
        fi
        return
    fi
    # compare the modification times (seconds since the epoch)
    FILETIME1=$(stat -c '%Y' "$FILE1")
    FILETIME2=$(stat -c '%Y' "$FILE2")
    TIMEDIFF=$((FILETIME1 - FILETIME2))
    NEGTIMEWINDOW=$((0 - TIMEWINDOW))
    if [ "$TIMEDIFF" -gt "$TIMEWINDOW" ]; then
        # the source is newer by more than the window: update the target
        echo "dirsync: (t=$TIMEDIFF) copying file $FILE1 to $FILE2"
        echo "dirsync: $FILE1: $(stat -c '%s %y' "$FILE1")"
        echo "dirsync: $FILE2: $(stat -c '%s %y' "$FILE2")"
        NCOPY=$((NCOPY + 1))
        if [ "$DRYRUN" = "Y" ]; then
            echo "dirsync: need to chmod u+w $FILE2"
            echo "dirsync: /bin/cp -a $FILE1 $FILE2"
        else
            chmod u+w "$FILE2"
            /bin/cp -a "$FILE1" "$FILE2"
        fi
    elif [ "$TIMEDIFF" -lt "$NEGTIMEWINDOW" ]; then
        # the target is newer by more than the window: check the contents
        echo "dirsync: warning newer file in target TIMEDIFF: $TIMEDIFF"
        echo "dirsync: $FILE1: $(stat -c '%s %y' "$FILE1")"
        echo "dirsync: $FILE2: $(stat -c '%s %y' "$FILE2")"
        echo "dirsync: diffing files"
        if diff "$FILE1" "$FILE2" > /dev/null; then
            # same contents: copying only resets the target's timestamp
            echo "dirsync: the files are the same - update target"
            echo "dirsync: requires /bin/cp -a $FILE1 $FILE2"
            NCOPY=$((NCOPY + 1))
            if [ "$DRYRUN" != "Y" ]; then
                /bin/cp -a "$FILE1" "$FILE2"
            fi
        else
            # different contents: cp -i asks before overwriting the newer file
            echo "dirsync: files differ"
            echo "dirsync: requires /bin/cp -a -i $FILE1 $FILE2"
            NCOPY=$((NCOPY + 1))
            if [ "$DRYRUN" != "Y" ]; then
                /bin/cp -a -i "$FILE1" "$FILE2"
            fi
        fi
    fi
}
# searchdir SRCDIR TRGDIR
# Walks SRCDIR recursively, synchronizing (or comparing) each file against
# TRGDIR, then reports items that exist only in TRGDIR.
searchdir () {
    local ITEM    # keep the loop variable private to each level of the recursion
    if [ "$COMPARE" = "Y" ]; then
        echo "dirsync: comparing $1 and $2"
    fi
    if [ ! -d "$2" ]; then
        if [ "$DRYRUN" = "Y" ]; then
            echo "dirsync: need to mkdir $2"
        elif [ "$COMPARE" = "Y" ]; then
            # compare mode must not modify the target
            echo "dirsync: warning directory $2 does not exist"
        else
            mkdir "$2"
        fi
    fi
    # note: "*" does not match hidden (dot) files, so they are skipped
    for ITEM in "$1"/*; do
        ITEM=$(basename "$ITEM")
        if [ -h "$1/$ITEM" ]; then
            echo "dirsync: $1/$ITEM is a link and links are not handled"
        elif [ -f "$1/$ITEM" ]; then
            comparefiles "$1" "$2" "$ITEM"
            NFILE=$((NFILE + 1))
        elif [ -d "$1/$ITEM" ]; then
            searchdir "$1/$ITEM" "$2/$ITEM"
            NDIRS=$((NDIRS + 1))
        fi
    done
    for ITEM in "$2"/*; do
        ITEM=$(basename "$ITEM")
        # the check on the existence of the second item handles the case where
        # the wild card matched nothing and was left as a literal "*"
        if [ ! -e "$1/$ITEM" ] && [ -e "$2/$ITEM" ]; then
            if [ -d "$2/$ITEM" ]; then
                echo "dirsync: directory $2/$ITEM does not exist in $1"
            else
                echo "dirsync: File $2/$ITEM does not exist in $1"
            fi
            if [ "$DRYRUN" = "Y" ]; then
                echo "dirsync: need to rm -ri $2/$ITEM (if -f is set)"
            fi
            if [ "$CLEANUP" = "Y" ]; then
                echo "dirsync: rm -ri $2/$ITEM"
                rm -ri "$2/$ITEM"
            fi
            NDELE=$((NDELE + 1))
        fi
    done
}
# counters reported in the summary
NDIRS=0
NFILE=0
NCOPY=0
NDELE=0
NDIFF=0
# defaults for the options
TIMEWINDOW="0"
CLEANUP="N"
DRYRUN="N"
COMPARE="N"
# start empty so that values inherited from the environment are ignored
SRC=""
TRG=""
if [ "$#" -lt 2 ]; then
    echo "Usage: dirsync source target [-f | -d | -k] [-t offset]"
    echo "-f = force removal of files deleted in source (cleanup)"
    echo "-d = dry run"
    echo "-t offset = offset in seconds to apply to timestamps (for FAT)"
    echo "-k = comparison"
    exit 1
fi
# any argument that is not an option is taken as a directory:
# the first is the source and the second is the target
while [ $# -gt 0 ]; do
    if [ "$1" = "-t" ]; then
        TIMEWINDOW="$2"
        shift; shift; continue
    elif [ "$1" = "-d" ]; then
        DRYRUN="Y"
        shift; continue
    elif [ "$1" = "-f" ]; then
        CLEANUP="Y"
        shift; continue
    elif [ "$1" = "-k" ]; then
        COMPARE="Y"
        shift; continue
    else # source and target directories are stored here
        if [ -z "$SRC" ]; then
            SRC="$1"
        else
            TRG="$1"
        fi
        shift; continue
    fi
done
if [ -z "$SRC" ] || [ -z "$TRG" ]; then
    echo "Target directories not supplied"
    exit 1
fi
if [ ! -d "$SRC" ] || [ ! -d "$TRG" ]; then
    echo "Either $SRC or $TRG is not a directory, stopping"
    exit 1
fi
# the three modes are mutually exclusive
if [ "$COMPARE" = "Y" ] && [ "$CLEANUP" = "Y" ]; then
    echo "Compare (-k) not permitted with cleanup (-f)"
    exit 1
fi
if [ "$COMPARE" = "Y" ] && [ "$DRYRUN" = "Y" ]; then
    echo "Compare (-k) not permitted with dryrun (-d)"
    exit 1
fi
if [ "$DRYRUN" = "Y" ] && [ "$CLEANUP" = "Y" ]; then
    echo "Dryrun (-d) not permitted with cleanup (-f)"
    exit 1
fi
searchdir "$SRC" "$TRG"
echo ""
echo "dirsync: number of directories searched = $NDIRS"
echo "dirsync: number of files checked = $NFILE"
if [ "$DRYRUN" = "Y" ]; then
    echo "dirsync: number of files to be copied = $NCOPY"
    echo "dirsync: number of items to be deleted = $NDELE"
elif [ "$COMPARE" != "Y" ]; then
    echo "dirsync: number of files copied = $NCOPY"
fi
if [ "$CLEANUP" = "Y" ]; then
    echo "dirsync: number of items deleted = $NDELE"
fi
if [ "$COMPARE" = "Y" ]; then
    echo "dirsync: number of files that differ = $NDIFF"
fi
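A typical session might look like the sketch below: a dry run first, then the real copy once the output looks right. It assumes the script has been saved as `dirsync` somewhere on the PATH. The wrapper name `sync_to_fat`, the `DIRSYNC` variable, and the paths in the example are all hypothetical.

```bash
#!/bin/bash
# Command used to run the script; assumed to be saved as "dirsync" on the PATH.
DIRSYNC=${DIRSYNC:-dirsync}

# sync_to_fat SRC TRG  (hypothetical wrapper)
# Shows a dry run with a 2-second window first, then performs the real copy
# only after confirmation.
sync_to_fat () {
    "$DIRSYNC" "$1" "$2" -d -t 2 || return 1
    printf 'Proceed with the copy? [y/N] '
    read -r answer
    if [ "$answer" = "y" ]; then
        "$DIRSYNC" "$1" "$2" -t 2
    fi
}

# Example (placeholder paths):
#   sync_to_fat "$HOME/work" /media/usb/work
```

Running the same command with -f instead of -d afterwards would offer to remove items that no longer exist in the source.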