Rob Thompson, rob.sun3.org
However, synchronizing huge file systems across networks can cause some issues. It will work every time, but the initial "receiving file list..." operation can sometimes take an entire day to complete if you have a file system with 3.2 million small files on either side, like one that I work with.
The RSYNC daemon/client negotiates a list of files to "deal with" when you first start the operation. This initial negotiation runs at a constant pace, in my case around 10,000 files/minute. Not at all bad, but with 3,200,000 files, this equates to 5.3 hours. If you also use the --delete option, this is doubled, since for some reason it does the same operation on your local receiving side after the remote file list is gathered.
Running several RSYNC instances in parallel does increase the CPU and I/O load on both your sending and receiving sides, but I've been able to run ~25 parallel instances without noticeably degrading the rest of the system or slowing down the other RSYNC instances.
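The arithmetic behind those numbers can be checked with a few lines of shell. This is only a sketch: estimate_hours is a hypothetical helper name, and the rate is simply the 10,000 files/minute I observed, not anything RSYNC guarantees.

```shell
#!/bin/sh
# estimate_hours FILES RATE_PER_MINUTE [PASSES]
# Hypothetical helper: hours needed to build the file list at a constant rate.
# PASSES defaults to 1; use 2 to model the extra local pass seen with --delete.
estimate_hours() {
    awk -v files="$1" -v rate="$2" -v passes="${3:-1}" \
        'BEGIN { printf "%.1f\n", passes * files / rate / 60 }'
}

estimate_hours 3200000 10000      # prints 5.3
estimate_hours 3200000 10000 2    # prints 10.7
```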
The key is to use the --include and --exclude command line switches to create selection criteria.
drwxr-xr-x 2 root root 179 Jul 19 16:22 directory_a
drwxr-xr-x 2 root root 179 Aug 12 00:08 directory_b
#!/bin/bash
# One rsync per top-level directory, each in the background with its own log.
rsync -av --include="/directory_a*" --exclude="/*" --progress remote::/ /localdir/ > /tmp/myoutputa.log &
rsync -av --include="/directory_b*" --exclude="/*" --progress remote::/ /localdir/ > /tmp/myoutputb.log &

#!/bin/bash
# The single-instance equivalent, for comparison.
rsync -av --progress remote::/ /localdir/ > /tmp/myoutput.log &
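With more than a couple of directories, the same pattern could be wrapped in a loop. The following is a minimal sketch rather than a tested production script: REMOTE, DEST, LOGDIR, RSYNC, sync_one and sync_dirs are all names made up for illustration, and the rsync options are just the ones from the scripts above. It assumes directory names without spaces.

```shell
#!/bin/sh
# All names below (REMOTE, DEST, LOGDIR, RSYNC, sync_one, sync_dirs) are
# hypothetical; the rsync options are the ones used in the scripts above.
REMOTE="${REMOTE:-remote::/}"
DEST="${DEST:-/localdir/}"
LOGDIR="${LOGDIR:-/tmp}"
RSYNC="${RSYNC:-rsync}"

# Sync one top-level directory, logging to its own file.
sync_one() {
    "$RSYNC" -av --include="/$1*" --exclude="/*" --progress \
        "$REMOTE" "$DEST" > "$LOGDIR/myoutput_$1.log"
}

# Start one background rsync per directory name given, then wait for all.
sync_dirs() {
    for d in "$@"; do
        sync_one "$d" &
    done
    wait
}

# Example: sync_dirs directory_a directory_b
```

Unlike the scripts above, this version waits for every instance to finish before returning, which makes it easier to chain a follow-up step after the sync.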
At some point, I'd like to see the RSYNC daemon have an option to automatically split up the workload into manageable chunks at the expense of your CPU and I/O. However, I don't know exactly how this would be accomplished, since RSYNC would somehow need to know what the directories looked like before it started its process.
One option I could think of would be to add a feature to RSYNC to gather, save and use a "statistics file". It could generate this statistics file on a remote directory tree, saving general information like how many files are in each part of the tree. Then it could read this file to decide how to split up the file gathering tasks in order to optimize time. This file could then be reused during subsequent operations. Of course, this would only work if data were not constantly moving around between those directories and they stayed relatively proportional with respect to each other over time. This could also be dealt with entirely outside of the RSYNC process, perhaps in a script that would gather statistics and fork RSYNC processes based on what it finds.
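The "outside of RSYNC" variant might look something like the sketch below. It is not an RSYNC feature, and it makes several assumptions: the tree can be read as a local path (for example by running the script on the sending host), directory names contain no whitespace, and the helper names count_files and plan_groups are invented here. The grouping rule (largest directory first, always into the currently lightest group) is just one simple way to balance the work.

```shell
#!/bin/sh
# count_files ROOT
# Hypothetical helper: print "<file count> <name>" for each top-level
# directory under ROOT. This output plays the role of the statistics file.
count_files() {
    for path in "$1"/*/; do
        [ -d "$path" ] || continue
        name=$(basename "$path")
        count=$(find "$path" -type f | awk 'END { print NR }')
        printf '%s %s\n' "$count" "$name"
    done
}

# plan_groups N
# Hypothetical helper: read "<count> <name>" lines on stdin and print
# "<group> <name>", assigning directories largest-first to whichever of the
# N groups currently holds the fewest files.
plan_groups() {
    sort -rn | awk -v n="$1" '
        {
            best = 1
            for (g = 2; g <= n; g++)
                if (load[g] + 0 < load[best] + 0) best = g
            load[best] += $1
            print best, $2
        }'
}

# Example: count_files /some/tree > stats.txt; plan_groups 4 < stats.txt
```

Each group could then be handed to its own RSYNC instance, and the saved counts reused on later runs as long as the directories stay roughly proportional to each other.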
Well, until then, the above method works great for me.
08/18/2007 04:34pm