Thinking Out Loud - File copy tool arguments

A side project I've been working on for some time now is a decentralized, distributed file copy tool in the spirit of the typical cp command you can find on just about any Unix-style system out there.

When I start a project like this, I tend to concentrate on what the most difficult problem is -- once I solve the most difficult problem, I can finish designing the rest of the system without too much effort. However, the file copy tool caught me off guard a bit with how complex the argument handling turned out to be.

Of course, the code to distribute the workload across hundreds of cores (proven to scale past 100,000 cores) was the most difficult part. However, it's basically a solved problem as far as this application goes. Take a look at libcircle to see how this is done.

Since the hardest problem in the system is basically solved, we can move on to the second hardest problem: creating an easy-to-use frontend for users to chunk up and copy files. I chose the familiar POSIX-style interface since most people are already trained to use it.

A subtle problem in creating this frontend is dealing with the many combinations of directories and files that the user may use for input. To keep things simple for this blog post, I'm going to ignore everything but simple files and directories. Maybe in a later post, I'll cover how block and character devices, local domain sockets, named pipes, and symbolic links should be handled.

To give an idea of what we're dealing with, here's the usage message for my tool. I've named it dcp, short for "distributed copy program" (see dcp).

For all input, we need to know the base name of the directory that we'll be writing files into. We'll also need the path of the destination and a list of source paths. There are also a few "impossible" situations, like trying to copy a directory into a file. We need to prune out these situations and present a nice error message to the user.

To figure out what needs to be pruned out, we first need to know what we have. This is a bit tricky because, depending on what the user is trying to do, the destination may not exist yet. We're not in the business of creating new directories, though, so it's safe to say that the destination should be a single file or directory if the source is a single file or directory. An error condition still pops up when the user asks for multiple source paths to be copied into a file, since that doesn't make sense.

Here's some pseudocode to demonstrate this concept. First, we check to see if something exists on disk at the destination path. If it does, we remember that state for later.
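As a minimal sketch of that first check, written in Python for readability rather than taken from dcp itself: the function name destination_state and its "dir" / "file" / None return values are made up for this illustration. Since this post ignores everything but simple files and directories, anything that isn't a directory is treated as a file.

```python
import os
import stat


def destination_state(dest_path):
    """Return "dir", "file", or None if nothing exists at dest_path.

    Hypothetical helper for illustration; not part of dcp.
    """
    try:
        mode = os.stat(dest_path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        # Nothing on disk at this path. We remember that and decide
        # later what the destination ought to be.
        return None
    # Only directories and "everything else" are distinguished here,
    # so any non-directory is reported as a plain file.
    return "dir" if stat.S_ISDIR(mode) else "file"
```

The result is kept around so that the later steps never have to touch the disk again to reason about the destination.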


If the destination path doesn't exist, we check to see what the source paths are. If recursion is turned on, we'll have a file as the destination if the source is a single file; otherwise, we need the destination to be a directory.
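Roughly, that decision could look like the following Python sketch. The names expected_destination, dest_state, and source_kinds are hypothetical, and the "file" / "dir" strings are just a convenient stand-in for whatever the real tool records. How exactly the recursion flag gates this branch isn't spelled out above, so the sketch leaves the flag out and only looks at the sources.

```python
def expected_destination(dest_state, source_kinds):
    """Decide whether the last argument should be a file or a directory.

    dest_state   -- "dir", "file", or None if nothing exists on disk yet
    source_kinds -- list of "file" / "dir", one entry per source path

    Hypothetical helper for illustration; not part of dcp.
    """
    if dest_state is not None:
        # Something already exists at the destination, so use what we found.
        return dest_state
    if source_kinds == ["file"]:
        # A single source file and no existing destination: file-to-file copy.
        return "file"
    # Anything else (several sources, or a directory) needs a directory.
    return "dir"
```

For example, a single source file with a missing destination yields "file", while two source files with a missing destination yield "dir".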

Now that we know what the last argument should be, we can reason about what the end result should be. This makes it trivial to prune out the impossible situations. Writing down all potential input combinations yields the following impossible conditions: copying one or more directories into a file, copying many files into a file, and copying a mix of directories and files into a file. Encountering any of these inputs will lead to an error condition.

Take note that all of the impossible conditions we've listed have the property of the destination being a file. We can take advantage of this by catching many of the error conditions in the logic that determines if we're in the mode of copying a single file into another file.
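Here is one way that observation might translate into code, again as a Python sketch with invented names (CopyArgumentError, copy_mode) rather than anything lifted from dcp. Every impossible condition is rejected inside the branch where the destination is a file, and the only input that survives that branch is a single source file.

```python
class CopyArgumentError(ValueError):
    """Raised for combinations of sources and destination that make no sense."""


def copy_mode(source_kinds, dest_kind):
    """Return "file-to-file" or "into-directory", or raise CopyArgumentError.

    source_kinds -- list of "file" / "dir", one entry per source path
    dest_kind    -- "file" or "dir", what the destination is (or should be)

    Hypothetical helper for illustration; not part of dcp.
    """
    if not source_kinds:
        raise CopyArgumentError("no source paths given")
    if dest_kind == "file":
        if len(source_kinds) > 1:
            # Many files, many directories, or a mix of both.
            raise CopyArgumentError("cannot copy multiple sources into a file")
        if source_kinds[0] == "dir":
            raise CopyArgumentError("cannot copy a directory into a file")
        return "file-to-file"
    # Everything that reaches this point is copied into a directory.
    return "into-directory"
```

Because the checks live in one place, the error messages shown to the user can be kept specific to each situation.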

Now that we've handled everything but copying source files into directories, we can easily handle the rest of the potential inputs.
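For the remaining case, a small Python sketch of how each source could be mapped to a path inside the destination directory. The name plan_copies is hypothetical, and the naming rule (destination directory plus the source's base name) is an assumption borrowed from how cp usually behaves, not a statement about what dcp does.

```python
import posixpath


def plan_copies(source_paths, dest_dir):
    """Pair each source path with the path it would get inside dest_dir.

    Hypothetical helper for illustration; not part of dcp.
    """
    plan = []
    for src in source_paths:
        # normpath strips a trailing slash so "c/d/" still has base name "d".
        name = posixpath.basename(posixpath.normpath(src))
        plan.append((src, posixpath.join(dest_dir, name)))
    return plan
```

Calling plan_copies(["a/b.txt", "c/d/"], "/dst") pairs "a/b.txt" with "/dst/b.txt" and "c/d/" with "/dst/d".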

So there we have it, an algorithm to handle user input of directories and files for a file copy tool. Stay tuned for the initial stable release of dcp.