Writing S3 Sync in GoLang

One of the great challenges in software engineering is that you don’t know what the problem is until you’ve worked through it once. Sometimes that is an argument for designing a program top down, which is a great start, but what’s interesting is how much the language itself shapes the result through what it makes easy and what it makes hard.

My latest project was making a Go version of the Python s3cmd. The result is s3-cli, written in Go, which supports most of the basic file-oriented commands that s3cmd has. Most commands are pretty easy (“ls”, for example); the challenging one is “sync”.

The evolution of a sync command from basic approach to golang.

This is a variation of the original code that was written for the sync command. As you can see, it’s not complicated, but if you wanted to use goroutines or other ways to improve performance, you’d have to start splicing in some ugliness.

// Retrieve a list of files from the source
src_files := getFiles(src)

// Retrieve a list of files from the destination
dst_files := getFiles(dst)

// Now figure out the differences
for _, file := range src_files {
    if in_slice(file, dst_files) {
        // check to see if it's the same (size/checksum), copy if needed
    } else {
        copyFile(...)
    }
}

for _, file := range dst_files {
    if !in_slice(file, src_files) {
        // destination file exists, remove since not in source
    }
}
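
The in_slice check above scans an entire slice for every file, so the comparison gets slow as listings grow. As a minimal sketch of the same diff, assuming the listings are held in maps keyed by relative path, the version below compares by size only. FileInfo and plan are hypothetical names for illustration and are not taken from s3-cli.

```go
package main

import (
	"fmt"
	"sort"
)

// FileInfo is a hypothetical stand-in for whatever getFiles returns;
// only the size is tracked here.
type FileInfo struct {
	Size int64
}

// plan compares two listings keyed by relative path. It returns the paths
// that need copying (missing from dst, or a different size) and the paths
// that should be removed from dst (not present in src).
func plan(src, dst map[string]FileInfo) (toCopy, toRemove []string) {
	for path, s := range src {
		d, ok := dst[path]
		if !ok || d.Size != s.Size {
			toCopy = append(toCopy, path)
		}
	}
	for path := range dst {
		if _, ok := src[path]; !ok {
			toRemove = append(toRemove, path)
		}
	}
	// Map iteration order is not defined, so sort for stable results.
	sort.Strings(toCopy)
	sort.Strings(toRemove)
	return toCopy, toRemove
}

func main() {
	src := map[string]FileInfo{"a.txt": {Size: 10}, "b.txt": {Size: 20}, "c.txt": {Size: 30}}
	dst := map[string]FileInfo{"a.txt": {Size: 10}, "b.txt": {Size: 25}, "old.txt": {Size: 5}}

	toCopy, toRemove := plan(src, dst)
	fmt.Println("copy:", toCopy)
	fmt.Println("remove:", toRemove)
}
```

Each lookup in a map is constant time on average, so the whole comparison is roughly linear in the number of files rather than quadratic. A real sync would also want a checksum comparison where sizes match.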

Version 2 – using queues

This version did exist. What you quickly see is that you’re creating a work queue that describes what needs to happen. The idea was that the queue is a good place to start measuring the work to be done: how many bytes to transfer, and how to show progress and percent completion, since working out what to do is no longer tangled up with doing it.

type Action struct {
    src    string // source file name
    dst    string // destination file name
    action int    // action (COPY, REMOVE, CHECKSUM)
}

// work queue
queue := make([]Action, 0)

// Retrieve a list of files from the source
src_files := getFiles(src)

// Retrieve a list of files from the destination
dst_files := getFiles(dst)

// Now figure out the differences
for _, file := range src_files {
    dst_path := transformPath(file)
    if in_slice(dst_path, dst_files) {
        // check to see if it's the same (size/checksum), copy if needed
    } else {
        queue = append(queue, Action{src: file, dst: dst_path, action: COPY})
    }
}

for _, file := range dst_files {
    if !in_slice(file, src_files) {
        queue = append(queue, Action{dst: file, action: REMOVE})
    }
}

for _, action := range queue {
    // do work, shared out with accounting.
}
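
To make the accounting idea concrete, here is a minimal sketch of a queue that knows its total size before any work starts. It assumes each action carries a byte count; the Size field, ActionKind, totalBytes and run are hypothetical additions for illustration, and the transfer itself is replaced by a no-op.

```go
package main

import "fmt"

// ActionKind says what to do with a queued entry.
type ActionKind int

const (
	Copy ActionKind = iota
	Remove
)

// Action mirrors the struct above, plus a hypothetical Size field
// so the queue can be measured before it runs.
type Action struct {
	Src, Dst string
	Kind     ActionKind
	Size     int64 // bytes to transfer; zero for removes
}

// totalBytes adds up the bytes that the copy actions will transfer.
func totalBytes(queue []Action) int64 {
	var total int64
	for _, a := range queue {
		if a.Kind == Copy {
			total += a.Size
		}
	}
	return total
}

// run executes the queue in order and calls report after every action
// with the bytes completed so far and the overall total.
func run(queue []Action, do func(Action) error, report func(done, total int64)) error {
	total := totalBytes(queue)
	var done int64
	for _, a := range queue {
		if err := do(a); err != nil {
			return err
		}
		if a.Kind == Copy {
			done += a.Size
		}
		report(done, total)
	}
	return nil
}

func main() {
	queue := []Action{
		{Src: "a.txt", Dst: "s3://bucket/a.txt", Kind: Copy, Size: 250},
		{Src: "b.txt", Dst: "s3://bucket/b.txt", Kind: Copy, Size: 750},
		{Dst: "s3://bucket/old.txt", Kind: Remove},
	}
	noop := func(Action) error { return nil } // stand-in for the real transfer
	err := run(queue, noop, func(done, total int64) {
		fmt.Printf("%d/%d bytes (%.0f%%)\n", done, total, 100*float64(done)/float64(total))
	})
	if err != nil {
		fmt.Println("sync failed:", err)
	}
}
```

Because the whole queue exists before the first transfer, the total is known up front and a percentage is a simple division.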

GoLang Native Sync

What’s great about evolving programs is that you start out with one idea and, through a bit of cleanup and refactoring, you realize that there is a more natural way of doing things in the language. It would be “easy” to take the queue that you just built above and start farming work out to workers. What’s really interesting is realizing that the queue could just be a set of channels.

So a more natural Go program is going to look like this (note that the actual code is more nuanced).

type Action struct {
    src    string // source file name
    dst    string // destination file name
    action int    // action (COPY, REMOVE, CHECKSUM)
}

// Create some channels
chanCopy := make(chan Action, 1000)
chanRemove := make(chan Action, 1000)

// Create a few workers
for i := 0; i < 4; i++ {
    go workerCopy(chanCopy)
}
go workerRemove(chanRemove)

// Retrieve a list of files from the source
src_files := getFiles(src)

// Retrieve a list of files from the destination
dst_files := getFiles(dst)

// Now figure out the differences
for _, file := range src_files {
    dst_path := transformPath(file)
    if in_slice(dst_path, dst_files) {
        // check to see if it's the same (size/checksum), copy if needed
    } else {
        chanCopy <- Action{src: file, dst: dst_path, action: COPY}
    }
}

for _, file := range dst_files {
    if !in_slice(file, src_files) {
        chanRemove <- Action{dst: file, action: REMOVE}
    }
}

// No more work is coming; closing lets workers that range over
// the channels finish (assumes the workers are written that way).
close(chanCopy)
close(chanRemove)

What’s really great is that instead of building a queue of work, we move the queue into a channel. We only have to start a set of workers to do the copies. This is also more natural Go code.
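
One detail the outline leaves out is how the program knows the workers are done. The sketch below shows one common way to handle that, assuming the workers range over the channel and the producer closes it when there is no more work. runWorkers is a hypothetical helper, and the copy itself is replaced by a counter.

```go
package main

import (
	"fmt"
	"sync"
)

// Action is a trimmed-down version of the struct above.
type Action struct {
	Src, Dst string
}

// runWorkers starts n goroutines that drain work and call do for each
// action. It returns once work is closed and every worker has finished.
func runWorkers(n int, work <-chan Action, do func(Action)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The loop ends when the channel is closed and empty.
			for a := range work {
				do(a)
			}
		}()
	}
	wg.Wait()
}

func main() {
	chanCopy := make(chan Action, 1000)

	// Producer: stands in for the loop that compares source and destination.
	go func() {
		defer close(chanCopy) // tells the workers no more work is coming
		for i := 0; i < 10; i++ {
			chanCopy <- Action{
				Src: fmt.Sprintf("file%d", i),
				Dst: fmt.Sprintf("s3://bucket/file%d", i),
			}
		}
	}()

	// The counter is shared by all workers, so guard it with a mutex.
	var mu sync.Mutex
	copied := 0
	runWorkers(4, chanCopy, func(a Action) {
		mu.Lock()
		copied++
		mu.Unlock()
	})
	fmt.Println("copied:", copied)
}
```

Closing the channel is what lets each worker’s range loop end, and the WaitGroup keeps main from returning while copies are still in flight.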