
The {CodeStore} Synchronization Library

Nov 03, 2023

In a previous blog article, I wrote about the challenges of implementing data synchronization.

Frustrated by the amount of work required to implement a synchronization algorithm, I wanted to offer others the ability to effortlessly create a synchronization feature in their own applications. So I extracted the algorithm I developed for {CodeStore} into a standalone, open source Java library.

In this tutorial, I will walk you through the steps to use the {CodeStore} synchronization library, so you will be able to easily add a synchronization function to your own application.

Fundamentals of the Synchronization Algorithm

The synchronization algorithm synchronizes two sets of arbitrary items. The representation of the items, as well as the access to the corresponding storage, is defined by the main application.

The items can be anything: a simple string, an image, or a complex data structure.

Also, the storage of the items can be arbitrary. It can be in memory, a file system, or a relational database. Furthermore, the two item sets don't need to use the same kind of storage. For example, you could synchronize data stored on the local file system with a remote relational database, as is done in {CodeStore}. But you can also synchronize the data of two different cloud providers like Google Cloud and Nextcloud.

In addition to the item sets to synchronize (let's call them "A" and "B"), there is a third item set called "status". The status contains the information of which items were present on all systems after the most recent synchronization. This way, the algorithm can distinguish between added and deleted items.

For example, if an item is present in the item set A, but neither in B nor in the status, the item must have been created on A and therefore must be added to B. Conversely, if the item is missing from B but present in the status, it was present on B after the last synchronization and must have been removed from B since then. Thus, it has to be removed from A as well.

A quantity chart showing how new or deleted files can be detected.

You can find more details about the algorithm in the original blog article by Markus Unterwaditzer.
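To make the add/delete decision described above more tangible, here is a minimal sketch in plain Java. It is not the library's code: the class SyncPlan and its fields are names I made up for illustration, and items are reduced to their IDs.

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: classifies item IDs by comparing the sets A, B and the status. */
class SyncPlan {
    final Set<String> addToA = new TreeSet<>();
    final Set<String> addToB = new TreeSet<>();
    final Set<String> deleteFromA = new TreeSet<>();
    final Set<String> deleteFromB = new TreeSet<>();

    static SyncPlan compute(Set<String> a, Set<String> b, Set<String> status) {
        SyncPlan plan = new SyncPlan();
        Set<String> allIds = new TreeSet<>(a);
        allIds.addAll(b);

        for (String id : allIds) {
            boolean inA = a.contains(id);
            boolean inB = b.contains(id);
            boolean inStatus = status.contains(id);

            if (inA && inB) {
                continue; // present on both sides: nothing to add or delete
            }
            if (inA) {
                // Only on A: known to the status -> deleted on B, otherwise created on A.
                (inStatus ? plan.deleteFromA : plan.addToB).add(id);
            } else {
                // Only on B: known to the status -> deleted on A, otherwise created on B.
                (inStatus ? plan.deleteFromB : plan.addToA).add(id);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        SyncPlan plan = compute(
                Set.of("img1.jpg", "img2.jpg", "img3.jpg"),   // A
                Set.of("img1.jpg", "img4.jpg", "img5.jpg"),   // B
                Set.of("img1.jpg", "img3.jpg", "img5.jpg"));  // status
        System.out.println("add to A:      " + plan.addToA);      // [img4.jpg]
        System.out.println("add to B:      " + plan.addToB);      // [img2.jpg]
        System.out.println("delete from A: " + plan.deleteFromA); // [img3.jpg]
        System.out.println("delete from B: " + plan.deleteFromB); // [img5.jpg]
    }
}
```

IDs that appear only in the status were deleted on both sides, so the sketch ignores them; a real implementation would simply drop them from the status.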

The Sample Application

I have three cats and therefore a lot of pictures of them in a local folder and on a USB stick. I want to implement an application which synchronizes both folders so that my local folder and the USB stick are up-to-date.

Let's create a simple Maven application right away and add the {CodeStore} synchronization library as a dependency.

<?xml version="1.0" encoding="UTF-8"?>
<project>
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>synchronization</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>cloud.codestore</groupId>
            <artifactId>synchronization</artifactId>
            <version>1.1.0</version>
        </dependency>
    </dependencies>
</project>

Implement ItemSets and Status

As previously mentioned, the synchronization library uses two sets of items and a status for the synchronization. Thus, we need to implement the corresponding interfaces ItemSet and Status. For convenience, the library provides several base classes for this, which we'll use.

Before implementing the mentioned interfaces, we need to answer a few questions first:

  1. What is an item?
  2. How do we represent an item?
  3. How do we identify an item?
  4. Is an item immutable or mutable?

In this case, an item is a file on the file system, represented by the corresponding path. Assuming that each file has a unique name, a file can be identified by its name. Additionally, we assume that the files are immutable in this first, simple implementation. We'll get to the more complex case of mutable files later.

ItemSet

As our items are files, let's create an ItemSet called FileSet. Also, let's use the AbstractImmutableItemSet base class as it provides some basic functionality. In the first step, we collect the names of all files in the specified directory and pass them to the base class.

public class FileSet extends AbstractImmutableItemSet<Path> {

    private final Path directory;

    public FileSet(Path directory) throws IOException {
        super(getFileNames(directory));
        this.directory = directory;
    }

    @Override
    public Path getItem(String fileName) {
        return directory.resolve(fileName);
    }

    @Override
    public void addItem(String fileName, Path sourceFile) throws Exception {
        Path targetFile = directory.resolve(fileName);
        Files.copy(sourceFile, targetFile);
    }

    @Override
    public void delete(String fileName) throws Exception {
        Path file = directory.resolve(fileName);
        Files.deleteIfExists(file);
    }

    private static Set<String> getFileNames(Path directory) throws IOException {
        try (var files = Files.list(directory)) {
            return files.filter(Files::isRegularFile)
                        .map(Path::getFileName)
                        .map(Path::toString)
                        .collect(Collectors.toSet());
        }
    }
}

Status

The status can be stored in any arbitrary format which is specified by the main application. But since it only contains the file names, we can save it as a simple CSV file. Luckily, the {CodeStore} synchronization library provides a corresponding helper class called CsvImmutableItemStatus. So, there is no work to do for us here.
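To give an idea of how little such a status needs to hold, here is a rough sketch of a store that keeps one item ID per line. It is a hypothetical stand-in written with the JDK only; the class name SimpleFileStatus and its methods are my own, and the actual file format and API of CsvImmutableItemStatus may well differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical status store: one item ID per line. Not the library's implementation. */
class SimpleFileStatus {
    private final Path file;
    private final Set<String> itemIds = new TreeSet<>();

    private SimpleFileStatus(Path file) {
        this.file = file;
    }

    /** Loads the IDs from the file, or starts empty if the file doesn't exist yet. */
    static SimpleFileStatus load(Path file) throws IOException {
        SimpleFileStatus status = new SimpleFileStatus(file);
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file)) {
                if (!line.isBlank()) {
                    status.itemIds.add(line.strip());
                }
            }
        }
        return status;
    }

    boolean contains(String itemId) { return itemIds.contains(itemId); }

    void put(String itemId) { itemIds.add(itemId); }

    void remove(String itemId) { itemIds.remove(itemId); }

    /** Overwrites the file with the current set of IDs. */
    void save() throws IOException {
        Files.write(file, itemIds);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("status", ".csv");
        SimpleFileStatus status = load(file);
        status.put("img1.jpg");
        status.put("img2.jpg");
        status.save();
        System.out.println(load(file).contains("img2.jpg")); // true
        Files.deleteIfExists(file);
    }
}
```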

Synchronize the Items

We are now ready to implement the main class and run the synchronization.

Note that we reuse the FileSet class for both folders. But if your files are saved on different systems, you will most likely have to implement two different ItemSets to access the corresponding files.

Also note that we need to explicitly save the status after the synchronization has finished. In this simple case, we don't use multithreading, so the synchronize() call returns after all files have been synchronized.

public class FileSynchronization {
    public static void main(String[] args) throws IOException {
        ItemSet<Path> home = new FileSet(Path.of("home", "images"));
        ItemSet<Path> usb = new FileSet(Path.of("usb", "images"));
        Status status = CsvImmutableItemStatus.loadSilently(Path.of("home", "status.csv"));

        Synchronization<Path> synchronization = new ImmutableItemSynchronization<>(home, usb, status);
        synchronization.synchronize();
        status.save();
    }
}

Logging the Synchronization Progress

The synchronization algorithm magically does its job, but we don't get any feedback on whether the synchronization was successful. For this, we have to set a ProgressListener, which is called whenever the synchronization of an item starts, finishes successfully, or fails.

public class FileSyncProgressListener implements ProgressListener {
    @Override
    public void numberOfItems(int numberOfItems) {
        System.out.println("Synchronizing " + numberOfItems + " files ... ");
    }

    @Override
    public void synchronizationStarted(String fileName) {
        System.out.print("Synchronizing " + fileName + " ... ");
    }

    @Override
    public void synchronizationFinished(String fileName) {
        System.out.println("finished successfully.");
    }

    @Override
    public void synchronizationFailed(String fileName, Throwable exception) {
        System.err.println("failed with error: " + exception.getMessage());
        exception.printStackTrace();
    }
}

We then register the listener on the Synchronization object before starting the synchronization:

Synchronization<Path> synchronization = new ImmutableItemSynchronization<>(home, usb, status);
synchronization.setProgressListener(new FileSyncProgressListener());
synchronization.synchronize();

This leads to the following output on the console:

Synchronizing 5 files ...
Synchronizing img1.jpg ... finished successfully.
Synchronizing img2.jpg ... finished successfully.
Synchronizing img3.jpg ... finished successfully.
Synchronizing img4.jpg ... finished successfully.
Synchronizing img5.jpg ... finished successfully.

Synchronizing Mutable Items

So far, we assumed that the images we want to synchronize will never be modified. But we want to have the freedom to modify images and have the changes available in the other folder. So, we need to find a way to detect changes.

The {CodeStore} synchronization library can do that by using etags. An etag is an arbitrary string that identifies a specific version of a file. Since etags are defined and interpreted by the main application, an etag can be anything that helps the application identify changes. It may be a simple hash of the file, or a timestamp.

In our case, a simple MD5 hash of each file is enough to check whether the file has changed. That means we have to change the ItemSet and Status implementations. Again, the {CodeStore} synchronization library provides corresponding base classes for this.

Note that instead of passing the file names to the AbstractMutableItemSet base class, we pass a map which maps the file names to the corresponding etag/hash. For hashing the file, I use the Apache Commons Codec library here.

public class FileSet extends AbstractMutableItemSet<Path> {

    private final Path directory;

    public FileSet(Path directory) throws IOException {
        super(getFileNamesAndHashes(directory));
        this.directory = directory;
    }

    @Override
    public Path getItem(String fileName) {
        return directory.resolve(fileName);
    }

    @Override
    public void addItem(String fileName, Path sourceFile) throws Exception {
        Path targetFile = directory.resolve(fileName);
        Files.copy(sourceFile, targetFile);
    }

    @Override
    public void delete(String fileName) throws Exception {
        Path file = directory.resolve(fileName);
        Files.deleteIfExists(file);
    }

    @Override
    public void updateItem(String fileName, Path sourceFile) throws Exception {
        Path targetFile = directory.resolve(fileName);
        Files.copy(sourceFile, targetFile, StandardCopyOption.REPLACE_EXISTING);
    }

    private static Map<String, String> getFileNamesAndHashes(Path directory) throws IOException {
        try (var files = Files.list(directory)) {
            return files.filter(Files::isRegularFile)
                        .collect(Collectors.toMap(
                                file -> file.getFileName().toString(),
                                FileSet::md5
                        ));
        }
    }

    private static String md5(Path file) {
        // Close the stream after hashing so that no file handle is leaked.
        try (var stream = Files.newInputStream(file)) {
            return DigestUtils.md5Hex(stream);
        } catch (IOException exception) {
            throw new RuntimeException(exception);
        }
    }
}
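
If you'd rather not pull in Apache Commons Codec just for the hash, the etag could also be computed with the JDK alone. The following is a sketch under that assumption; FileEtags is a name I picked for illustration and is not part of the synchronization library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative helper: computes an MD5 etag using only JDK classes. */
class FileEtags {

    static String md5Hex(Path file) throws IOException {
        MessageDigest digest = newMd5();
        // Stream the file in chunks so that large images don't have to fit in memory.
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest());
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException exception) {
            // MD5 is one of the algorithms every Java platform has to provide.
            throw new IllegalStateException(exception);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("etag", ".txt");
        Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(md5Hex(file)); // 5d41402abc4b2a76b9719d911017c592
        Files.deleteIfExists(file);
    }
}
```

MD5 is fine here because the hash only serves as a change indicator, not as a security measure.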

Our main class doesn't change much. We simply update the Status and Synchronization objects to their "mutable" counterparts.

public class FileSynchronization {
    public static void main(String[] args) throws IOException {
        ItemSet<Path> home = new FileSet(Path.of("home", "images"));
        ItemSet<Path> usb = new FileSet(Path.of("usb", "images"));
        Status status = CsvMutableItemStatus.loadSilently(Path.of("home", "status.csv"));

        Synchronization<Path> synchronization = new MutableItemSynchronization<>(home, usb, status);
        synchronization.setProgressListener(new FileSyncProgressListener());
        synchronization.synchronize();
        status.save();
    }
}

Conflict Resolution

Synchronizing mutable files is more complex than synchronizing immutable ones because a file may be changed in both folders, which leads to a conflict. The {CodeStore} synchronization library is able to detect those changes based on the etag, but it cannot decide which file to copy to the opposite directory. This decision is left to the main application, which means we have to implement a ConflictResolver.

The ConflictResolver is free to use any strategy to resolve the conflict. For example, the {CodeStore} application uses the timestamp of a code snippet as the etag and is thus able to pick the most recent snippet. In this case, we simply let the user decide whether to use the local file or the file on the USB stick. In a more user-friendly, real-world application, I would most likely display a dialog showing the corresponding images. But here, a simple command line query does the trick.
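
As a rough illustration of how such a detection can work, the sketch below compares the etag of a file on A, on B, and in the status. The names ChangeDetection and Change are hypothetical, and treating identical etags on both sides as "unchanged" is my assumption; the library may handle the details differently.

```java
import java.util.Objects;

/** Illustrative only: classifies one item by comparing its three etags. */
class ChangeDetection {

    enum Change { UNCHANGED, CHANGED_ON_A, CHANGED_ON_B, CONFLICT }

    static Change classify(String etagA, String etagB, String statusEtag) {
        if (Objects.equals(etagA, etagB)) {
            return Change.UNCHANGED; // both sides already hold the same version
        }
        boolean changedOnA = !Objects.equals(etagA, statusEtag);
        boolean changedOnB = !Objects.equals(etagB, statusEtag);
        if (changedOnA && changedOnB) {
            return Change.CONFLICT; // both sides moved away from the last synchronized version
        }
        return changedOnA ? Change.CHANGED_ON_A : Change.CHANGED_ON_B;
    }

    public static void main(String[] args) {
        System.out.println(classify("v1", "v1", "v1")); // UNCHANGED
        System.out.println(classify("v2", "v1", "v1")); // CHANGED_ON_A
        System.out.println(classify("v1", "v2", "v1")); // CHANGED_ON_B
        System.out.println(classify("v2", "v3", "v1")); // CONFLICT
    }
}
```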

public class FileSyncConflictResolver extends ConflictResolver<Path> {
    @Override
    public void resolve(String fileName, String localFileHash, String usbFileHash) throws Exception {
        System.out.println("There is a conflict for file " + fileName);
        System.out.println("Would you like to keep the local file? [y/n]");

        String input = new Scanner(System.in).next();
        switch (input.toLowerCase()) {
            case "y" -> applyItemA(); // overwrite the file on the USB stick
            case "n" -> applyItemB(); // overwrite the local file
            default -> throw new UnresolvedConflictException();
        }
    }
}

Note that the synchronization library doesn't know what systems you want to synchronize, so it refers to them simply as "A" and "B". In our case, "A" is the local folder and "B" is the USB stick. That means that calling applyItemA() copies the image from the local folder to the USB stick.

Note that if the ConflictResolver is not able to resolve the conflict for some reason, it must throw an UnresolvedConflictException as shown above.
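
For comparison, the timestamp-based strategy mentioned earlier needs no user interaction at all. The following sketch shows only the decision itself and assumes that both etags are ISO-8601 timestamps; NewestWins and Winner are hypothetical names, not library types.

```java
import java.time.Instant;

/** Illustrative only: picks the side whose timestamp etag is more recent. */
class NewestWins {

    enum Winner { A, B }

    /** Assumes both etags are ISO-8601 instants such as "2023-11-03T10:15:30Z". */
    static Winner pick(String etagA, String etagB) {
        Instant modifiedA = Instant.parse(etagA);
        Instant modifiedB = Instant.parse(etagB);
        // Ties go to A here; that choice is arbitrary.
        return modifiedB.isAfter(modifiedA) ? Winner.B : Winner.A;
    }

    public static void main(String[] args) {
        System.out.println(pick("2023-11-03T10:15:30Z", "2023-11-02T08:00:00Z")); // A
        System.out.println(pick("2023-11-01T09:00:00Z", "2023-11-02T08:00:00Z")); // B
    }
}
```

Inside a ConflictResolver, the result would then translate into a call to applyItemA() or applyItemB(). Instant.parse throws a DateTimeParseException for malformed input, which would be a natural place to give up and throw an UnresolvedConflictException instead.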

Now it's time to test our implementation. So I edited one of the images in both folders to produce a conflict and set the ConflictResolver on the Synchronization object.

Synchronization<Path> synchronization = new MutableItemSynchronization<>(home, usb, status);
synchronization.setProgressListener(new FileSyncProgressListener());
synchronization.setConflictResolver(new FileSyncConflictResolver());
synchronization.synchronize();

When we now execute the synchronization, we get the following output:

Synchronizing 5 files ... 
Synchronizing img1.jpg ... finished successfully.
Synchronizing img2.jpg ... finished successfully.
Synchronizing img3.jpg ... finished successfully.
Synchronizing img4.jpg ... finished successfully.
Synchronizing img5.jpg ... There is a conflict for file img5.jpg
Would you like to keep the local file? [y/n]
y
finished successfully.

Cancellation

The synchronization process can be quite time-consuming when synchronizing a large number of files, so we want to be able to cancel it. For this purpose, the {CodeStore} synchronization library offers the cancel() method, which has to be called from another thread, such as the GUI thread of the application.

Note that cancelling does not interrupt the currently processed file. It only prevents the following files from being processed.

The synchronize() method waits for the current file to be processed and returns afterwards. To check whether the synchronization was executed completely or was canceled, we can use the isCanceled() method.

Synchronization<Path> synchronization = new ImmutableItemSynchronization<>(home, usb, status);
synchronization.synchronize();
System.out.println("The synchronization " + (synchronization.isCanceled() ? "was canceled." : "finished successfully."));
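
The behavior described above, where the current file is always finished and only the following files are skipped, is often called cooperative cancellation. Here is a minimal, library-independent sketch of the idea; CancellableLoop is a hypothetical class, and for the sake of a deterministic example cancel() is called from within the handler rather than from a second thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Illustrative only: a loop that checks a cancel flag between items. */
class CancellableLoop {
    private final AtomicBoolean canceled = new AtomicBoolean(false);

    /** Safe to call from any thread. */
    void cancel() { canceled.set(true); }

    boolean isCanceled() { return canceled.get(); }

    /** Processes the items in order and returns those that were handled. */
    List<String> run(List<String> items, Consumer<String> handler) {
        List<String> processed = new ArrayList<>();
        for (String item : items) {
            if (canceled.get()) {
                break; // the flag is only checked between items
            }
            handler.accept(item); // an item that has started is never interrupted
            processed.add(item);
        }
        return processed;
    }

    public static void main(String[] args) {
        CancellableLoop loop = new CancellableLoop();
        List<String> done = loop.run(
                List.of("img1.jpg", "img2.jpg", "img3.jpg"),
                item -> { if (item.equals("img2.jpg")) loop.cancel(); });
        System.out.println(done + " canceled=" + loop.isCanceled());
        // [img1.jpg, img2.jpg] canceled=true
    }
}
```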

Concurrent Processing

By default, all files are processed one after the other. The synchronize() method returns as soon as all files were processed or the synchronization was canceled.

In our simple folder synchronization, this is no problem. We only have a few images, and modern flash memory is quite fast. But you will surely run into a situation where you need to synchronize thousands of files, or you need to access a remote system over a potentially slow connection. For this case, the {CodeStore} synchronization library offers the possibility to use multiple threads, which can improve the performance of the synchronization significantly.

Let's assume we have a lot of files to synchronize and want to use multiple threads to speed up the synchronization. We simply tell the synchronization library to use ten threads by calling the setThreadCount(int) method. The library then creates a pool of ten threads and uses it for the synchronization. This means that up to ten files are processed at the same time.

Synchronization<Path> synchronization = new ImmutableItemSynchronization<>(home, usb, status);
synchronization.setThreadCount(10);
synchronization.synchronize();

It could be that simple, but in fact, it's not. When using concurrent processing, you need to make sure that all your ItemSet, Status, ProgressListener and ConflictResolver implementations are thread-safe!
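
What "thread-safe" means in practice depends on your implementation, but a typical pitfall is shared mutable state such as a plain HashSet that several threads write to at once. The sketch below is independent of the library and uses a hypothetical class, ThreadSafeTracking, to show one common remedy: a concurrent collection from java.util.concurrent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Illustrative only: several threads record finished items in a thread-safe set. */
class ThreadSafeTracking {

    static Set<String> processAll(List<String> items, int threadCount) throws InterruptedException {
        // A plain HashSet could lose entries or get corrupted under concurrent writes.
        Set<String> finished = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (String item : items) {
            pool.execute(() -> finished.add(item));
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow();
            throw new IllegalStateException("Timed out waiting for the workers");
        }
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> items = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            items.add("img" + i + ".jpg");
        }
        System.out.println(processAll(items, 10).size()); // 1000
    }
}
```

The same reasoning applies to a listener that writes to the console: output from different threads may interleave, so a line such as "Synchronizing img1.jpg ... finished successfully." is better printed in a single call.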

Summary

Data synchronization is essential for any software that needs to store data on multiple devices or systems. It requires a synchronization algorithm that is able to identify changes and to ensure that all systems are up-to-date.

In this tutorial, I showed you how you can implement a simple synchronization application using the {CodeStore} synchronization library. In the simplest case, the files are unmodifiable which requires you to only implement a single class. Furthermore, I covered the more complex case of modifiable files and conflict handling.

If you find any bugs or feel that a useful feature is missing, please feel free to create a pull request!