The Challenges of Implementing Data Synchronization
Jul 20, 2023
Data synchronization is essential for any software that needs to store data on multiple devices. In the case of {CodeStore}, the goal was to offer a platform where users could access their code snippets online and offline, across multiple devices (the service has since been discontinued). This required a complex synchronization algorithm that could identify changes made on the server and in the local file system and ensure that both were up to date.
In this article, I will cover the challenges and considerations I encountered while implementing the synchronization algorithm for {CodeStore}. I hope to provide some valuable insights and a better understanding of how data synchronization works.
What is data synchronization?
Data synchronization refers to the process of ensuring that data across multiple devices, systems, or databases is consistent and up to date. In today's interconnected world, where data is stored and accessed from various sources and devices, it is essential to keep information consistent and accurate across all platforms. The need for data synchronization arises in several scenarios, such as distributed systems, mobile devices and cloud storage, collaborative access, and offline access.
One-Way vs. Two-Way Synchronization
There are two types of synchronization algorithms: one-way and two-way synchronization. In one-way synchronization, the changes made on one device are pushed to another device, but changes made on the second device are not reflected on the first one. In contrast, two-way synchronization involves both devices updating and sending data back and forth.
Two-way synchronization is more complex than one-way synchronization due to potential conflicts that can occur when updates are made to the same data on multiple devices. Detecting and resolving these conflicts is critical to maintain data integrity and prevent data loss.
Identifying Changes
Identifying changes in data is a crucial aspect of data synchronization, particularly in two-way synchronization. It involves recognizing and tracking modifications made to data across all devices to ensure consistency. This process also plays an important role in conflict resolution, as knowing what changes were made and when helps to decide which version to keep in case of conflicting changes.
Change detection must not only determine which data was updated, but also detect the addition or removal of data, and the latter is the bigger challenge. If certain data is present on one device but not on the other, the system must be able to distinguish whether the data was created on the first device or removed from the second.
Data synchronization systems must therefore implement a mechanism to automatically detect changes and take appropriate actions. This can be done by storing metadata such as timestamps or version numbers for each piece of data, which are then compared when synchronizing the data between devices.
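As a minimal sketch of such a comparison, the following Java snippet classifies items by their version numbers. The class name VersionComparison, the Status values, and the use of plain id-to-version maps are assumptions made for illustration; they are not taken from {CodeStore}.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: compares per-item version numbers from two devices.
final class VersionComparison {

    enum Status { IN_SYNC, LOCAL_NEWER, REMOTE_NEWER, ONLY_LOCAL, ONLY_REMOTE }

    /** Classifies every item id by comparing the version numbers stored on both sides. */
    static Map<String, Status> compare(Map<String, Long> local, Map<String, Long> remote) {
        Set<String> ids = new TreeSet<>(local.keySet());
        ids.addAll(remote.keySet());

        Map<String, Status> result = new TreeMap<>();
        for (String id : ids) {
            Long localVersion = local.get(id);
            Long remoteVersion = remote.get(id);
            if (localVersion == null) {
                // Created remotely OR deleted locally: metadata alone cannot tell.
                result.put(id, Status.ONLY_REMOTE);
            } else if (remoteVersion == null) {
                // Created locally OR deleted remotely: equally ambiguous.
                result.put(id, Status.ONLY_LOCAL);
            } else {
                int cmp = Long.compare(localVersion, remoteVersion);
                result.put(id, cmp == 0 ? Status.IN_SYNC
                        : cmp > 0 ? Status.LOCAL_NEWER : Status.REMOTE_NEWER);
            }
        }
        return result;
    }
}
```

Note that ONLY_LOCAL and ONLY_REMOTE remain ambiguous here: this comparison cannot tell a creation on one side from a deletion on the other, which is exactly the problem described above.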
Transfer Delta
This concept involves only transferring the differences between two datasets, instead of sending the entire dataset. This is especially useful when a large amount of data needs to be transferred, as it minimizes network traffic and speeds up the synchronization process.
However, implementing this strategy introduces additional complexity. Each device must maintain a comprehensive record of all data changes, including creation, modification, and deletion. While tracking new or updated data is relatively straightforward using timestamps, monitoring data deletion is a greater challenge. It requires accurately identifying which data was deleted and when. Failing to do that can result in deleted data being mistaken for newly created data, potentially compromising security by exposing confidential information to unauthorized devices.
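One way to keep such a record is a change log that stores deletions explicitly as "tombstone" entries. The sketch below is an illustration under stated assumptions: the ChangeLog class and its methods are hypothetical, and it uses a local sequence counter instead of timestamps so that the example does not depend on device clocks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical change log: remembers the latest change per item, including deletions.
final class ChangeLog {

    enum Kind { UPSERT, DELETE }

    /** A single change; content is null for deletions (the tombstone). */
    record Change(long sequence, String id, Kind kind, String content) {}

    private final Map<String, Change> latestById = new HashMap<>();
    private long nextSequence = 1;

    void put(String id, String content) {
        append(id, Kind.UPSERT, content);
    }

    void delete(String id) {
        // Keep a tombstone instead of forgetting the item entirely.
        append(id, Kind.DELETE, null);
    }

    private void append(String id, Kind kind, String content) {
        latestById.put(id, new Change(nextSequence++, id, kind, content));
    }

    /** Sequence number of the most recent change, or 0 if nothing was recorded. */
    long lastSequence() {
        return nextSequence - 1;
    }

    /** Returns only the changes made after the given sequence number: the delta. */
    List<Change> changesSince(long lastSeenSequence) {
        List<Change> delta = new ArrayList<>();
        for (Change change : latestById.values()) {
            if (change.sequence() > lastSeenSequence) {
                delta.add(change);
            }
        }
        delta.sort(Comparator.comparingLong(Change::sequence));
        return delta;
    }
}
```

A device that remembers the value of lastSequence() from its previous synchronization can pass it to changesSince(...) and receives only what changed afterwards, with deletions clearly marked as such.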
Transfer Entire State
This approach requires sending all data across the devices for synchronization. It's commonly used when the system doesn't have detailed information about the historical changes that have been made.
The drawback of this strategy is that it can be much more time-consuming than the delta technique, as it involves transferring and processing a large amount of data. The advantage is that it doesn't require tracking every change. Instead, only the current state of the devices involved is needed to identify changes.
Nevertheless, this approach still requires further information to find out whether specific data was added or deleted on a device. This challenge can be resolved by preserving a snapshot of the complete state as it existed after the most recent synchronization. That way, the synchronization algorithm is able to identify any modifications on either device by comparing the current data with the snapshot of the last synchronized state. You can find a more detailed description of this solution in a blog article by Markus Unterwaditzer.
In many cases, there is no need to store the complete data in a snapshot. Typically, only specific meta information is required for comparison. This reduces memory usage and complexity, resulting in a more efficient synchronization process.
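The snapshot comparison can be sketched as follows. This is a simplified illustration, not the {CodeStore} implementation: the SnapshotDiff class is hypothetical, and it assumes that the snapshot stores one content hash per item id as its only meta information.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: derives one device's changes since the last synchronization.
final class SnapshotDiff {

    enum Change { CREATED, MODIFIED, DELETED }

    /**
     * Compares the current state of one device with the snapshot taken after
     * the last synchronization. Both maps go from item id to content hash.
     * Unchanged items do not appear in the result.
     */
    static Map<String, Change> diff(Map<String, String> snapshot, Map<String, String> current) {
        Map<String, Change> changes = new TreeMap<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String previousHash = snapshot.get(entry.getKey());
            if (previousHash == null) {
                // Not part of the last synchronized state: created since then.
                changes.put(entry.getKey(), Change.CREATED);
            } else if (!previousHash.equals(entry.getValue())) {
                changes.put(entry.getKey(), Change.MODIFIED);
            }
        }
        for (String id : snapshot.keySet()) {
            if (!current.containsKey(id)) {
                // Was synchronized before but is gone now: deleted since then.
                changes.put(id, Change.DELETED);
            }
        }
        return changes;
    }
}
```

Running diff(...) once per device against the same snapshot yields two change sets. An id that appears in both is a candidate for conflict resolution, while all other changes can simply be applied to the opposite side.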
Conflict Resolution
Conflicts arise when multiple users make simultaneous changes to the same data or when a user makes changes on multiple devices that are not immediately reflected on other devices due to network or connectivity issues. The synchronization algorithm plays a crucial role in detecting and resolving these conflicts. This process is essential to maintain data integrity and consistency.
There are several strategies that can be adopted to resolve conflicts in data synchronization:
Three-Way Merge
This strategy involves comparing the two conflicting versions of the data with the original version from before any changes were made. The algorithm can then identify the parts of the data that were changed by each user and attempt to merge them. While the merge can be performed automatically in most cases, it requires manual intervention if both users changed the same parts of the data.
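To illustrate the idea, here is a minimal three-way merge at the level of named fields rather than text lines. The ThreeWayMerge class and the representation of a data item as a map of fields are assumptions chosen to keep the example short; a line-based merge follows the same decision rules.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical field-level three-way merge; a missing key means "field absent".
final class ThreeWayMerge {

    /** Merged fields plus the names of fields that need manual resolution. */
    record Result(Map<String, String> merged, Set<String> conflicts) {}

    static Result merge(Map<String, String> base,
                        Map<String, String> local,
                        Map<String, String> remote) {
        Set<String> keys = new TreeSet<>(base.keySet());
        keys.addAll(local.keySet());
        keys.addAll(remote.keySet());

        Map<String, String> merged = new TreeMap<>();
        Set<String> conflicts = new TreeSet<>();
        for (String key : keys) {
            String b = base.get(key);
            String l = local.get(key);
            String r = remote.get(key);

            String winner;
            if (Objects.equals(l, r)) {
                winner = l;             // both sides agree (or neither changed)
            } else if (Objects.equals(b, l)) {
                winner = r;             // only the remote side changed this field
            } else if (Objects.equals(b, r)) {
                winner = l;             // only the local side changed this field
            } else {
                conflicts.add(key);     // both changed it differently
                continue;
            }
            if (winner != null) {       // null means the field was removed
                merged.put(key, winner);
            }
        }
        return new Result(merged, conflicts);
    }
}
```

Fields listed in conflicts are deliberately left out of the merged result, so the caller has to decide how to resolve them, for example by asking the user.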
Timestamp-Based Resolution / Last Write Wins
This strategy is relatively simple: the most recent change made to the data is accepted automatically. While this method is straightforward, it risks overwriting significant changes made by another user.
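In code, this strategy boils down to a single timestamp comparison. The sketch below assumes a hypothetical Version record that carries the content and its modification time; the choice to prefer the remote version when both timestamps are equal is also an assumption, made only so that the result is deterministic.

```java
import java.time.Instant;

// Hypothetical last-write-wins resolver.
final class LastWriteWins {

    /** One version of a data item together with its modification time. */
    record Version(String content, Instant modifiedAt) {}

    /** Returns the more recently modified version; ties go to the remote version. */
    static Version resolve(Version local, Version remote) {
        return local.modifiedAt().isAfter(remote.modifiedAt()) ? local : remote;
    }
}
```

The losing version is discarded without any notice, and the outcome depends on the clocks of the devices involved being reasonably in sync.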
Mutual Exclusion
This strategy prevents data conflicts from occurring in the first place by temporarily locking the access to the data while it is being updated. This ensures that only one user can modify a certain piece of information at any given time, preventing discrepancies or overwritten changes.
Mutual exclusion can be categorized into two common types of locks:
a) read-locks (shared locks): This lock allows multiple users to read a resource, but not to modify it. As soon as the lock has been acquired, all attempts to modify the data by other users are blocked until the lock is released.
b) write-locks (exclusive locks): This type of lock allows only one user to modify and read the data at any given time. All other users are not able to access the data until the lock has been released.
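Within a single Java process, both lock types are available through ReentrantReadWriteLock from the standard library. The following sketch uses it to guard a hypothetical in-memory LockedStore. It only illustrates the shared/exclusive semantics; locking data across devices would additionally require a lock service that all devices can reach, which is outside the scope of this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical in-memory store guarded by a read/write lock.
final class LockedStore {

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> data = new HashMap<>();

    /** Shared lock: many readers may hold it at once, writers have to wait. */
    String read(String id) {
        lock.readLock().lock();
        try {
            return data.get(id);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Exclusive lock: blocks all other readers and writers while held. */
    void write(String id, String content) {
        lock.writeLock().lock();
        try {
            data.put(id, content);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Releasing the lock in a finally block ensures that an exception during the read or write does not leave the data locked permanently.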
Manual Conflict Resolution
If automated strategies cannot resolve the conflict, the system may prompt users to manually resolve the conflict. This approach is the most reliable, but it also requires the most effort from the user.
Conclusion
In conclusion, data synchronization is a crucial aspect of modern distributed systems. Various strategies, ranging from conflict detection and resolution to mutual exclusion mechanisms, ensure that data remains consistent and reliable across different devices and systems. The three-way merge and manual conflict resolution methods offer solutions when conflicts arise, while mutual exclusion, with its read-locks and write-locks, proactively prevents conflicts. Understanding and implementing these strategies effectively can enhance the efficiency of data processing and improve the user experience by ensuring that data is current, accurate, and accessible when needed.
The {CodeStore} synchronization library
During my research on data synchronization, I also set out to find an existing solution that included a basic synchronization algorithm. Such a solution should be able to detect conflicts while providing flexibility in data storage and presentation. Unfortunately, my search only led me to solutions that are closely tied to specific cloud services, such as Nextcloud and Google Cloud.
Frustrated by the amount of work required to implement my own solution, I wanted to offer others the ability to effortlessly create a synchronization feature in their own applications. So I extracted the algorithm I developed for {CodeStore} into a standalone, open-source Java library. This library works independently of third-party services and gives programmers the freedom to design conflict resolution and data storage as they see fit.
I will provide a detailed tutorial about that library in a future blog post.