How I Started Listing Corrupted Snapshots in OpenSearch (DEV Tools, Curl, and Python)
When I first started working with OpenSearch snapshots, everything seemed straightforward. I could create snapshots, restore them, and automate daily backups to S3 without much trouble. But one day, I encountered an unexpected error while trying to restore a snapshot. It said the snapshot was missing or corrupted. That moment marked the beginning of a deep dive into how OpenSearch handles snapshots, why they get corrupted, and how to detect them efficiently.
In this tutorial, I’ll walk you through how I learned to identify corrupted snapshots using OpenSearch DEV Tools, Curl commands, and finally Python automation. This guide is beginner-friendly, so even if you’re just getting started with OpenSearch, you can follow along easily.
Understanding Snapshots in OpenSearch
Before we start detecting corrupted snapshots, it’s important to understand what a snapshot actually is.
In OpenSearch, a snapshot is a backup of your cluster’s indices and state. It’s usually stored in a repository such as an Amazon S3 bucket, shared filesystem, or any custom location you define. These snapshots are used for disaster recovery and data migration.
A typical snapshot repository structure contains metadata files for each snapshot, index data, and shard information. Every snapshot you take is incremental, meaning only the changes since the last snapshot are stored.
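For context, here is a minimal sketch of how a repository and a snapshot could be created with Python's requests library. The endpoint, bucket, and snapshot names below are placeholders, and an S3-backed repository also needs the repository-s3 plugin (or your managed service's equivalent) configured on the cluster.

import requests

os_endpoint = "https://localhost:9200/"  # placeholder endpoint for your cluster

# Register an S3-backed repository (bucket and base_path are placeholders)
requests.put(
    f"{os_endpoint}_snapshot/my-snapshot-repo",
    json={"type": "s3", "settings": {"bucket": "my-backup-bucket", "base_path": "opensearch"}},
    verify=False,
)

# Take a snapshot of all indices and wait for it to finish
requests.put(
    f"{os_endpoint}_snapshot/my-snapshot-repo/snapshot_2024_01_01?wait_for_completion=true",
    verify=False,
)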
Why Snapshots Get Corrupted
Corrupted snapshots are more common than you might think. In my experience, they usually occur due to one of the following reasons:
1. Missing Metadata Files
Sometimes, OpenSearch cannot find the snapshot metadata file. This can happen if files are manually deleted from the repository or if there’s an issue with S3 synchronization.
2. Interrupted Snapshot Process
If the cluster shuts down or a network failure occurs while taking a snapshot, OpenSearch might leave an incomplete or partial snapshot.
3. Repository Inconsistency
In cases where multiple clusters share the same repository without proper isolation, the metadata can get mixed up, leading to missing references.
4. S3 or Storage Errors
Occasionally, S3 or another storage service might return transient errors during upload, causing incomplete uploads that lead to corrupt snapshots.
Detecting Corrupted Snapshots in OpenSearch
When I first faced the issue, I tried checking snapshot statuses through the Snapshot Management page in OpenSearch Dashboards. However, everything looked fine there. The snapshot status showed SUCCESS, even though the snapshot failed to restore.
That’s when I learned that just because a snapshot’s status is “SUCCESS” doesn’t mean it’s complete or healthy. You need to dive deeper into the snapshot’s /_status API to verify its integrity.
Let’s explore how to do that using different methods.
Checking Snapshot Status Using OpenSearch DEV Tools
The simplest way to start is by using the DEV Tools console in OpenSearch Dashboards.
Step 1: List All Snapshots
Use the following command to list all snapshots in a given repository:
GET _cat/snapshots/my-snapshot-repo?v
This will display all snapshots along with their status (SUCCESS, PARTIAL, or FAILED).
Step 2: Check Snapshot Status
If you want to check the detailed status of a specific snapshot, you can use:
GET _snapshot/my-snapshot-repo/snapshot_name/_status
If the snapshot is missing or corrupted, you’ll see an error similar to this:
{
  "error": {
    "type": "snapshot_missing_exception",
    "reason": "[my-snapshot-repo:snapshot_name] is missing",
    "caused_by": {
      "type": "no_such_file_exception",
      "reason": "Blob object not found: The specified key does not exist."
    }
  },
  "status": 404
}
This message means OpenSearch cannot locate the metadata file for that snapshot. It’s either missing or corrupted.
Detecting Corrupted Snapshots Using Curl
Sometimes I prefer running commands directly from the terminal. You can use Curl to perform the same checks.
Step 1: List Snapshots
curl -X GET "https://test.rootsaid.com/dashboards/_cat/snapshots/my-snapshot-repo?v"
Step 2: Check Individual Snapshot Status
curl -X GET "https://test.rootsaid.com/dashboards/_snapshot/my-snapshot-repo/snapshot_name/_status"
If you see a 404 error or a “Blob object not found” message, it’s a clear indicator of corruption or missing metadata.
However, checking each snapshot manually is time-consuming, especially if you have hundreds of them. That’s where automation comes in.
Automating Corruption Detection with Python
After checking manually for a few days, I decided to automate the entire process. I wrote a Python script that would:
- Fetch the list of all snapshots from the repository.
- Check the /_status of each snapshot.
- Identify missing or corrupted ones.
- Generate a summary report.
Here’s a simplified version of what that script looked like.
import requests

# Cluster endpoint and the repository to scan
os_endpoint = "https://test.rootsaid.com/dashboards/"
repo_name = "my-snapshot-repo"

def list_snapshots():
    # Return the IDs of every snapshot in the repository
    url = f"{os_endpoint}_cat/snapshots/{repo_name}?h=id&format=json"
    response = requests.get(url, verify=False)
    return [snap["id"] for snap in response.json()]

def check_snapshot(snapshot):
    # Flag snapshots whose /_status call fails or that report failed shards
    url = f"{os_endpoint}_snapshot/{repo_name}/{snapshot}/_status"
    response = requests.get(url, verify=False)
    if response.status_code != 200:
        print(f"Snapshot {snapshot} seems corrupted or missing")
    else:
        data = response.json()
        # The failed shard count lives under shards_stats in the /_status response
        if data["snapshots"][0]["shards_stats"]["failed"] > 0:
            print(f"Snapshot {snapshot} has failed shards")

snapshots = list_snapshots()
for s in snapshots:
    check_snapshot(s)
This script automatically detects corrupted snapshots by looping through all existing ones and checking their health individually.
Handling Timeouts and Scaling Up
When I first ran this script on my production environment, it timed out. I realized that when you have hundreds of snapshots, checking them one by one takes far too long.
To fix that, I added parallel processing using Python’s ThreadPoolExecutor, allowing multiple snapshots to be checked at once. This drastically reduced the total execution time.
I also added a whitelist for older snapshots I didn’t want to check again, along with retry logic to handle temporary network issues.
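Here is a minimal sketch of that parallel version. It assumes the list_snapshots() and check_snapshot() functions from the script above, and the worker count and the example whitelist entry are placeholders rather than tuned recommendations.

from concurrent.futures import ThreadPoolExecutor, as_completed

# Snapshots already verified in earlier runs; skip them this time (placeholder entry)
already_verified = {"old-snapshot-2022-01-01"}

snapshots = [s for s in list_snapshots() if s not in already_verified]

# Check up to 10 snapshots at a time instead of one by one
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(check_snapshot, snap): snap for snap in snapshots}
    for future in as_completed(futures):
        future.result()  # surface any unexpected exception from a worker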
Preventing Snapshot Corruption
Once I understood why corruption happens, I started taking a few preventive measures:
1. Use Dedicated Repositories
Avoid using the same repository for multiple clusters.
2. Monitor Repository Health
Periodically list snapshots and check their statuses to detect early signs of corruption.
3. Enable Retry and Timeout Handling
When creating snapshots through automation, use retry mechanisms for transient network or S3 errors.
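As a rough example, a retry wrapper around the snapshot-creation call could look like the sketch below; the attempt count, wait time, and timeout are illustrative placeholders, not recommendations.

import time
import requests

def create_snapshot_with_retry(endpoint, repo, snapshot, attempts=3, wait=30):
    # PUT _snapshot/<repo>/<snapshot>?wait_for_completion=true blocks until the snapshot finishes
    url = f"{endpoint}_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    for attempt in range(1, attempts + 1):
        try:
            response = requests.put(url, verify=False, timeout=600)
            if response.status_code == 200:
                return response.json()
            print(f"Attempt {attempt}: got HTTP {response.status_code}, retrying")
        except requests.exceptions.RequestException as exc:
            print(f"Attempt {attempt}: request failed ({exc}), retrying")
        time.sleep(wait)
    raise RuntimeError(f"Snapshot {snapshot} was not created after {attempts} attempts")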
4. Run Periodic Cleanup
Use the repository cleanup API (POST _snapshot/&lt;repository&gt;/_cleanup) to remove dangling or unreferenced files, as shown in the sketch below.
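A minimal sketch of calling the cleanup API with requests, assuming the same endpoint and repository as the script above:

import requests

os_endpoint = "https://test.rootsaid.com/dashboards/"
repo_name = "my-snapshot-repo"

# POST _snapshot/<repository>/_cleanup removes blobs no longer referenced by any snapshot
response = requests.post(f"{os_endpoint}_snapshot/{repo_name}/_cleanup", verify=False)
print(response.json())  # reports how many blobs and bytes were removed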
Conclusion
Detecting corrupted snapshots in OpenSearch might sound complex, but once you understand how the system works, it becomes much easier. I started with basic commands in DEV Tools, moved to Curl, and finally automated the entire process with Python.
If you’re managing large OpenSearch repositories, setting up a script like this can save you hours of manual work and help you catch snapshot issues before they cause data loss.
By understanding why snapshots fail and proactively monitoring them, you can ensure that your OpenSearch backups remain healthy, consistent, and ready when you need them.
