With all the scary stuff you’re hearing about in the news this week, I thought I’d inject a little bit of light-hearted storytelling.
Long ago and far away, I inherited a network. Then I was tasked with relocating it to a new room. This was successful, and everyone lived happily ever after.
Until I checked in on it later and discovered that the backups were failing. Not only were they taking days to complete (or fail), but the restore points were becoming corrupted, which took even more time to repair on top of an already excruciatingly slow (6 Mbps) backup.
I looked at networking. I looked at server bottlenecks. I manually deleted restore points to eliminate the extra delay of rebuilding corrupted ones. I was truly confused, so I looked deeper. Fearing a drive media failure, I looked at the device from which the backup drive was shared.
That’s when it hit me. The “backup” VM I had logged into to find the location of the network share was NOT the same server as the backup server from which I was administering the backups via the web.
Looking closer, I discovered that the backups were running on TWO separate backup servers. And yes, you guessed it. To the SAME Nakivo backup repository. Or even worse, to two identical configurations of “the same” repository. Disastrous. Backups were stepping on each other, corrupting each other, and slowing each other down. It seems the engineer who built the network was unhappy with performance on one server, so he just unscheduled the jobs there and built a newer, faster server to run the backups. After the move, I guess I came across the old server instead of the correct one and re-enabled its jobs, thinking they had been disabled for the move.
The moral of the story is this. When you migrate backups from one server to another because of speed, don’t just unschedule the jobs, because someone may reschedule them in the future. Take the extra step of deleting or disabling the jobs on the outgoing server, or do what I did after resolving this debacle. Since I couldn’t disable the old backup web interface (for reasons), I added a fake job with no targets, called “DONT-RUN-JOBS-HERE”, to remind anyone who happens upon it in the future, and I updated the “where is everything” document to point to the newer location.
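If you keep a simple inventory of which server runs which job, a few lines of Python could have flagged my problem long before the restore points started rotting. This is only a rough sketch. The server names, share paths, and the inventory format are all made up for illustration, and it does not talk to Nakivo or any other backup product. You would have to fill in the list by hand or from whatever export your tooling offers.

```python
from collections import defaultdict


def find_shared_repositories(jobs):
    """Return {repository: [servers]} for every repository that has
    enabled jobs on more than one backup server."""
    writers = defaultdict(set)
    for job in jobs:
        if job["enabled"]:
            # Normalize so "//nas01/backups/" and "//NAS01/backups" match.
            repo = job["repository"].rstrip("/").lower()
            writers[repo].add(job["server"])
    return {
        repo: sorted(servers)
        for repo, servers in writers.items()
        if len(servers) > 1
    }


# Hypothetical inventory: every name and path below is invented.
inventory = [
    {"server": "backup-old", "job": "nightly-vms",
     "repository": "//nas01/backups/", "enabled": True},
    {"server": "backup-new", "job": "nightly-vms",
     "repository": "//nas01/backups", "enabled": True},
    {"server": "backup-new", "job": "weekly-files",
     "repository": "//nas02/archive", "enabled": True},
]

conflicts = find_shared_repositories(inventory)
for repo, servers in conflicts.items():
    print(f"WARNING: {repo} is written by {', '.join(servers)}")
```

Anything this prints is a repository with two servers writing to it, which is exactly the situation I walked into. Jobs marked as not enabled are ignored, so a properly retired server stays quiet.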