Fixing a broken Defcoin mining pool; a saga

Follow along in my journey of fixing a broken NOMP/MPOS Defcoin mining pool. It wasn’t a public pool; it was my own personal solo mining pool. The idea was that it would eventually become public, but you know how it is: sometimes it takes time to get around to doing things. Doing something for me is easy; doing it for the public requires much more careful thought and planning.

Careful thought and planning that I wasn’t executing last year, sometime between February and April, when I haphazardly ran an apt upgrade on the Ubuntu 18.04 VM that was running my pool. I didn’t think anything of it. It was a busy time. I was in Vegas for a while in February, then I came home and went to BSides Nova, then the world shut down and mining Defcoin was just not on my mind.

I noticed that my wallet wasn’t getting fatter, so I logged in to take a look and realized the VM’s disk was 100% full. The shares table was 3GB. It wasn’t important at the time, so I abandoned it in place.

Cut to this week, when bashNinja and others are talking about doing some work on Defcoin. Pools are popping up, people are getting excited again, there’s talk of forks, and I’m right there paying attention, because sure, I want my pool up and running again. But man, I’m not looking forward to figuring out this software made of black magic and rickety scaffolding and held together with government cheese. I barely got it running the first time, I clearly didn’t understand it.

So, reluctantly, I started digging. First, my Defcoin core wallet was not talking to any peers. It only had one peer address and it couldn’t connect. Well, it had been a year. I asked on bashNinja’s Discord about that, and got a quick and easy response. I was pointed to a post in /r/defcoin that contained a list of peers that can be manually added via the defcoin-qt debug console window. Once I did that, it started to talk to peers again, and began to wriggle its way towards 2021 on its own time.

Second, the shares table. Nothing can work with that table in that state, everything’s just running too damn slow. 17 million rows. So…

Let’s clear that table. At this point I have no idea whether it will prevent the rest of the system from working. [Keep in mind that I never gained a full understanding of how the system is strung together; I just got it working and let it go. So at this point, I’m reverse engineering something I slapped together myself.] But in case I need it, I’ll back it up first:

CREATE TABLE shares_manual_backup LIKE shares;
INSERT INTO shares_manual_backup SELECT * FROM shares;

Then, once I confirmed everything copied, I deleted every row from shares. This allowed me to navigate, and allowed the WebGUI to respond again. I needed that; there’s valuable troubleshooting info hiding in there.
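
As a rough sketch of that backup-then-clear step, here it is in Python against an in-memory SQLite database standing in for the pool’s MySQL. The three-column shares table and the backup_and_clear helper are made up for illustration. SQLite has no CREATE TABLE ... LIKE, so the sketch uses CREATE TABLE ... AS SELECT instead, which copies the rows but not indexes or keys.

```python
import sqlite3


def backup_and_clear(conn, table, backup):
    """Copy every row of `table` into a new `backup` table, verify, then empty `table`.

    Table names are interpolated directly, so they must be trusted constants.
    """
    cur = conn.cursor()
    # Stand-in for MySQL's CREATE TABLE ... LIKE plus INSERT ... SELECT.
    cur.execute(f"CREATE TABLE {backup} AS SELECT * FROM {table}")
    original = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    copied = cur.execute(f"SELECT COUNT(*) FROM {backup}").fetchone()[0]
    # Refuse to delete anything unless the copy is complete.
    if original != copied:
        raise RuntimeError(f"backup has {copied} rows, expected {original}")
    cur.execute(f"DELETE FROM {table}")
    conn.commit()
    return copied


# Hypothetical, heavily simplified shares table with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shares (id INTEGER PRIMARY KEY, username TEXT, solution TEXT)")
conn.executemany(
    "INSERT INTO shares (username, solution) VALUES (?, ?)",
    [("worker1", ""), ("worker1", "abc123"), ("worker2", "")],
)

moved = backup_and_clear(conn, "shares", "shares_manual_backup")
remaining = conn.execute("SELECT COUNT(*) FROM shares").fetchone()[0]
backed_up = conn.execute("SELECT COUNT(*) FROM shares_manual_backup").fetchone()[0]
print(moved, remaining, backed_up)
```

The row-count comparison is the “once I confirmed everything copied” part: the delete only happens if the two counts match.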

So browsing around the GUI, I see that all the cron jobs have been disabled. It took me a while to remember where to find and fix that. I don’t know why an interface wasn’t created for it. How it works, I learned, is that if one of the cron jobs or its subtasks fails, it updates or adds a row in the monitoring table to indicate that the job is disabled. The job then no longer runs from cron, forcing the administrator to address the underlying issue before it gets worse.

I tried enabling them and running them, but they just reverted to disabled. So I dug around to find where MPOS logs the results of those cron jobs, and I found them: /home/(username)/mpos/logs/(jobname)/log_(date)etc. I found very strange results in those log files. Problems with scripts that I hadn’t changed. Curiouser and curiouser.

So again this took a while, but eventually I happened upon a clue: a script failing because a command had been deprecated in PHP 8. So now it’s starting to dawn on me that my update might have caused this. Also, it’s having trouble finding memcached, which I know is installed. I don’t quite understand, until…

OK, I’ll add a phpinfo file to the public-facing web area of MPOS. Go to it. Sure enough, no memcached. But wait. This says we’re running PHP 7.3. How can this be? Back to the command line. php -v shows PHP 8.0. What is this trickery??? OK. Since the problem is clearly on the command-line side, because the cron jobs are failing, let’s try backing the command-line version down to PHP 7.3. This can be done with update-alternatives.
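
The underlying gotcha is that the web server and the command line can each be wired to a different PHP, and the cron jobs use the command-line one. Here is a small sketch of the comparison in Python. The php_major_minor helper and both version strings are invented samples, not output from the pool.

```python
import re


def php_major_minor(version_text):
    """Pull 'major.minor' out of text such as the first line of `php -v`."""
    match = re.search(r"PHP (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no PHP version found in text")
    return f"{match.group(1)}.{match.group(2)}"


# Sample strings only: what `php -v` and a phpinfo page might plausibly show.
cli_banner = "PHP 8.0.3 (cli) (built: Mar  5 2021 07:54:13) ( NTS )"
web_banner = "PHP 7.3.27"

cli = php_major_minor(cli_banner)
web = php_major_minor(web_banner)
mismatch = cli != web

if mismatch:
    print(f"CLI runs PHP {cli} but the web server runs PHP {web}")
```

On Debian and Ubuntu, update-alternatives --config php shows an interactive menu of the installed command-line versions to pick from.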

That worked. Now we’re getting different errors.

2021-04-14 18:53:49 - ERROR --> Failed to update share ID in database for block 1273177: SQL Query failed: 2006
2021-04-14 18:53:49 - ERROR --> Failed to update worker ID in database for block 1273177: SQL Query failed: 2006
2021-04-14 18:53:49 - ERROR --> Failed to update share count in database for block 1273177: SQL Query failed: 2006
2021-04-14 18:53:49 - CRIT --> E0005: Unable to fetch blocks upstream share, aborted:Unable to find valid upstream share for block: 1273178
2021-04-14 18:53:49 - INFO --> |    23103 |    1273178 |           24.75 |            |                           | []              |                 |          any_share |
2021-04-14 18:53:49 - ERROR --> Failed to update share ID in database for block 1273178: SQL Query failed: 2006
2021-04-14 18:53:49 - ERROR --> Failed to update worker ID in database for block 1273178: SQL Query failed: 2006
2021-04-14 18:53:49 - ERROR --> Failed to update share count in database for block 1273178: SQL Query failed: 2006
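
When a log runs to thousands of lines like these, a quick tally by level and SQL error code helps with triage. The sketch below assumes only the line shape visible in the excerpt (timestamp, level, arrow, message); the summarize helper is hypothetical. For what it’s worth, 2006 is the MySQL client code for “MySQL server has gone away,” meaning the connection dropped.

```python
import re
from collections import Counter

# Line shape taken from the excerpt: "<date> <time> - <LEVEL> --> <message>"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (\w+) --> (.*)$")
SQL_CODE = re.compile(r"SQL Query failed: (\d+)")


def summarize(lines):
    """Count log lines per level and per SQL error code."""
    levels = Counter()
    sql_codes = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not in the expected shape
        _, level, message = match.groups()
        levels[level] += 1
        code = SQL_CODE.search(message)
        if code is not None:
            sql_codes[code.group(1)] += 1
    return levels, sql_codes


sample = [
    "2021-04-14 18:53:49 - ERROR --> Failed to update share count in database for block 1273177: SQL Query failed: 2006",
    "2021-04-14 18:53:49 - CRIT --> E0005: Unable to fetch blocks upstream share, aborted:Unable to find valid upstream share for block: 1273178",
    "2021-04-14 18:53:49 - ERROR --> Failed to update share ID in database for block 1273178: SQL Query failed: 2006",
]

levels, sql_codes = summarize(sample)
print(dict(levels), dict(sql_codes))
```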

OK, this makes sense. Of course it can’t associate share IDs with blocks; I’ve wiped out the shares table! So let’s look closer at the shares table, because I’m really hesitant to dump 17 million records back in. Looking closer at the data, the way it associates a share ID with a block is the “solution” field in the shares table, which maps to the “blockhash” field in the blocks table. A couple of quick count queries reveal that of the 17 million records in the shares table, currently relocated, fewer than 7,000 contain a populated solution field. Those are the shares that resulted in a blockhash. So, on a hunch, I select just those rows back into shares and run the findblocks command again. Lo and behold, it’s not failing. It’s taking its time, though. About two seconds for every three records. So this “fix,” assuming it works, will take a while.
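
The count-then-restore idea can be sketched like this, again in Python with in-memory SQLite standing in for MySQL. Only the solution and blockhash column names come from the real tables; the rest of the schema, the sample rows, and the definition of “populated” (neither NULL nor empty) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, pared-down versions of the three tables involved.
conn.executescript(
    """
    CREATE TABLE shares (id INTEGER PRIMARY KEY, solution TEXT);
    CREATE TABLE shares_manual_backup (id INTEGER PRIMARY KEY, solution TEXT);
    CREATE TABLE blocks (id INTEGER PRIMARY KEY, blockhash TEXT);
    """
)
conn.executemany(
    "INSERT INTO shares_manual_backup (id, solution) VALUES (?, ?)",
    [(1, ""), (2, "hash_a"), (3, None), (4, "hash_b"), (5, "")],
)
conn.executemany("INSERT INTO blocks (blockhash) VALUES (?)", [("hash_a",), ("hash_b",)])

# Assumed meaning of "populated": neither NULL nor an empty string.
populated = "solution IS NOT NULL AND solution <> ''"

# The quick count queries: how many rows exist, and how many carry a solution.
total = conn.execute("SELECT COUNT(*) FROM shares_manual_backup").fetchone()[0]
with_solution = conn.execute(
    f"SELECT COUNT(*) FROM shares_manual_backup WHERE {populated}"
).fetchone()[0]

# Restore only the rows that can be tied to a block.
conn.execute(f"INSERT INTO shares SELECT * FROM shares_manual_backup WHERE {populated}")
conn.commit()

# Sanity check: every block should now find its share via solution = blockhash.
matched = conn.execute(
    "SELECT COUNT(*) FROM blocks b JOIN shares s ON s.solution = b.blockhash"
).fetchone()[0]
print(total, with_solution, matched)
```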

I let it run for a while, and then I tentatively give the pps-payout script a poke, since that’s another one that was failing instantly because it wasn’t finding any shares that matched its criteria. Sure enough, it’s able to chew on the data that findblocks is now fixing. Good.

So the way the scripts are re-enabled is, you fix the underlying problem, then run the script with the -f argument. If it succeeds, it re-enables the cron job. So it’s important to check that, because any problem can cause a cascade of further problems that eventually kill the system.
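
To make the disable and re-enable cycle concrete, here is a toy model of the behavior as I’ve described it. This is not MPOS code: the JobMonitor class, its in-memory flag dictionary, and the force parameter (standing in for the -f argument) are all invented for illustration.

```python
class JobMonitor:
    """Toy model of a job runner that disables a job after it fails."""

    def __init__(self):
        # Job name -> disabled flag; a stand-in for rows in a monitoring table.
        self.disabled = {}

    def run(self, name, task, force=False):
        # A disabled job is skipped unless the run is forced.
        if self.disabled.get(name, False) and not force:
            return "skipped"
        try:
            task()
        except Exception:
            # Any failure disables the job until someone intervenes.
            self.disabled[name] = True
            return "failed"
        # A successful run, forced or not, leaves the job enabled.
        self.disabled[name] = False
        return "ok"


state = {"fixed": False}


def payout():
    # Pretend job that fails until the underlying problem is fixed.
    if not state["fixed"]:
        raise RuntimeError("underlying problem still present")


monitor = JobMonitor()
results = [monitor.run("payouts", payout)]                  # fails, gets disabled
results.append(monitor.run("payouts", payout))              # cron run is skipped
state["fixed"] = True                                       # admin fixes the cause
results.append(monitor.run("payouts", payout, force=True))  # forced run re-enables
results.append(monitor.run("payouts", payout))              # cron runs normally again
print(results)
```

A forced run that still fails leaves the job disabled, which is why it’s worth checking the job’s status after every forced run.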

I probably won’t know until midnight tonight whether I’m finished with my NOMP/MPOS deep dive, but I will sleep well knowing that I’ve taken it far from the broken state it was in, and I’ve learned a lot along the way. Oh, and I documented everything I found in my personal GitLab issues and wiki for the project, so even if I unlearn it, it’ll be less painful next time.

One Reply to “Fixing a broken Defcoin mining pool; a saga”

  1. Just in case anyone happens upon this because they’re experiencing similar pain or trying to diagnose pool issues, I have a minor update. The cron jobs kept getting disabled for various reasons. Once all the catching up was done, the only one still disabling itself was the payouts cron job. The problem is, if that doesn’t run, everything grows and gets stupid. It’s really the core.

    So my problem was that the payout transfer was continuously failing, either with an INVALID AMOUNT error or with an error stating that the wallet had to be unlocked with a passphrase before issuing the transfer.

    The INVALID AMOUNT error seems to be an issue with the line that assigns the payment minus the transaction fee, somewhere just before line 191 in mpos/cronjobs/payouts.php. The line looks like this by default:

    $aSendMany[$aUserData['coin_address']] = $aUserData['confirmed'] - $config['txfee_manual'];

    I saw some posts suggesting it was related to precision, i.e. the RPC function can’t handle more than 8 decimal places, and that rounding to 8 decimal places would solve the problem. I wasn’t able to get that rounding in place inline; I started getting other “not a string” type errors. So I moved the rounding and calculation into a separate variable, then passed that variable instead of the calculation in the rpc_txid function. Note that there are numerous instances of this function in the payouts file, depending on whether this is an auto payout vs. a manual payout, etc. I added diagnostic info to each of the logError statements for each instance, so I’d know which one specifically was being executed. Since I was working on auto payouts, I ended up correcting the instance near line 181. I say near because I added comments that shifted line numbers a bit.

    I also found references for the wallet passphrase error, and what some folks have said is that this doesn’t work on an encrypted wallet. It can be made to work, obviously, by wedging in a wallet unlock command, but that’s probably about as unsafe as having an unencrypted wallet, so you’re on your own for making decisions on how to solve that in your case, should you come across it.
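
The precision problem behind the INVALID AMOUNT error in the reply above is easy to reproduce outside PHP. The sketch below is a Python illustration of the idea, not the PHP change itself: compute the payout in its own variable, cap it at 8 decimal places, then hand that variable to the RPC call. The payout_amount helper is hypothetical, and truncating rather than rounding to nearest is my own assumption, chosen so a payout never exceeds what is owed.

```python
from decimal import Decimal, ROUND_DOWN

EIGHT_PLACES = Decimal("0.00000001")  # smallest unit the RPC is said to accept


def payout_amount(confirmed, txfee):
    """Subtract the fee, then truncate the result to 8 decimal places."""
    # Going through str() avoids importing binary floating-point noise.
    amount = Decimal(str(confirmed)) - Decimal(str(txfee))
    return amount.quantize(EIGHT_PLACES, rounding=ROUND_DOWN)


# Plain float subtraction can produce more than 8 decimal places.
naive = 0.3 - 0.1

# The same subtraction done in decimal and capped at 8 places.
clean = payout_amount(0.3, 0.1)

# A balance carrying a ninth decimal place gets truncated.
safe = payout_amount("12.123456789", "0.1")
print(naive, clean, safe)
```

Whether truncating or rounding to nearest is the right policy is a pool-by-pool decision; the point is only that the value handed to the RPC call has no more than 8 decimal places.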
