Back in July, New World, an MMORPG from Amazon, suddenly started killing graphics cards, seemingly out of nowhere. A number of GPUs were affected, but the most prominent among them was the RTX 3090, and in particular EVGA’s RTX 3090s. As expected, gamers went into a frenzy over their expensive hardware, since the mere beta of an upcoming game had supposedly killed their shiny new cards. Now, speaking to PCWorld, a spokesman from EVGA has clarified the matter.
The story so far
When the epidemic of dying GPUs started, the general consensus was that uncapped frame rates were to blame. Most games these days limit the frame rate in menus, so that it doesn’t suddenly jump to 500 FPS while loading and then dip back down to 100 FPS once you’re actually in the game. New World’s beta didn’t have such a limit. The assumption was that the sudden jumps in frame rate stressed the card to the point where its cooling couldn’t keep up.
After reports became widespread, New World’s developer, Amazon Games, rushed out an update that capped the frame rate in menus, which seemed to fix the problem. There were essentially no reports of bricked GPUs after this. EVGA also stated that the issue was limited to RTX 3090s, even though complaints had been registered about other GPUs, such as RTX 3080s and AMD Radeon units. Amazon did not accept responsibility and even released a statement describing how the game was not at fault.
Once EVGA acknowledged the issue, it reportedly accepted all RMAs and sent affected customers a replacement card without waiting for the broken card to arrive first. EVGA would not disclose how many 3090s it had sold in total or how many were affected, but it did confirm that “less than 1%” of all the cards sold were bricked. It’s also noteworthy that, at the height of the situation, EVGA charged extra, relative to MSRP, to accelerate RMA requests.
It’s not the fan controller
Many thought that the fans’ micro-controller was giving out because it could not keep up with the number of frames being generated. Supposedly, the fan controller was malfunctioning, so when temperatures on the card suddenly jumped due to high frame rates in menus, the fans would spin at insanely high RPMs to cool down the GPU. This was thought to be why the cards were failing.
EVGA, however, dismissed this claim entirely and gave another explanation for why various monitoring software showed the fan controller as broken. What actually happened was that the I2C bus on the PCB was producing noise, which hardware monitoring tools such as HWiNFO could not interpret properly. Instead, they reported the noise as fans spinning at unbelievably high RPMs, which looked like a fan-controller failure. EVGA’s own Precision X1 software, however, reported the fan speeds correctly.
Since this issue was highlighted, EVGA has released an update for its micro-controllers that lets them coordinate better with third-party software, so that it shows the correct fan RPMs. If you were one of the handful affected, EVGA recommends downloading the Precision X1 software. Once installed, it will let you update the micro-controller easily. Also make sure the monitoring software you’re using is up to date.
So, what the heck really killed all those 3090s?
An internal investigation at EVGA revealed that cards produced the year prior, in 2020, had “poor workmanship” on the PCB around the MOSFET circuits. EVGA came to this conclusion by putting the two dozen faulty cards it received under an X-ray machine and analyzing them thoroughly. The spokesman also stated that the fan controller was not at fault, as discussed above.
EVGA says it worked in tandem with Nvidia and Amazon to acquire the pre-cap version of New World, the build without the frame rate limiter that killed all those poor 3090s. The manufacturer tried to replicate the issue but couldn’t. It nevertheless tried to recreate conditions, both on the GPU and in the game, that mimicked those of the affected users. That’s how EVGA was able to uncover the real culprit behind the problem.
Quality Control concerns
Now that it’s clear that poor PCB design was the reason for all of this horror, it raises the question: why didn’t EVGA fix this sooner? EVGA cared enough to change the PCB design starting in 2021, which means it was aware of the loose tolerances around the electrical circuits. But instead of revising the cards themselves, it chose to stay quiet. Only when tragedy struck did EVGA decide to RMA the cards and compensate the affected users.
Put into context, all of this paints a clear picture: EVGA knew but did not care. It sold the units knowing that they had design flaws. Even if we assume for a moment that EVGA was unaware of the flaw, that reflects poorly on its quality control. An issue that existed last year went unnoticed for almost a year. All those cards passed QC tests and were shipped to customers with a major problem that was only rectified months later.