Spoiler: I have data from the story in the title of this post, it’s mostly what I expected it to be, I’ve just added it to HIBP where I’ve called it “Data Troll”, and I’ll give everyone a lot more context below. Here goes:
Headlines one-upping each other on the number of passwords exposed in a data breach have become something of a sport in recent years. Each new story wants to present a number that surpasses the previous story, and the clickbait cycle continues. You can see it coming a mile away, and you just know the reality is somewhat less than the headline, but how much less?
And so it was in June when a story with this title hit the headlines: 16 billion passwords exposed in record-breaking data breach. I thought this would be another standard run-of-the-mill sensational headline that would catch a few eyeballs for a couple of days then be forgotten, but no, apparently not. It started with a huge amount of interest in Have I Been Pwned:
That’s Google searches for my “little” project, which I found odd, because we hadn’t put any data in HIBP! But that initial story gained so much traction and entered the mainstream media to the extent that many publications directed people to HIBP, and inevitably, there was a bunch of searching done to work out what the service actually was. And the news is still coming out – this story landed on AOL just last week:
You know it’s serious because of all the red and exclamation marks… but per the article, “you needn’t panic”
Enough speculating, let’s get into what’s actually in here, and for that, I went straight to the source:
Bob is a quality researcher who has been very successful over the years at sniffing out breached data, some of which had previously ended up in HIBP as a result of his good work. So we had a chat about this trove, and the first thing he made clear was that this isn’t a single source of exposure, but rather different infostealer data sets that have been publicly exposed this year. The headlines implying this was a massive breach are misleading; stealer logs are produced from individually compromised machines and often bundled up and redistributed. Bob also pointed out that many of the data sets were no longer exposed, and he didn’t have a copy of all of them. But he did have a subset of the data he was happy to send over for HIBP, so let’s analyse that.
All told, the data Bob sent contained 10 JSON files totalling 775GB across 2.7B rows. An initial cursory check against HIBP showed more than 90% of the email addresses were already in there, and of those that were in previous stealer logs, there was a high correlation of matching website domains. What I mean by this is that if the data Bob sent had someone’s email address and password captured when logging into Netflix and Spotify, that person was probably already in HIBP’s stealer logs against Netflix and Spotify. In other words, there’s a lot of data we’ve seen before.
So, what do we make of all this, especially since the corpus Bob sent is about 17% of the reported 16B headline? Let me speak generally about how these data sets tend to have hyperbolic headlines, and the numbers of actual impact are way smaller:
- There’s usually duplication across files, as the same data appears multiple times
- There’s also often duplication within the same file, again, as the same data reappears
- A “row” is an instance of someone’s email address and password listed next to a website they’re logging onto, so 100 distinct rows may all be one person
The corpus of data I received contained 2.7B rows, of which I was able to extract 325M unique stealer log entries. That’s the number of rows I could successfully parse website, email address and password values out of. In my earlier example with the one person’s credentials captured for both Netflix and Spotify, that would mean two unique stealer log records. All of this then distilled down to 109M unique email addresses across all the files, and that’s the number you’ll now see in HIBP. In other words, 2.7B -> 109M is a 96% reduction from headline to people. Could we apply the same maths to the 16B headline? We’ll never know for sure, but I betcha the decrease is even greater; I doubt additional corpuses to the tune of that many billion would continue to add new email addresses, and the duplication ratio would increase.
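To make the rows-to-people arithmetic concrete, here’s a minimal Python sketch of that kind of distillation. It is not the code used for HIBP: the `Entry` type, the `summarise` and `reduction` helpers, the lower-casing of websites and addresses, and the sample rows are all assumptions made purely for illustration.

```python
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    """One parsed stealer log row (hypothetical shape)."""
    website: str
    email: str
    password: str


def summarise(rows: Iterable[Entry]) -> dict:
    """Count raw rows, unique (website, email, password) entries and unique emails."""
    total = 0
    unique_entries = set()
    unique_emails = set()
    for row in rows:
        total += 1
        # Assumption: websites and addresses are compared case-insensitively,
        # passwords are compared exactly.
        entry = Entry(row.website.lower(), row.email.lower(), row.password)
        unique_entries.add(entry)
        unique_emails.add(entry.email)
    return {"rows": total, "entries": len(unique_entries), "emails": len(unique_emails)}


def reduction(before: float, after: float) -> float:
    """Fraction by which a count shrinks going from `before` to `after`."""
    return 1 - after / before


# Made-up sample: one person, two websites, four raw rows.
rows = [
    Entry("netflix.com", "alice@example.com", "hunter2"),
    Entry("spotify.com", "alice@example.com", "hunter2"),
    Entry("netflix.com", "alice@example.com", "hunter2"),  # exact duplicate
    Entry("Netflix.com", "Alice@example.com", "hunter2"),  # duplicate after normalising
]
summary = summarise(rows)
print(summary)  # {'rows': 4, 'entries': 2, 'emails': 1}

# The figures from this post: 2.7B rows down to 109M unique addresses.
print(f"{reduction(2.7e9, 109e6):.0%}")  # 96%
```

Four rows collapse to two stealer log entries and one person, which is the same shape as 2.7B rows collapsing to 325M entries and 109M addresses.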
Because it always comes up after loading stealer logs, a quick caveat:
Not all email addresses loaded into this breach will contain corresponding stealer log entries. This is because we have one process to regex out all the addresses (the code is open source), and another process that pulls rows with email addresses against valid websites and passwords.
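Here’s a rough Python sketch of why those two processes produce different counts. Everything in it is an assumption for illustration: the simplified `EMAIL_RE` pattern is not the open source extractor, and the JSON keys `url`, `username` and `password` are hypothetical field names rather than the layout of the real files.

```python
import json
import re
from urllib.parse import urlparse

# Simplified pattern for illustration only; the real extractor is more careful.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(raw_lines):
    """Process 1: pull every address out of the raw text, whatever surrounds it."""
    found = set()
    for line in raw_lines:
        found.update(match.lower() for match in EMAIL_RE.findall(line))
    return found


def parse_entries(raw_lines):
    """Process 2: keep only rows with a usable website, email address and password."""
    entries = set()
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not parseable as a row at all
        if not isinstance(record, dict):
            continue
        host = urlparse(str(record.get("url", ""))).hostname
        email = str(record.get("username", "")).lower()
        password = record.get("password")
        if (host and "." in host and EMAIL_RE.fullmatch(email)
                and isinstance(password, str) and password):
            entries.add((host, email, password))
    return entries


# Made-up sample lines.
lines = [
    '{"url": "https://netflix.com/login", "username": "alice@example.com", "password": "hunter2"}',
    '{"url": "", "username": "bob@example.com", "password": "secret"}',  # no website
    "garbage carol@example.com trailing text",  # not JSON
]
all_emails = extract_emails(lines)
entries = parse_entries(lines)
without_entries = all_emails - {email for _, email, _ in entries}
print(sorted(without_entries))  # ['bob@example.com', 'carol@example.com']
```

In this sample, all three addresses would be loaded, but only one of them has a stealer log entry behind it.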
And because I’ll end up copying and pasting this over and over again in responses to queries, another caveat:
Presence in a stealer log is often an indicator of an infected device, but we have no data to indicate when it was infected. There will be a lot of old data in here, just as there’s a lot of repackaged data.
Of the passwords in valid stealer log entries, there were 231M unique ones, and we’d seen 96% of them before. These are now all in Pwned Passwords with updated prevalence counts and are searchable via the website and, of course, via the API. Speaking of which, these passwords are currently being searched a lot:
Every time I look, there’s another billion (or two) pic.twitter.com/X7gflzWdCH
— Troy Hunt (@troyhunt) July 30, 2025
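For anyone who hasn’t used the Pwned Passwords API, here’s a minimal Python sketch of a range search: only the first 5 characters of the password’s SHA-1 hash are sent, and the response lists matching hash suffixes with their prevalence counts. The helper names, the `User-Agent` value and the canned response body (including its counts) are made up for illustration, and `fetch_range` is defined but deliberately not called here.

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_hash(password: str) -> tuple[str, str]:
    """Return the 5-character SHA-1 prefix and the 35-character suffix."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_response(suffix: str, body: str) -> int:
    """Find our suffix in a 'SUFFIX:COUNT' response body; 0 means not found."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def fetch_range(prefix: str) -> str:
    """Fetch all suffixes for a prefix (network call, not exercised below)."""
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "pwned-passwords-sketch"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


prefix, suffix = split_hash("password")
# Canned body with invented counts, standing in for fetch_range(prefix).
canned_body = f"0018A45C4D1DEF81644B54AB7F969B88D65:3\r\n{suffix}:42\r\n"
print(prefix)                                  # 5BAA6
print(count_in_response(suffix, canned_body))  # 42, from the canned body
```

The password itself never leaves the machine, and neither does its full hash; the match against the suffix happens locally.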
Of the 109M email addresses we could parse out of the corpus, 96% of them were already in HIBP (that number coincidentally matches the percentage of existing passwords we observed). They weren’t all from previous stealer logs, of course, but anecdotally, during my testing, I found a lot of crossover between this one and the ALIEN TXTBASE logs from earlier this year. Regardless, we added 4.4M new addresses from Data Troll that we’d never seen before, so that alone is significant. Not significant enough to justify hyperbolic headlines to the effect of “largest ever”, but still sizeable.
To summarise:
- The 16B headline distils down to a much smaller number of unique values of actual impact
- The data is largely from stealer logs that have been circulating for some time now
- It’s definitely not fresh and doesn’t pose any new risks that weren’t already present
And finally, there’s that “Data Troll” name. When I first saw this story getting so much traction, the image I had in my mind was of a troll sitting on stashes of data. The mass media then picked this up and turned it into deliberately provocative headlines, manipulating the narrative to seek attention. Hopefully, this post tempers all that a little bit and brings some sanity back into the discussion. We need to take data exposures like this seriously, but it really didn’t deserve the attention it got.