Presently sponsored by: 1Password Extended Access Management: Secure every sign-in for every app on every device.

Begging for Bounties and More Info Stealer Logs

TL;DR — Tens of millions of credentials obtained from info stealer logs populated by malware were posted to Telegram channels last month and used to shake down companies for bug bounties under the misrepresentation the data originated from their service.

How many attempted scams do you get each day? I woke up to yet another "redeem your points" SMS this morning, I'll probably receive a phone call from "my bank" today (edit: I was close, it was "Amazon Prime" 🤷‍♂️) and don't even get me started on my inbox. We're bombarded to the point of desensitisation, which itself is dangerous because it creates the risk of inadvertently dismissing something that really does require your attention. Which brings me to the email Scott Helme from Report URI (disclosure: a service I've long partnered with and advised) received yesterday titled "Bug bounty Program - PII leak Credentials more than 170". It began as follows:

Through open-source intelligence gathering, I discovered a significant amount of "report-uri.com" user credentials and sensitive documents have been leaked and are publicly accessible.

The sender then attached a text file with 197 lines of email addresses and passwords belonging to users of Scott's pride and joy. The first lines looked like this (url:email:password):

Imagine the heart-in-mouth moment he had when first seeing that; had someone compromised his service? Was this the data of his customers who had entrusted it to him and it was now floating around the internet? Isn't he the guy who's meant to be teaching others about application security?! The email went on:

The impact of this vulnerability is severe, potentially resulting in:
Mass account takeovers by malicious actors.
Exposure of sensitive user data including names, emails, addresses, and documents.
Unauthorized transactions or malicious activities using compromised accounts.
Further compromise of organizational infrastructure through account abuse.
Financial and reputational damage due to security breaches.

Just to avoid any semblance of doubt as to the motive of the sender, the subject began by flagging the desire for a bug bounty (Report URI does not advertise a bounty program, but clearly a reward was being sought), followed by an email body stating it related to leaked Report URI credentials and then highlighted that "this vulnerability is severe". And then there's that last line about financial and reputational damage. It looked bad. However, cooler heads prevailed, and we started looking closer at the email addresses in the "breach" by checking them against Have I Been Pwned. Very quickly, a pattern emerged:

Most of the addresses we checked had appeared in the lists posted to Telegram I'd loaded into HIBP a couple of months ago. These were stealer logs, not a breach of Report URI! To validate that assertion, I pulled the original data source and parsed out every line containing "report-uri.com". Sure enough, the lines from the file sent to Scott were usually contained in the stealer log files. So, let's talk about how this works:

Take the URL you saw at the beginning of each line earlier on, the one being for the registration page. Here's what it looks like:

Now, imagine you're filling out this form and your machine is infected with malware that can observe the data entered into each field. It takes that data, "steals" it and logs it at the attacker's server, hence the term "info stealer logs". There is absolutely nothing Scott can do to prevent this; the user's machine is compromised, not Report URI.
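As a rough illustration of the `url:email:password` shape mentioned earlier, here is a minimal Python sketch that splits one such line into its three fields. The `StealerRow` and `parse_row` names are hypothetical, the sample row is made up, and the sketch assumes the password itself contains no colon:

```python
from typing import NamedTuple, Optional


class StealerRow(NamedTuple):
    url: str
    email: str
    password: str


def parse_row(line: str) -> Optional[StealerRow]:
    """Split one 'url:email:password' line into its three fields.

    The URL contains colons of its own (for example after 'https'), so the
    line is split from the right. Assumption: the password contains no
    colon; a row whose password does contain one would be mis-split.
    """
    parts = line.strip().rsplit(":", 2)
    if len(parts) != 3:
        return None
    url, email, password = parts
    if "@" not in email:
        # Not something that looks like an email address, so skip the row.
        return None
    return StealerRow(url, email, password)


# Made-up example row, not taken from any real log.
print(parse_row("https://example.com/register:alice@example.org:hunter2"))
```

Splitting from the right keeps the URL intact without having to know anything about its scheme or path.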

To illustrate the point, I grabbed the first email address in the file Scott was sent and pulled the rows just for that address rather than solely the Report URI rows. This would show us all the other services this person's credentials were snared from, and there were dozens. Here are just the first ten:

Google. Apple. Twitter. Most with the same password too, because a normal person obviously owns this email address. So, has each of these organisations also received a beg bounty? No, that's not a typo, this is classic behaviour where unsophisticated and self-proclaimed "security researchers" use automated tooling to identify largely benign security configurations that could be construed as vulnerabilities. For example, they'll send through a report that an SPF record is too permissive (they probably can't even spell "SPF", let alone understand the nuances of sender policies), then try to shake people like Scott down for money under the guise of a "bug bounty". This isn't Scott's problem, nor is it Google's or Apple's or Twitter's; it's something only the malware-infected victims can address.

In this post, I referred to "most" of the addresses already being in HIBP and the lines from the file he was sent "usually" occurring in the logs I had. But there were gaps. For example, whilst there were 197 rows in "his" file, I only found 161 in the data I'd previously loaded. But I had a hunch on how to fill that gap and make up the difference...
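The cross-check described above boils down to a set comparison: take the rows in the file that was sent, take the rows in the existing corpus that mention the domain, and see which ones overlap. The following Python sketch shows that idea under stated assumptions: the function names are hypothetical, rows are compared as exact strings, and the sample data is made up rather than taken from the real files.

```python
from typing import Iterable, Set, Tuple


def rows_for_domain(lines: Iterable[str], domain: str) -> Set[str]:
    """Return every line that mentions the domain (case-insensitive)."""
    needle = domain.lower()
    return {line.strip() for line in lines if needle in line.lower()}


def compare(sent: Iterable[str], corpus: Iterable[str], domain: str) -> Tuple[Set[str], Set[str]]:
    """Split the rows that were sent into (found in corpus, not found)."""
    sent_rows = {line.strip() for line in sent if line.strip()}
    known_rows = rows_for_domain(corpus, domain)
    return sent_rows & known_rows, sent_rows - known_rows


# Made-up sample data; none of it comes from the real files.
corpus = [
    "https://example.com/register:alice@example.org:pw1",
    "https://example.com/login:bob@example.org:pw2",
    "https://other.example/login:alice@example.org:pw1",
]
sent = [
    "https://example.com/register:alice@example.org:pw1",
    "https://example.com/login:bob@example.org:pw2",
    "https://example.com/login:carol@example.org:pw3",
]

found, missing = compare(sent, corpus, "example.com")
print(len(found), "found,", len(missing), "missing")
```

The rows that land in `missing` are the gap: they appear in the file that was sent but not in the data already loaded.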

Two weeks ago, I was sent a further 22GB of stealer logs found in Telegram channels. Unlike the previous corpus of data, this set contained only stealer logs (no credential stuffing lists) and had a total of 26,105,473 unique email addresses. That's significant, as it implies that every single one of those addresses belongs to someone infected with malware that's stealing their creds. Of the total count, 89.7% had been seen in previous data breaches already in HIBP which is a high crossover, but it also meant that 2,679,550 addresses were all new. I'd been considering whether or not it made sense to load this data given corpuses such as this create frustration when people don't know which site their record was snared from nor which password was impacted. One particular frustration you'll read in comments on the previous post was that people weren't sure whether their email address was in a stealer log or a credential stuffing list; did they have a machine infected with malware or was it merely recycled credentials from an old data breach? But given the way in which this new corpus of data is being used (to attempt to scam Scott and, one would assume, many others), the 7-figure number of previously unseen addresses and the fact that this time, they can all emphatically be tied back to malware campaigns, this is now searchable in HIBP as "Stealer Logs Posted to Telegram".
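The figures in that paragraph can be checked with a few lines of arithmetic. This Python sketch assumes the 89.7% figure is a rounded share of the exact counts quoted above:

```python
total_addresses = 26_105_473  # unique email addresses in the new corpus
new_addresses = 2_679_550     # addresses not previously in HIBP

previously_seen = total_addresses - new_addresses
seen_pct = 100 * previously_seen / total_addresses
new_pct = 100 * new_addresses / total_addresses

print(f"{previously_seen:,} previously seen ({seen_pct:.1f}%)")
print(f"{new_addresses:,} new ({new_pct:.1f}%)")
```

Rounded to one decimal place, the previously seen share comes out at 89.7%, which matches the crossover quoted above and leaves roughly one address in ten as new.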

Ultimately, this is just scam on top of scam: the victims in the logs have had their credentials scammed, and the person who emailed Scott attempted to use that to scam him out of a bounty. Making data like this searchable in HIBP helps people do exactly what I did as soon as Scott forwarded me the email: validate the origin and, as Scott will now do, send a terse reply encouraging the guy to show some decency and stop with the beg bounties.

Lastly, I'm increasingly conscious of how useful the information contained in stealer logs is to organisations like Report URI, and after loading that previous corpus posted from Telegram, I did help out a few companies who thought they might have been hit by it. The position they were coming from was "we keep seeing account takeovers by what looks like credential stuffing attacks, but the attackers are getting the credentials right on the first go". When I pulled the data for their domain as I later did for Report URI, the email addresses were precisely the ones being targeted for account takeover. I want to address this via HIBP, but it's non-trivial for a variety of reasons, especially those related to privacy. In order for this data to be useful to companies like Report URI, I'd need to give them other people's email addresses (the password wouldn't be necessary) based on the assumption they were customers of the service. I'm working out how to do this in a way that makes sense for everyone (well, everyone except for the bad guys), stay tuned for more and please do chime in via the comments if you have ideas on how to turn this into a useful service.

Telegram Combolists and 361M Email Addresses

Last week, a security researcher sent me 122GB of data scraped out of thousands of Telegram channels. It contained 1.7k files with 2B lines and 361M unique email addresses of which 151M had never been seen in HIBP before. Alongside those addresses were passwords and, in many cases, the website the data pertains to. I've loaded it into Have I Been Pwned (HIBP) today because there's a huge amount of previously unseen email addresses and based on all the checks I've done, it's legitimate data. That's the high-level overview, now here are the details:

Telegram is a popular messaging platform that makes it easy to stand up a "channel" and share information to those who wish to visit it. As Telegram describes the service, it's simple, private and secure and as such, has become very popular with those wishing to share content anonymously, including content related to data breaches. Many of the breaches I've previously loaded into HIBP have been distributed via Telegram as it's simple to publish this class of data to the platform. Here's what data posted to Telegram often looks like:

These are referred to as "combolists", that is they're combinations of email addresses or usernames and passwords. The combination of these is obviously what's used to authenticate to various services, and we often see attackers using these to mount "credential stuffing" attacks where they use the lists to attempt to access accounts en masse. The list above is simply breaking the combos into their respective email service providers. For example, that last Gmail example contains over a quarter of a million rows like this:

That's only one of many files across many different Telegram channels. The data that was sent to me last week was sourced from 518 different channels and amounted to 1,748 separate files similar to the one above. Some of the files have literally no data (0kb), others are many gigabytes with many tens of millions of rows. For example, the largest file starts like this:

That looks very much like the result of info stealer malware that has obtained credentials as they were entered into websites on compromised machines. For example, the first record appears to have been snared when someone attempted to login to Nike. There's an easy way to get a sense of the accuracy of this data, just head over to the Nike homepage and click the login link which presents the following screen:

They serve the same page to both existing subscribers and new ones but then serve different pages depending on whether the email address already has an account (a classic enumeration vector). Mash the keyboard to create a fake email address and you'll be shown a registration form, but enter the address in the stealer log and, well, you get something different:

The email address has an account, hence the prompt for a password. I'm not going to test the password because that would constitute unauthorised access, but I also don't need to as the goal has already been achieved: I've demonstrated that the address has an account on Nike. (Also note that if the password didn't work it wouldn't necessarily mean it wasn't valid at some point in time in the past, it would simply mean it isn't valid now.)
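To make the enumeration vector concrete, here is a hypothetical Python sketch of the two behaviours. It is not how Nike or any other site mentioned here is implemented; the `REGISTERED` set, the function names and the page names are all invented for illustration.

```python
# Hypothetical account store; a real service would query a database.
REGISTERED = {"alice@example.org"}


def leaky_next_step(email: str) -> str:
    """Chooses the next page based on whether the account exists.

    Anyone can tell the two outcomes apart, so the response confirms
    whether the address is registered.
    """
    if email.lower() in REGISTERED:
        return "password_prompt"
    return "registration_form"


def uniform_reset_response(email: str) -> str:
    """Returns the same message whether or not the account exists.

    Any follow-up (such as a reset link) would go to the mailbox itself,
    so the web response reveals nothing about registration.
    """
    return "If that address has an account, we've sent it an email."


for address in ("alice@example.org", "asdfjkl@example.org"):
    print(address, "->", leaky_next_step(address), "|", uniform_reset_response(address))
```

The first function behaves like the login flow described above, where a different page gives the game away; the second shows the kind of identical response that avoids disclosing whether an address is registered.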

Footlocker tries to be a bit more clever in avoiding enumeration on password reset, but they'll happily tell you via the registration page if the email address you've entered already exists:

Even the Italian tyre retailer happily confirmed the existence of the tested account:

Time and time again, each service I tested confirmed the presence of the email address in the stealer log. But are (or were) the passwords correct? Again, I'm not going to test those myself, but I have nearly 5M subscribers in HIBP and there's always a handful of them in any new breach that are happy to help out. So, I emailed some of the most recent ones, asked if they could help with verification and upon confirmation, sent them their data.

In reaching out to existing subscribers, I expected some repetition in terms of them already appearing in existing data breaches. For one person already in 13 different breaches in HIBP, this was their response:

Thanks Troy. These details were leaked in previous data breaches.

So accurate, but not new, and several of the breaches for this one were of a similar structure to the one we're talking about today in terms of them being combolists used for credential stuffing attacks. Same with another subscriber who was in 7 prior breaches:

Yes that’s familiar. Most likely would have used those credentials on the previous data breaches. 

That one was more interesting as of the 7 prior breaches, only 6 had passwords exposed and none of them were combolists. Instead, it was incidents including MyFitnessPal, 8fit, FlexBooker, Jefit, MyHeritage and ShopBack; have passwords been cracked out of those (most were hashed) and used to create new lists? Very possibly. (Sidenote: this unfortunate person is obviously a bit of a fitness buff and has managed to end up in 3 different "fit" breaches.)

Another subscriber had an entry in the following format, similar to what we saw earlier on in the stealer log:

https://accounts.epicgames.com/login:[email]:[password]

They responded to my queries with the following:

I think that epic games account was for my daughter a couple of years ago but I cancelled it last year from memory. That sounds like a password she may have chosen so I'll check with her in an hour or two when I see her again. 

And then, a little bit later:

My daughter doesn't remember if that was her password as it was 4-5 years ago when she was only 8-9 years old. However it does sound like something she would have chosen so in all probability, I would say that is a legitimate link. We believe it was used when she played a game called Fortnite which she did infrequently at that time hence her memory is sketchy. 

I realised that whilst each of these responses confirmed the legitimacy of the data, they really weren't giving me much insight into the factor that made it worth loading into HIBP: the unseen addresses. So, I went through the same process of contacting HIBP subscribers again but this time, only the ones that I'd never seen in a breach before. This would then rule out all the repurposed prior incidents and give me a much better idea of how impactful this data really was. And that's when things got really interesting.

Let's start with the most interesting one and what you're about to see is two hundred rows of stealer logs:

https://steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
https://www.disneyplus.com/de-de/reset-password:[email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
https://www.tink.de/checkout/login:[email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
https://vrr-db-ticketshop.de/authentication/login:[email]:[password]
https://www.planet-sports.de/checkout/register:[email]:[password]
https://www.bstn.com/eu_de/checkout/:[email]:[password]
https://www.lico-nature.de/index.php:[email]:[password]
https://ticketshop.mobil.nrw/authentication/register:[email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
https://www.zurbrueggen.de/checkout/register:[email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
https://www.bluemovement.com/de-de/checkout2:[email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/:[email]:[password]
https://members.persil-service.de/login/:[email]:[password]
https://www.nicotel.de/index.php:[email]:[password]
https://www.hellofresh.de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
https://signup.sipgateteam.de/:[email]:[password]
https://www.baur.de/kasse/registrieren:[email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
https://www.qvc.de/checkout/your-information.html:[email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
https://www.shop-apotheke.com/nx/login/:[email]:[password]
https://druckmittel.de/checkout/confirm:[email]:[password]
https://www.global-carpet.de/checkout/confirm:[email]:[password]
https://software-hero.de/checkout/confirm:[email]:[password]
https://myenergykey.com/login:[email]:[password]
https://www.sixt.de/:[email]:[password]
https://www.wlan-shop24.de/Bestellvorgang:[email]:[password]
https://www.cyberport.de/checkout/anmelden.html:[email]:[password]
https://waschmal.de/registerCustomer:[email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
https://www.persil-service.de/signup:[email]:[password]
https://nicotel.de/:[email]:[password]
https://temial.vorwerk.de/register/checkout:[email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
https://www.petsdeli.de/login:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
https://v3.account.samsung.com/iam/passwords/register:[email]:[password]
https://www.amazon.pl/ap/signin:[email]:[password]
https://www.amazon.de/:[email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://steuer.check24.de/customer-center/aff/check24/authentication [email]:[password]
https://www.disneyplus.com/de-de/reset-password [email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth [email]:[password]
https://www.tink.de/checkout/login [email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll [email]:[password]
https://vrr-db-ticketshop.de/authentication/login [email]:[password]
https://www.planet-sports.de/checkout/register [email]:[password]
https://www.bstn.com/eu_de/checkout/ [email]:[password]
https://www.lico-nature.de/index.php [email]:[password]
https://ticketshop.mobil.nrw/authentication/register [email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer [email]:[password]
https://www.zurbrueggen.de/checkout/register [email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile [email]:[password]
https://www.bluemovement.com/de-de/checkout2 [email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/[email]:[password]
https://members.persil-service.de/login/ [email]:[password]
https://www.nicotel.de/index.php [email]:[password]
https://www.hellofresh.de/login [email]:[password]
https://login.live.com/login.srf [email]:[password]
https://accounts.login.idm.telekom.com/factorx [email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register [email]:[password]
https://signup.sipgateteam.de/ [email]:[password]
https://www.baur.de/kasse/registrieren [email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3 [email]:[password]
https://www.qvc.de/checkout/your-information.html [email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details [email]:[password]
https://www.shop-apotheke.com/nx/login/ [email]:[password]
https://druckmittel.de/checkout/confirm [email]:[password]
https://www.global-carpet.de/checkout/confirm [email]:[password]
https://software-hero.de/checkout/confirm [email]:[password]
https://myenergykey.com/login [email]:[password]
https://www.sixt.de/ [email]:[password]
https://www.wlan-shop24.de/Bestellvorgang [email]:[password]
https://www.cyberport.de/checkout/anmelden.html [email]:[password]
https://waschmal.de/registerCustomer [email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped [email]:[password]
https://www.persil-service.de/signup [email]:[password]
https://nicotel.de/ [email]:[password]
https://temial.vorwerk.de/register/checkout [email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action [email]:[password].
https://www.petsdeli.de/login [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://login.live.com/login.srf [email]:[password]
https://accounts.login.idm.telekom.com/factorx [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten [email]:[password]
https://v3.account.samsung.com/iam/passwords/register [email]:[password]
https://www.amazon.pl/ap/signin [email]:[password]
https://www.amazon.de/ [email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
https://www.disneyplus.com/de-de/reset-password:[email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
https://www.tink.de/checkout/login:[email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
https://vrr-db-ticketshop.de/authentication/login:[email]:[password]
https://www.planet-sports.de/checkout/register:[email]:[password]
https://www.bstn.com/eu_de/checkout/:[email]:[password]
https://www.lico-nature.de/index.php:[email]:[password]
https://ticketshop.mobil.nrw/authentication/register:[email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
https://www.zurbrueggen.de/checkout/register:[email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
https://www.bluemovement.com/de-de/checkout2:[email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/:[email]:[password]
https://members.persil-service.de/login/:[email]:[password]
https://www.nicotel.de/index.php:[email]:[password]
https://www.hellofresh.de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
https://signup.sipgateteam.de/:[email]:[password]
https://www.baur.de/kasse/registrieren:[email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
https://www.qvc.de/checkout/your-information.html:[email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
https://www.shop-apotheke.com/nx/login/:[email]:[password]
https://druckmittel.de/checkout/confirm:[email]:[password]
https://www.global-carpet.de/checkout/confirm:[email]:[password]
https://software-hero.de/checkout/confirm:[email]:[password]
https://myenergykey.com/login:[email]:[password]
https://www.sixt.de/:[email]:[password]
https://www.wlan-shop24.de/Bestellvorgang:[email]:[password]
https://www.cyberport.de/checkout/anmelden.html:[email]:[password]
https://waschmal.de/registerCustomer:[email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
https://www.persil-service.de/signup:[email]:[password]
https://nicotel.de/:[email]:[password]
https://temial.vorwerk.de/register/checkout:[email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
https://www.petsdeli.de/login:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
https://v3.account.samsung.com/iam/passwords/register:[email]:[password]
https://www.amazon.pl/ap/signin:[email]:[password]
https://www.amazon.de/:[email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
www.disneyplus.com/de-de/reset-password:[email]:[password]
auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
www.tink.de/checkout/login:[email]:[password]
signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
vrr-db-ticketshop.de/authentication/login:[email]:[password]
www.planet-sports.de/checkout/register:[email]:[password]
www.bstn.com/eu_de/checkout/:[email]:[password]
www.lico-nature.de/index.php:[email]:[password]
ticketshop.mobil.nrw/authentication/register:[email]:[password]
softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
www.zurbrueggen.de/checkout/register:[email]:[password]
www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
www.bluemovement.com/de-de/checkout2:[email]:[password]
members.persil-service.de/login/:[email]:[password]
www.nicotel.de/index.php:[email]:[password]
www.hellofresh.de/login:[email]:[password]
login.live.com/login.srf:[email]:[password]
accounts.login.idm.telekom.com/factorx:[email]:[password]
grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
signup.sipgateteam.de/:[email]:[password]
www.baur.de/kasse/registrieren:[email]:[password]
buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
www.qvc.de/checkout/your-information.html:[email]:[password]
de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
www.shop-apotheke.com/nx/login/:[email]:[password]
druckmittel.de/checkout/confirm:[email]:[password]
www.global-carpet.de/checkout/confirm:[email]:[password]
software-hero.de/checkout/confirm:[email]:[password]
myenergykey.com/login:[email]:[password]
www.sixt.de/:[email]:[password]
www.wlan-shop24.de/Bestellvorgang:[email]:[password]
www.cyberport.de/checkout/anmelden.html:[email]:[password]
waschmal.de/registerCustomer:[email]:[password]
www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
www.persil-service.de/signup:[email]:[password]
nicotel.de/:[email]:[password]
temial.vorwerk.de/register/checkout:[email]:[password]
accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
www.petsdeli.de/login:[email]:[password]
login.live.com/login.srf:[email]:[password]
accounts.login.idm.telekom.com/factorx:[email]:[password]
www.netflix.com/de/login:[email]:[password]
www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
v3.account.samsung.com/iam/passwords/register:[email]:[password]
www.amazon.pl/ap/signin:[email]:[password]
www.amazon.de/:[email]:[password]
meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
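The rows above arrive in a few different shapes: some separate the location from the credentials with a colon, some with a space, some drop the `https://` scheme, and some use an `android://` app identifier. The following Python sketch shows one way rows like these could be normalised and tallied per host. The regular expression, the function names and the sample rows (with made-up credentials in place of the `[email]` and `[password]` placeholders) are assumptions for illustration; rows that don't fit the pattern are simply skipped.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple
from urllib.parse import urlsplit

# location, then ':' or ' ', then something shaped like user@domain, then ':password'
ROW = re.compile(
    r"^(?P<location>\S+?)[: ](?P<email>[^\s:@/]+@[^\s:@/]+):(?P<password>.*)$"
)


def parse_row(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (location, email, password), or None if the row doesn't fit."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return match["location"], match["email"], match["password"]


def host_of(location: str) -> str:
    """Extract the host; scheme-less rows are assumed to be web URLs."""
    if "://" not in location:
        location = "https://" + location
    return (urlsplit(location).hostname or "").lower()


def tally_hosts(lines: Iterable[str]) -> Counter:
    """Count how many parseable rows there are per host."""
    counts: Counter = Counter()
    for line in lines:
        parsed = parse_row(line)
        if parsed is not None:
            counts[host_of(parsed[0])] += 1
    return counts


# Made-up rows covering the shapes seen above.
rows = [
    "https://www.netflix.com/de/login:alice@example.org:pw1",
    "https://www.netflix.com/de/login alice@example.org:pw1",
    "www.netflix.com/de/login:alice@example.org:pw2",
    "android://abc==@com.sixt.reservation/:alice@example.org:pw3",
    "not a credential row",
]
print(tally_hosts(rows).most_common())
```

A tally like this makes duplication across rows easy to spot, since repeated captures from the same site collapse into a single count per host.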

Even without seeing the email address and password, the commonality is clear: German websites. Whilst the email address is common, the passwords are not... at least not always. In 168 instances they were near identical with only a handful of them deviating by a character or two. There's some duplication across the lines (9 different rows of Netflix, 4 of Disney Plus, etc), but clearly this remains a significant volume of data. But is it real? Let's find out:

The data seems accurate so far. I have already changed some of the passwords as I was notified by the provider that my account was hacked. It is strange that the Telekom password was already generated and should not be guessable. I store my passwords in Firefox, so is it possible that they were stolen from there?

It's legit. Stealer malware explains both the Telekom password and why passwords in Firefox were obtained; there's not necessarily anything wrong with either service, but if a machine is infected with software that can grab passwords straight out of the fields they've been entered into in the browser, it's game over.

We started having some to-and-fro as I gathered more info, especially as it related to the timeframe:

It started about a month ago, maximum 6 weeks. I use a Macbook and an iPhone, only a Windows PC at work, maybe it happened there? About a week ago there was an extreme spam attack on my Gmail account, and several expensive items were ordered with my accounts in the same period, which fortunately could be canceled.

We had the usual discussion about password managers and of course before that, tracking down which device is infected and siphoning off secrets. This was obviously distressing for her to see all her accounts laid out like this, not to mention learning that they were being exchanged in channels frequented by criminals. But from the perspective of verifying both the legitimacy and uniqueness of the data (not to mention the freshness), this was an enormously valuable exchange.

Next up was another subscriber who'd previously dodged all the data breaches in HIBP yet somehow managed to end up with 53 rows of data in the corpus:

[email]:Gru[redacted password]
[email]:fux[redacted password]
[email]:zWi[redacted password]
[email]:6ii[redacted password]
[email]:qTM[redacted password]
[email]:Pre[redacted password]
[email]:i8$[redacted password]
[email]:9cr[redacted password]
[email]:fuc[redacted password]
[email]:kuM[redacted password]
[email]:Fuc[redacted password]
[email]:Pre[redacted password]
[email]:Vxt[redacted password]
[email]:%3r[redacted password]
[email]:But[redacted password]
[email]:1qH[redacted password]
[email]:^VS[redacted password]
[email]:But[redacted password]
[email]:Nbs[redacted password]
[email]:*W2[redacted password]
[email]:$aM[redacted password]
[email]:DA^[redacted password]
[email]:vPE[redacted password]
[email]:Z8u[redacted password]
[email]:But[redacted password]
[email]:aXi[redacted password]
[email]:rPe[redacted password]
[email]:b4F[redacted password]
[email]:2u&[redacted password]
[email]:5%f[redacted password]
[email]:Lmt[redacted password]
[email]:p
[email]:Tem[redacted password]
[email]:fuc[redacted password]
[email]:*e@[redacted password]
[email]:(k+[redacted password]
[email]:Ste[redacted password]
[email]:^@f[redacted password]
[email]:XT$[redacted password]
[email]:25@[redacted password]
[email]:Jav[redacted password]
[email]:U8![redacted password]
[email]:LsZ[redacted password]
[email]:But[redacted password]
[email]:g$V[redacted password]
[email]:M9@[redacted password]
[email]:!6D[redacted password]
[email]:Fac[redacted password]
[email]:but[redacted password]
[email]:Why[redacted password]
[email]:h45[redacted password]
[email]:blo[redacted password]
[email]:azT[redacted password]

I've redacted everything after the first three characters of the password so you can get a sense of the breadth of different ones here. In this instance, there was no accompanying website, but the data checked out:

Oh damn a lot of those do seem pretty accurate. Some are quite old and outdated too. I tend to use that gmail account for inconsequential shit so I'm not too fussed, but I'll defintely get stuck in and change all those passwords ASAP. This actually explains a lot because I've noticed some pretty suspicious activity with a couple of different accounts lately.
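As a rough illustration of the redaction used in the listing above, here is a minimal Python sketch. The function name `redact_line`, the `keep` parameter and the sample row are assumptions made up for this example; they are not the tooling actually used to prepare the data.

```python
def redact_line(line: str, keep: int = 3) -> str:
    """Keep the email and the first `keep` password characters, mask the rest."""
    # Split on the first colon only: the email part won't contain one,
    # but the password might.
    email, sep, password = line.partition(":")
    if not sep:
        raise ValueError("expected an 'email:password' pair")
    return f"{email}:{password[:keep]}[redacted password]"


# Hypothetical sample row, not taken from the real data
print(redact_line("someone@example.com:Gru-not-a-real-password"))
```

Keeping only a short prefix is enough to show the variety of passwords without republishing any of them.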

Another with 35 records of website, email and password triplets responded as follows (I'll stop pasting in the source data, you know what that looks like by now):

Thank you very much for the information, although I already knew about this (I think it was due to a breach in LastPass) and I already changed the passwords, your information is much more complete and clear. It helped me find some pages where I haven't changed the password.

The final one of note really struck a chord with me, not because of the thirteen rows of records similar to the ones above, but because of what he told me in his reply:

Thank you for your kindness. Most of these I have been able to change the passwords of and they do look familiar. The passwords on there have been changed. Is there a way we both can fix this problem as seeing I am only 14?

That's my son's age and predictably, all the websites listed were gaming sites. The kid had obviously installed something nasty and had signed up to HIBP notifications only a week earlier. He explained he'd recently received an email attempting to extort him for $1.3k worth of Bitcoin and shared the message. It was clearly a mass-mailed, indiscriminate shakedown and I advised him that it in no way targeted him directly. Concerned, he countered with a second extortion email he'd received, this time it was your classic "we caught you watching porn and masturbating" scam, and this one really had him worried:

I have been stressed and scared about these scams (even though I shouldn’t be). I have been very stressed and scared today because of another one of those emails.

Imagine being a young teenage boy and receiving that?! That's the sort of thing criminals frequenting Telegram channels such as the ones in question are using this data for, and it's reprehensible. I gave him some tips (I see the sorts of things my son's friends randomly install!) and hopefully, that'll set him on the right course.

Those were the most noteworthy responses; the others, often just a single email address and password pair, simply reinforced the same message:

Yes, this is an old password that I have used in the past, and matches the password of my accounts that had been logged into recently.

And:

Yes that password is familiar and accurate. I used to practice password re-use with this password across many services 5+ years ago. This makes it impossible to correlate it to a particular service or breach. It is known to me to be out there already, I've received crypto extortion emails containing it.

I know that many people who find themselves in this incident will be confused; which breach is it? I've never used Telegram before, why am I there? Those questions came through during my verification process and I know from loading previous similar breaches, they'll come up over and over again in the coming days and I hope that the overview above sufficiently answers these.

The questions that are harder to answer (and again, I know these will come up based on prior experience), are what the password is that was exposed, what the website it appeared next to was and, indeed, if it appeared next to a website at all or just alongside an email address. Right at the beginning of this project more than a decade ago, I made the decision not to load the data that would answer these questions due to the risk it posed to individuals and by extension, the risk to my ability to continue running HIBP. We were reminded of how important this decision was earlier in the year when a service aggregating data breaches left the whole thing exposed and put everyone in there at even more risk.

So, if you're in here, what do you do? It's a repeat of the same old advice we've been giving in this industry for decades now, namely keeping devices patched and updated, running security software appropriate for your device (I use Microsoft Defender on my PCs), using strong and unique passwords (get a password manager!) and enabling 2FA wherever possible. Each HIBP subscriber I contacted wasn't doing at least one of these things, which was evident in their password selection. Time and time again, passwords consisted of highly predictable patterns and often included their name, year of birth (I assume) and common character substitutions, usually within a dozen characters of length too. It's the absolute basics that are going wrong here.

To the point one of the HIBP subscribers made above, loading this data will help many people explain why they've been seeing unusual behaviour on their accounts. It's also the wakeup call to lift everyone's security game per the previous paragraph. But this also isn't the end of it, and more combolists have been posted in more Telegram channels since loading this incident. Whilst I'm still of the view from years ago that I'm not going to continuously load endless lists, I do hope people recognise that their security posture is an ongoing concern and not just something you think about after appearing in a breach.

The data is now searchable in Have I Been Pwned.

Presently sponsored by: Report URI: Guarding you from rogue JavaScript! Don’t get pwned; get real-time alerts & prevent breaches #SecureYourSite

Operation Endgame

Today we loaded 16.5M email addresses and 13.5M unique passwords provided by law enforcement agencies into Have I Been Pwned (HIBP) following botnet takedowns in a campaign they've coined Operation Endgame. That link provides an excellent overview so start there then come back to this blog post which adds some insight into the data and explains how HIBP fits into the picture.

Since 2013 when I kicked off HIBP as a pet project, it has become an increasingly important part of the security posture of individuals, organisations, governments and law enforcement agencies. Gradually and organically, it has found a fit where it's able to provide a useful service to the good guys after the bad guys have done evil cyber things. The phrase I've been fond of this last decade is that HIBP is there to do good things with data after bad things happen. The reputation and reach the service has gained in this time has led to partnerships such as the one you're reading about here today. So, with that in mind, let's get into the mechanics of the data:

In terms of the email addresses, there were 16.5M in total with 4.5M of them not having been seen in previous data breaches already in HIBP. We found 25k of our own individual subscribers in the corpus of data, plus another 20k domain subscribers, who are usually organisations monitoring the exposure of their customers (all of these subscribers have now been sent notification emails). As the data was provided to us by law enforcement for the public good, the breach is flagged as subscription free which means any organisation that can prove control of the domain can search it irrespective of the subscription model we launched for large domains in August last year.

The only data we've been provided with is email addresses and disassociated password hashes, that is they don't appear alongside a corresponding address. This is the bare minimum we need to make that data searchable and useful to those impacted. So, let's talk about those standalone passwords:

There are 13.5 million unique passwords of which 8.9M were already in Pwned Passwords. Those passwords have had their prevalence counts updated accordingly (we received counts for each password with many appearing in the takedown multiple times over), so if you're using Pwned Passwords already, you'll see new numbers next to some entries. That also means there are 4.6M passwords we've never seen before which you can freely download using our open source tool. Or even better, if you're querying Pwned Passwords on demand you don't need to do anything as the new entries are automatically added to the result set. All this is made possible by feeding the data into the law enforcement pipeline we built for the FBI and NCA a few years ago.

A quick geek-out moment on Pwned Passwords: at present, we're serving almost 8 billion requests per month to this service:

Taking just last week as an example, we're a rounding error off 100% of requests being served directly from Cloudflare's cache:

That's over 99.99% of all requests during that period that were served from one of Cloudflare's edge nodes that sit in 320 cities globally. What that means for consumers of the service is massively fast response times due to the low latency of serving content from a nearby location and huge confidence in availability as there's only about a one-in-ten-thousand chance of the request being served by our origin service. If you'd like to know more about how we achieved this, check out my post from a year ago on using Cloudflare Cache Reserve.
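To put those two figures side by side, here is a back-of-the-envelope calculation in Python. It assumes a round 8 billion requests per month and a cache hit ratio of exactly 99.99%; the real numbers are "almost" and "over" those values respectively, so treat the result as a rough upper bound. The function name is made up for this sketch.

```python
def origin_requests(total_requests: int, cache_hit_ratio: float) -> int:
    """Estimate how many requests miss the cache and reach the origin."""
    return round(total_requests * (1 - cache_hit_ratio))


monthly_requests = 8_000_000_000  # assumed round figure
print(origin_requests(monthly_requests, 0.9999))
```

Under those assumptions, roughly 800,000 of the 8 billion monthly requests would reach the origin, with everything else answered from the edge.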

After pushing out the new passwords today, all but 5 hash prefixes were modified (read more about how we use hashes to enable anonymous password searches) so we did a complete Cloudflare cache flush. By the time you read this, almost the entire 16^5 possible hash ranges have been completely repopulated into cache due to the volume of requests the service receives:

Lastly, when we talk about passwords in HIBP, the inputs we receive from law enforcement consist of 3 parts:

  1. A SHA-1 hash
  2. An NTLM hash
  3. A count of how many times the password appears

The rationale for this is explained in the links above but in a nutshell, the SHA-1 format ensures any badly parsed data that may inadvertently include PII is protected and it aligns with the underlying data structure that drives the k-anonymity searches. We have NTLM hashes as well because many orgs use them to check passwords in their own Active Directory instances.
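For readers who haven't seen the k-anonymity model before, the client side can be sketched in a few lines of Python. This is a minimal illustration, not the HIBP implementation: the function names are made up, no network call is made, and the sample response (including its counts) is fabricated purely to show the `SUFFIX:COUNT` line format returned for a 5-character hash prefix.

```python
import hashlib

PREFIX_LENGTH = 5  # hex characters of the SHA-1 hash sent to the service
TOTAL_RANGES = 16 ** PREFIX_LENGTH  # 1,048,576 possible hash ranges


def split_hash(password: str) -> tuple[str, str]:
    """Return the (prefix, suffix) of the upper-case SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:PREFIX_LENGTH], digest[PREFIX_LENGTH:]


def count_in_range(suffix: str, range_response: str) -> int:
    """Look up a hash suffix in a 'SUFFIX:COUNT' response body."""
    for line in range_response.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0  # suffix not present in this range


prefix, suffix = split_hash("password")
# Fabricated response body for illustration only
sample_response = f"0018A45C4D1DEF81644B54AB7F969B88D65:3\n{suffix}:42\n"
print(prefix, count_in_range(suffix, sample_response))
```

Only the prefix ever leaves the client; the match against the suffix happens locally, which is what keeps the search anonymous and also what makes each of the 16^5 ranges so cacheable.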

So, what can you do if you find your data in this incident? It's a similar story to the Emotet malware data provided by the FBI and NHTCU a few years ago in that the sage old advice applies: get a password manager and make them all strong and unique, turn on 2FA everywhere, keep machines patched, etc. If you find your password in the data (the HIBP password search feature anonymises it before searching, or password managers like 1Password can scan all of your passwords in one go), obviously change it everywhere you've used it.

This operation will be significant in terms of the impact on cybercrime, and I'm glad we've been able to put this little project to good use by supporting our friends in law enforcement who are doing their best to support all of us as online citizens.

Have I Been Pwned Employee 1.0: Stefán Jökull Sigurðarson

We often do that in this industry, the whole "1.0" thing, but it seems apt here. I started Have I Been Pwned (HIBP) in 2013 as a pet project that scratched an itch, so I never really thought of myself as an "employee". Over time, it grew (and I tell you what, nobody is more surprised by that than me!) and over the last few years, my wife Charlotte got more and more involved. Technically, we're both employees and we work on HIBP things but we're like, well, beta versions.

Today, I'm very happy to announce our first full-time, production-ready employee: Stefán Jökull Sigurðarson. This is both a massive commitment on Charlotte's and my part and a leap of faith on Stefán's and deserves some background:

I suffer somewhat from what I'll call the "founder's paradox", that is I find myself having built something genuinely useful and wanting to see it grow and mature yet also not wanting to let go. I want to be involved in everything, but I also want to go on holidays sometimes and tune out. I like making decisions on every aspect of how the service runs, but I want it to outlive me. Bringing any outside party into any business can be hard to come to terms with, but especially in the case of HIBP where it's become so critical to so many people and deals with so much sensitive data. Which is why I have to trust people like Stefán because if I don't, I'm one shark / snake / croc incident away from disappointing a lot of people.

Trust is the cornerstone of why Stefán is joining us now. Not just trust in his technical skills, but trust in him as a person. I've known Stefán for many years now, initially when he came to one of my Hack Yourself First workshops in Oslo back in 2018, then as a blogger writing about how he was implementing Pwned Passwords at EVE Online, then as a conference speaker himself, a Microsoft MVP, and in 2021, as the person who selflessly gave up his own time to support the open source Pwned Passwords. What we never made any formal announcements about is that we did hire Stefán on a part-time basis beginning earlier last year to help out with the coding when he had free cycles amidst his full-time work. That went great and he obviously enjoyed working at HIBP so earlier this month, Stefán handed in his resignation and will shortly be a full-time employee.

I'm really happy with the timing of this and how it's all worked out. We're in a position to make the financial commitment largely because of finally putting a price on searches for large domains last year. What this has allowed us to do is shift money from companies who see value in the service (more than half the Fortune 500 use the domain search feature), and reinvest it into making HIBP more sustainable. Getting Stefán onboard is the manifestation of that investment and you'll very shortly see his work begin to translate into highly visible new features. But what you won't see is the stuff that's even more important, especially as it relates to running a more sustainable service that no longer has me as a single point of failure.

So, welcome Stefán, and thank you for your commitment 😊

Oh - just one more thing: I was looking around for a great hero image for this blog post and I found this awesome video of Stefán swimming through a semi-frozen Norwegian fjord before riding an iceberg. For real, this is perhaps the most Nordic thing I've ever seen (Stefán being from Iceland and all), but unfortunately videos don't really lend themselves to hero images, so I went with a stylised AI-generated rendition of the event.

Inside the Massive Alleged AT&T Data Breach

I hate having to use that word - "alleged" - because it's so inconclusive and I know it will leave people with many unanswered questions. But sometimes, "alleged" is just where we need to begin and over the course of time, proper attribution is made and the dots are joined. We're here at "alleged" for two very simple reasons: one is that AT&T is saying "the data didn't come from us", and the other is that I have no way of proving otherwise. But I have proven, with sufficient confidence, that the data is real and the impact is significant. Let me explain:

Firstly, just as a primer if you're new to this story, read BleepingComputer's piece on the incident. What it boils down to is in August 2021, someone with a proven history of breaching large organisations posted what they claimed were 70 million AT&T records to a popular hacking forum and asked for a very large amount of money should anyone wish to purchase the data. From that story:

From the samples shared by the threat actor, the database contains customers' names, addresses, phone numbers, Social Security numbers, and date of birth.

Fast forward two and a half years and the successor to this forum saw a post this week alleging to contain the entire corpus of data. Except that rather than put it up for sale, someone has decided to just dump it all publicly and make it easily accessible to the masses. This isn't unusual: "fresh" data has much greater commercial value and is often tightly held for a long period before being released into the public domain. The Dropbox and LinkedIn breaches, for example, occurred in 2012 before being broadly distributed in 2016 and just like those incidents, the alleged AT&T data is now in very broad circulation. It is undoubtedly in the hands of thousands of internet randos.

AT&T's position on this is pretty simple:

AT&T continues to tell BleepingComputer today that they still see no evidence of a breach in their systems and still believe that this data did not originate from them.

The old adage of "absence of evidence is not evidence of absence" comes to mind (just because they can't find evidence of it doesn't mean it didn't happen), but as I said earlier on, I (and others) have so far been unable to prove otherwise. So, let's focus on what we can prove, starting with the accuracy of the data.

The linked article talks about the author verifying the data with various people he knows, as well as other well-known infosec identities verifying its accuracy. For my part, I've got 4.8M Have I Been Pwned (HIBP) subscribers I can lean on to assist with verification, and it turns out that 153k of them are in this data set. What I'll typically do in a scenario like this is reach out to the 30 newest subscribers (people who will hopefully recall the nature of HIBP from their recent memory), and ask them if they're willing to assist. I linked to the story from the beginning of this blog post and got a handful of willing respondents to whom I sent their data and asked two simple questions:

  1. Does this data look accurate?
  2. Are you an AT&T customer and if not, are you a customer of another US telco?

The first reply I received was simple, but emphatic:

This individual had their name, phone number, home address and most importantly, their social security number exposed. Per the linked story, social security numbers and dates of birth exist on most rows of the data in encrypted format, but two supplemental files expose these in plain text. Taken at face value, it looks like whoever snagged this data also obtained the private encryption key and simply decrypted the vast bulk of (but not all) the protected values.

The above example simply didn't have plain text entries for the encrypted data. Just by way of raw numbers, the file that aligns with the "70M" headline actually has 73,481,539 lines with 49,102,176 unique email addresses. The file with decrypted SSNs has 43,989,217 lines and the decrypted dates of birth file only has 43,524 rows. The last file, for example, has rows that look just like this:

.encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'

That encrypted value is precisely what appears in the large file hence providing an easy way of matching all the data together. But those numbers also obviously mean that not every impacted individual had their SSN exposed, and most individuals didn't have their date of birth leaked.
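The matching described above amounts to a simple lookup join, which can be sketched in Python as follows. The row format is taken from the sample line shown earlier; everything else (the function names and the idea that the encrypted value can be pulled out of a row of the large file as a plain string) is an assumption for illustration, since the layout of the large file isn't described here.

```python
import re

# Matches rows shaped like the sample line from the decrypted file
ROW_PATTERN = re.compile(
    r"\.encrypted_value='(?P<enc>[^']*)'\s+\.decrypted_value='(?P<dec>[^']*)'"
)


def build_lookup(lines):
    """Map each encrypted value to its plain text counterpart."""
    lookup = {}
    for line in lines:
        match = ROW_PATTERN.search(line)
        if match:
            lookup[match.group("enc")] = match.group("dec")
    return lookup


def resolve(encrypted_value, lookup):
    """Return the plain text value, or None if it was never decrypted."""
    return lookup.get(encrypted_value)


sample = [".encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'"]
lookup = build_lookup(sample)
print(resolve("*0g91F1wJvGV03zUGm6mBWSg==", lookup))
print(resolve("value-with-no-decrypted-row", lookup))
```

A missing key in the lookup corresponds to the case described above, where a row in the large file simply has no plain text entry.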

As I'm fond of saying, there's only one thing worse than your data appearing on the dark web: it's appearing on the clear web. And that's precisely where it is; the forum this was posted to isn't within the shady underbelly of a Tor hidden service, it's out there in plain sight on a public forum easily accessed by a normal web browser. And the data is real.

That last response is where most people impacted by this will now find themselves - "what do I do?" Usually I'd tell them to get in touch with the impacted organisation and request a copy of their data from the breach, but if AT&T's position is that it didn't come from them then they may not be much help. (Although if you are a current or previous customer, you can certainly request a copy of your personal information regardless of this incident.) I've personally also used identity theft protection services since as far back as the 90's now, simply to know when actions such as credit enquiries appear against my name. In the US, this is what services like Aura do and it's become common practice for breached organisations to provide identity protection subscriptions to impacted customers (full disclosure: Aura is a previous sponsor of this blog, although we have no ongoing or upcoming commercial relationship).

What I can't do is send you your breached data, or an indication of what fields you had exposed. Whilst I did this in that handful of aforementioned cases as part of the breach verification process, this is something that happens entirely manually and is infeasible en masse. HIBP only ever stores email addresses and never the additional fields of personal information that appear in data breaches. In case you're wondering why that is, we got a solid reminder only a couple of months ago when a service making this sort of data available to the masses had an incident that exposed tens of billions of rows of personal information. That's just an unacceptable risk for which the old adage of "you cannot lose what you do not have" provides the best possible fix.

As I said in the intro, this is not the conclusive end I wanted for this blog post... yet. As impacted HIBP subscribers receive their notifications and particularly as those monitoring domains learn of the aliases in the breach (many domain owners use unique aliases per service they sign up to), we may see a more conclusive outcome to this incident. That may not necessarily be confirmation that the data did indeed originate from AT&T, it could be that it came from a third party processor they use or from another entity altogether that's entirely unrelated. The truth is somewhere there in the data, I'll add any relevant updates to this blog post if and when it comes out.

As of now, all 49M impacted email addresses are searchable within HIBP.

The Data Breach "Personal Stash" Ecosystem

I've always thought of it a bit like baseball cards; a kid has a card of this one player that another kid is keen on, and that kid has a card the first one wants so they make a trade. They both have a bunch of cards they've collected over time and by virtue of existing in the same social circles, trades are frequent, and cards flow back and forth on a regular basis. That's the analogy I often use to describe the data breach "personal stash" ecosystem, but with one key difference: if you trade a baseball card then you no longer have the original card, but if you trade a data breach which is merely a digital file, it replicates.

There are personal stashes of data breaches all over the place and they're usually presented like this one:

You'll recognise many of those names because they're noteworthy incidents that received a bunch of press. MySpace. Adobe. LinkedIn. Ashley Madison.

The same incidents appear here:

And so on and so forth. Stashes of breaches like this are all over the place and they fuel an exchange ecosystem that replicates billions of records of personal data over and over again. Your data. My data. The data of a significant portion of the global internet-using population, just freely flowing backwards and forwards not just in the shady corners of "the dark web" but traded out there in the clear on mainstream websites. Until inevitably:

Diogo Santos Coelho was 14 when he started RaidForums, and was 21 by the time he was arrested for running the service 2 years ago. A kid, exchanging data without the maturity to understand the consequences of his actions. RaidForums left a void that was quickly filled by BreachForums:

Conor Fitzpatrick was 20 years old when he was finally picked up for running the service last year. Still just a kid, at least in the colloquial sense in which we refer to youngsters once we get a bit older, but surely still legally a minor when he chose to begin collecting data breaches.

Websites like these are taken down for a simple reason:

The ecosystem of personal stashes exchanged with other parties fuels crime.

For example, data breaches seed services set up with the express intent of monetising a broad range of personal attributes to the detriment of people who are already victims of a breach. Call them shady versions of Have I Been Pwned if you will, and this talk I gave at AusCERT a couple of years ago is a great explainer (deep-linked to the start of that segment):

The first service I spoke about in that segment was We Leak Info and it was run by two 22 year old guys. The website first appeared 3 years earlier - only a year after the creators had left childhood - and it allowed anyone with the money to access anyone else's personal data including:

names, email addresses, usernames, phone numbers, and passwords

One of the duo was later sentenced to 2 years in prison for his role, and when you read the sorts of conversations they were having, you can't help but think they behaved exactly like you'd expect a couple of young guys who thought they were anonymous would:

In the video, I mentioned Jordan Bloom in relation to LeakedSource, a veritable older gentleman of this class of crime being 24 when the site first appeared. He eventually pled guilty to charges that included trafficking identity information and when you read what that involved, you can see why this would attract the ire of law enforcement agencies:

However, unlike other breach notification services, such as Have I Been Pwned, LeakedSource also gave subscribers access to usernames, passwords (including in clear text), email addresses and IP addresses. LeakedSource services were often advertised on hacking forums and there was suspicion that its operators were actively looking to hack organizations whose data they could add to their database.

In 2016, a well-wisher purchased my own data from LeakedSource and sent over a dozen different records similar to this one:

Not mentioned in my talk but running in the same era was Leakbase, yet another service that collated huge volumes of sensitive data and sold it to absolutely anyone:

And just like all the other ones, the same data appeared over and over again:

It went dark at the end of 2017 amidst speculation the disappearance was tied to the takedown of the Hansa dark web market. If that was the case, why did we never hear of charges being laid as we did with We Leak Info and LeakedSource? Could it be that the operator of Leakbase was only ever so slightly younger than the other guys mentioned above and not having yet reached adulthood, managed to dodge charges? It would certainly be consistent with the demographic pattern of those with personal stashes of data breaches.

Speaking of patterns: We Leak Info, LeakedSource, Leakbase - it's like there's a theme of shady services attached to the word "leak". As I say in the video, there's also a theme of attempting to remain anonymous (which clearly hasn't worked very well!), and a theme of attempting to eschew legal responsibility for how the data is used by merely putting words in the terms of service. For example, here's Jordan's go at deflecting his role in the ecosystem and yes, this was the entire terms of service:

I particularly like this clause:

You may only use this tool for your own personal security and data research. You may only search information about yourself, or those you are authorized in writing to do so.

That's not going to keep you out of trouble! Time and time again, I see this sort of wording on services used as if it's going to make a difference when the law comes asking hard questions; "Hey we literally told people to play nice with the data!"

We Leak Info used similarly entertaining wording with some of the highlights including:

  1. We Leak Info strictly prohibits the use of its Services to cause damage or harm to others
  2. You may not use Our Services in acts deemed illegal by the laws in Your region
  3. We Leak Info does not knowingly participate in the act of obtaining or distributing Data
  4. We Leak Info will cooperate with any legal investigations that it determines worthy and valid at its own discretion

That last one in particular is an absolute zinger! But again, remember, we're talking about guys who stood this service up as teenagers and literally worked on the assumption of "as [l]ong as we cooperate they [the FBI] won't fuck with us" 🤦‍♂️ The ignorance of that attitude whilst advertising services on criminal forums is just mind-blowing, even for kids.

All of which brings me to the inspiration for this blog post:

It's like I've seen it all before! No, really, because only a couple of days later someone running a service popped up and claimed responsibility for having exposed the data due to "a firewall misconfiguration". I'm not going to name or link the service, but I will describe a few key features:

  1. After purchasing access, it returns extensive personal information exposed in data breaches including names, email addresses, usernames, phone numbers, and passwords
  2. The operator is clearly trying to remain anonymous with no discoverable information about who is running it
  3. It has ToS that include: "You may only use this service for your own personal security and research. Furthermore, you may only search for information about yourself or those who you are authorized in writing to do so." (I know what you're thinking, so I diff'd it for you)
  4. The name of the service starts with the word "leak"
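For anyone wanting to reproduce that sort of comparison, the two ToS clauses quoted in this post can be diffed word by word with Python's standard `difflib` module. This is only a sketch of one way to do it, not necessarily how the original diff was produced; the constant and function names are made up for the example.

```python
from difflib import SequenceMatcher

LEAKEDSOURCE_TOS = (
    "You may only use this tool for your own personal security and data "
    "research. You may only search information about yourself, or those "
    "you are authorized in writing to do so."
)
NEW_SERVICE_TOS = (
    "You may only use this service for your own personal security and "
    "research. Furthermore, you may only search for information about "
    "yourself or those who you are authorized in writing to do so."
)


def compare(a: str, b: str):
    """Return a similarity ratio and the word-level differences."""
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    changes = [
        (" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes


ratio, changes = compare(LEAKEDSOURCE_TOS, NEW_SERVICE_TOS)
print(f"similarity: {ratio:.2f}")
for old, new in changes:
    print(f"{old!r} -> {new!r}")
```

Comparing on whitespace-separated words keeps the output readable: each printed pair is a run of words that was changed, added or dropped between the two clauses.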

I could write predictions about the future of this service but if you've read this far and paid attention to the precedents, you can reliably form your own conclusion. The outcome is easily predictable and indeed it was the predictability of the whole situation when I started getting bombarded with queries about the "Mother of all Breaches" that frustrated me; of course it was someone's personal stash, because we've seen it all before and we live in an era where it's dead easy to build services like this. Cloud is ubiquitous and storage is cheap, you can stand up great looking websites in next to no time courtesy of freely available templates, and the whole data breach trading ecosystem I referred to earlier can easily seed services like this.

Maybe the young guy running this service (assuming the previously observed patterns apply) will learn from history and quietly exit while the getting is good, I don't know, time will tell. At the very least, if he reads this and takes nothing else away, don't go driving around in a bright green Lamborghini!

Inside the Massive Naz.API Credential Stuffing List

It feels like not a week goes by without someone sending me yet another credential stuffing list. It's usually something to the effect of "hey, have you seen the Spotify breach", to which I politely reply with a link to my old No, Spotify Wasn't Hacked blog post (it's just the output of a small set of credentials successfully tested against their service), and we all move on. Occasionally though, the corpus of data is of much greater significance, most notably the Collection #1 incident of early 2019. But even then, the rapid appearance of Collections #2 through #5 (and more) quickly became, as I phrased it in that blog post, "a race to the bottom" I did not want to take further part in.

Until the Naz.API list appeared. Here's the back story: this week I was contacted by a well-known tech company that had received a bug bounty submission based on a credential stuffing list posted to a popular hacking forum:

Whilst this post dates back almost 4 months, it hadn't come across my radar until now and inevitably, also hadn't been sent to the aforementioned tech company. They took it seriously enough to take appropriate action against their (very sizeable) user base which gave me enough cause to investigate it further than your average cred stuffing list. Here's what I found:

  1. 319 files totalling 104GB
  2. 70,840,771 unique email addresses
  3. 427,308 individual HIBP subscribers impacted
  4. 65.03% of addresses already in HIBP (based on a 1k random sample set)
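A sampled percentage like the one in the last item carries some uncertainty, so here is a rough sketch of how the "previously unseen" share could be bounded. It assumes a sample of exactly 1,000 addresses and a plain normal approximation; both are assumptions for illustration, not details from the original analysis.

```python
import math

def proportion_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a sampled proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

already_seen = 0.6503  # share of sampled addresses already in HIBP (from the list above)
sample_size = 1000     # "1k random sample set"; the exact size is an assumption

low, high = proportion_interval(already_seen, sample_size)
# The share of never-before-seen addresses is simply the complement.
new_low, new_high = 1 - high, 1 - low
print(f"previously unseen: {1 - already_seen:.2%} (roughly {new_low:.1%} to {new_high:.1%})")
```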

That last number was the real kicker; when a third of the email addresses have never been seen before, that's statistically significant. This isn't just the usual collection of repurposed lists wrapped up with a brand-new bow on it and passed off as the next big thing; it's a significant volume of new data. When you look at the above forum post the data accompanied, the reason why becomes clear: it's from "stealer logs" or in other words, malware that has grabbed credentials from compromised machines. Apparently, this was sourced from the now defunct illicit.services website which (in)famously provided search results for other people's data along these lines:

I was aware of this service because, well, just look at the first example query 🤦‍♂️

So, what does a stealer log look like? Website, username and password:

That's just the first 20 rows out of 5 million in that particular file, but it gives you a good sense of the data. Is it legit? Whilst I won't test a username and password pair on a service (that's way too far into the grey for my comfort), I regularly use enumeration vectors on websites to validate whether an account actually exists or not. For example, take that last entry for racedepartment.com, head to the password reset feature and mash the keyboard to generate a (quasi) random alias @hotmail.com:

And now, with the actual Hotmail address from that last line:

The email address exists.

The VideoScribe service on line 9:

Exists.

And even the service on the very first line:

From a verification perspective, this gives me a high degree of confidence in the legitimacy of the data. The question of how valid the accompanying passwords remain aside, time and time again the email addresses in the stealer logs checked out on the services they appeared alongside.

Another technique I regularly use for validation is to reach out to impacted HIBP subscribers and simply ask them: "are you willing to help verify the legitimacy of a breach and if so, can you confirm if your data looks accurate?" I usually get pretty prompt responses:

Yes, it does. This is one of the old passwords I used for some online services. 

When I asked them to date when they might have last used that password, they believed it was either 2020 or 2021.

And another whose details appear alongside a Webex URL:

Yes, it does. but that was very old password and i used it for webex cuz i didnt care and didnt use good pass because of the fear of leaking

And another:

Yes these are passwords I have used in the past.

Which got me wondering: is my own data in there? Yep, turns out it is and with a very old password I'd genuinely used pre-2011 when I rolled over to 1Password for all my things. So that sucks, but it does help me put the incident in more context and draw an important conclusion: this corpus of data isn't just stealer logs, it also contains your classic credential stuffing username and password pairs too. In fact, the largest file in the collection is just that: 312 million rows of email addresses and passwords.
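To illustrate the difference between those two kinds of rows, here's a minimal sketch that tells a stealer log row (website, username, password) apart from a classic credential stuffing pair (email and password). It assumes colon-delimited text with an email address as the username, which is purely an illustrative format; the real files may be laid out differently. The `Row` type, the `classify` function and the sample values are all hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Row(NamedTuple):
    kind: str            # "stealer" or "combo"
    url: Optional[str]   # only present on stealer log rows
    email: str
    password: str

# Assumed format: colon-delimited, with an email address as the username.
EMAIL = r"[^:@\s]+@[^:@\s]+"
COMBO = re.compile(rf"(?P<email>{EMAIL}):(?P<password>.*)")
STEALER = re.compile(rf"(?P<url>.+?):(?P<email>{EMAIL}):(?P<password>.*)")

def classify(line: str) -> Optional[Row]:
    """Return a parsed Row, or None if the line matches neither shape."""
    line = line.strip()
    m = COMBO.fullmatch(line)
    if m:
        return Row("combo", None, m["email"], m["password"])
    # The URL itself may contain colons (scheme, port), so the pattern
    # anchors on the email address rather than splitting on every colon.
    m = STEALER.fullmatch(line)
    if m:
        return Row("stealer", m["url"], m["email"], m["password"])
    return None

# Made-up sample rows, not taken from the actual data.
print(classify("https://example.com/login:alice@example.com:hunter2"))
print(classify("bob@example.org:pass:word"))
```

Anchoring on the email address keeps passwords containing colons intact, at the cost of skipping rows whose username isn't an email address.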

Speaking of passwords, given the significance of this data set we've made sure to roll every single one of them into Pwned Passwords. Stefán has been working tirelessly the last couple of days to trawl through this massive corpus and get all the data in so that anyone hitting the k-anonymity API is already benefiting from those new passwords. And there's a lot of them: it's a rounding error off 100 million unique passwords that appeared 1.3 billion times across the corpus of data 😲 Now, what does that tell you about the general public's password practices? To be fair, there are instances of duplicated rows, but there's also a massive prevalence of people using the same password across multiple different services and completely different people using the same password (there's a finite set of dog names and years of birth out there...) And now more than ever, the impact of this service is absolutely huge!

Pwned Passwords remains totally free and completely open source for both code and data so do please make use of it to the fullest extent possible. This is such an easy thing to implement, and it has a profound impact on credential stuffing attacks so if you're running any sort of online auth service and you're worried about the impact of Naz.API, this now completely kills any attack using that data. Password reuse remains rampant so attacks of this type prosper (23andMe's recent incident comes immediately to mind), so definitely get out in front of this one as early as you can.
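To give a sense of how little code that takes, here's a minimal sketch of a k-anonymity lookup against the Pwned Passwords range endpoint. Only the first 5 characters of the SHA-1 hash are sent and the matching happens locally. The helper names, the User-Agent string and the timeout are illustrative choices, not part of the API.

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"

def split_hash(password: str) -> tuple[str, str]:
    """SHA-1 the password and split it into the 5-char prefix and 35-char suffix."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def count_in_range(suffix: str, range_body: str) -> int:
    """Look for the suffix in a range response made of SUFFIX:COUNT lines."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0  # not in the returned range, so not a known pwned password

def pwned_count(password: str) -> int:
    """Query the live service; only the hash prefix ever leaves the machine."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "pwned-passwords-sketch"},  # illustrative value
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return count_in_range(suffix, body)
```

At registration, login or password change, something like `if pwned_count(candidate) > 0:` can then be used to reject the password or nudge the user towards a better one. Where to draw that line is a design decision for each service.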

So that's the story with the Naz.API data. All the email addresses are now in HIBP and searchable either individually or via domain and all those passwords are in Pwned Passwords. There are inevitably going to be queries along the lines of "can you show me the actual password" or "which website did my record appear against" and as always, this just isn't information we store or return in queries. That said, if you're following the age-old guidance of using a password manager, creating strong and unique ones and turning 2FA on for all your things, this incident should be a non-event. If you're not and you find yourself in this data, maybe this is the prompt you finally needed to go ahead and do those things right now 🙂

A Decade of Have I Been Pwned

A decade ago to the day, I published a tweet launching what would surely become yet another pet project that scratched an itch, was kinda useful to a few people but other than that, would shortly fade away into the same obscurity as all the other ones I'd launched over the previous couple of decades:

And then, as they say, things kinda escalated quickly. The very next day I published a blog post about how I made it so fast to search through 154M records and thus began a now 185-post epic where I began detailing the minutiae of how I built this thing, the decisions I made about how to run it and commentary on all sorts of different breaches. And now, a 10th birthday blog post about what really sticks out a decade later. And that's precisely what this 185th blog post tagging HIBP is - the noteworthy things of the years past, including a few things I've never discussed publicly before.

Pwned?

You know why it's called "Have I Been Pwned"? Try coming up with almost any conceivable normal sounding English name and getting a .com domain for it. Good luck! That was certainly part of it, but another part of the name choice was simply that I honestly didn't expect this thing to go anywhere. It's like I said in the intro of this post where I fully expected this to be another failed project, so why does the name matter?

But it's weird how "pwned" has stuck and increasingly, become synonymous with HIBP. For many people, the first time they ever hear the word is in the context of "Have I Been..." with an ensuing discussion often explaining the origins of the term as it relates to gaming culture. And if you do go and look for a definition of the term online, you'll come across resources such as How “PWNED” went from hacker slang to the internet’s favourite taunt:

Then in 2013, when various web services and sites saw an uptick in personal data breaches, security expert Troy Hunt created the website “Have I Been Pwned?” Anyone can type in an email address into the site to check if their personal data has been compromised in a security breach.

And somehow, this little project is now referenced in the definition of the name it emerged from. Weird.

But, because it's such an odd name that has so frequently been mispronounced or mistyped, I've ended up with a whole raft of bizarre domain names including haveibeenpaened.com, haveibeenpwnded.com, haveibeenporned.com and my personal favourite, haveibeenprawned.com (because a journo literally pronounced it that way in a major news segment 🤦‍♂️). Not to mention all the other weird variations including haveibeenburned.com, haveigotpwned.com, haveibeenrekt.com and after someone made the suggestion following the revelation that PornHub follows me, haveibeenfucked.com 🤷‍♂️

Press

It's difficult to even know where to start here. How does the little site with the weird name end up in the press? Inevitably, "because data breaches", and it's nuts just how much exposure this project has had because of them. These are often mainstream news events and what reporters often want to impart to people is along the lines of "Here's what you should do if you've been impacted", which often boils down to checking HIBP.

Press is great for raising awareness of the project, but it has also quite literally DDoS'd the service with the Martin Lewis Money Show in the UK knocking it offline in 2016. Cool! No, for real, I learned some really valuable lessons from that experience which, of course, I shared in a blog post. And then ensured it could never happen again.

Back in 2018, Gizmodo reckoned HIBP was one of the top 100 websites that shaped the internet as we knew it, alongside the likes of Wikipedia, Google, Amazon and Goatse (don't Google it). Only the year after it launched, TIME magazine reckon'd it was one of the 50 best websites of the year. And every time I do a Google search for a major news outlet, I find this little website. The Wall Street Journal. The Standard (nice headline!) USA Today. Toronto Star. De Telegraaf. VG. Le Monde. Corriere della Sera. It's wild - I just kept Googling for the largest newspapers in various parts of the world and kept getting hits!

The point is that it's had impact, and nobody is more surprised about that than me.

Congress

How on earth did I end up here?!

6 years and a few days ago now, I found myself in a place I'd only ever seen before in the movies: Congress. American Congress. Saying "pwned"!

For reasons I still struggle to completely grasp, the folks there thought it would be a good idea if I flew to the other side of the world and talked about the impact of data breaches on identity verification. "You know they're just trying to get you to DC so they can arrest you for all that stolen data you have, right?! 🤣", the internet quipped. But instead, I had one of the most memorable moments of my career as I read my testimony (these are public hearings so it's all recorded and available to watch), responded to questions from congressmen and congresswomen and rounded out the trip staring down at where they inaugurate presidents:

Today, that photo adorns the wall outside my office and dozens of times a day I look at it and ask the same question - how did it all lead to this?!

Svalbard

The potential sale of HIBP was a very painful, very expensive chapter of life, announced in a blog post from June 2019. For the most part, I was as transparent and honest as I could be about the reasons behind the decision, including the stress:

To be completely honest, it's been an enormously stressful year dealing with it all.

More than one year later, I finally wrote about the source of so much of that stress: divorce. Relationship circumstances had put a huge amount of pressure on me and I needed a relief valve which at the time, I thought would be the sale of the project I loved so much but was becoming increasingly demanding. Ultimately, Project Svalbard (the code name for the sale of HIBP) had the opposite effect as years of bitter legal battles with my ex ensued, in part due to the perceived value that would have been realised had it been sold and some big tech company owned my arse for years to come. The project I built out of a passion to do community good was now being used as a tool to extract as much money out of me as possible. There's a wild story to be told there one day but whilst that saga is now well and truly behind me, the scars are still raw.

There were many times throughout Project Svalbard where I felt like I was living out an episode of Silicon Valley, especially as I hopped between interviews at the who's-who of tech firms in San Francisco to meet potential acquirers. But there was one moment in particular that I knew at the time would form an indelible memory, so I took a photo of it:

I'm sitting in a rental car in Yosemite whilst driving from the aforementioned meetings in SF and onto Vegas for the annual big cyber-events. I had a scheduled call with a big tech firm who was a potential acquirer and should that deal go through, the guy I was speaking to would be my new boss. I'd done that dozens of times by now and I don't know if it was because I was especially tired or emotional or if there was something in the way he phrased the question, but this triggered something deep inside me:

So Troy, what would your perfect day in the office look like?

I didn't say it this directly, but I kid you not this is exactly what popped into my mind:

I get on my jet ski and I do whatever the fuck I want

My potential new overlord had somehow managed to find exactly the raw nerve to touch that made me realise how valuable independence had become to me. 6 months later, Project Svalbard was dead after a deal I'd struck fell through. I still can't talk about the precise circumstances due to being NDA'd up to the wazoo, but the term we chose to use was "a change of business circumstances on behalf of the purchaser". With the benefit of hindsight, I've never been so happy to have lost so much 😊

The FBI

10 years ago, I certainly didn't see this on the cards:

Nor did I expect them to be actively feeding data into HIBP. Or the UK's NCA to be feeding data in. Or various other law enforcement agencies the world over. And I never envisioned a time where dozens of national governments would be happy to talk about using the service.

A couple of months ago, the ABC wrote a long piece on how this whole thing is, to use their term, a strange sign of the times.

He’s just “a dude on the web”, but Troy Hunt has ended up playing an oddly central role in global cybersecurity.

It's strange until you look at it through the lens of aligned objectives: the whole idea of HIBP was "to do good things after bad things happen" which is well aligned with the mandates of law enforcement agencies. You could call it... common ground:

This is something I suspect a lot of people don't understand - that law enforcement agencies often work in conjunction with private enterprise to further their goals of protecting people just like you and me. It's something I certainly didn't understand 10 years ago, and I still remember the initial surprise when agencies started reaching out. Many years on, these have become really productive relationships with a bunch of top notch people, a number of whom I now count as friends and make an effort to spend time with on my travels.

Passwords

This was never on the cards originally. In fact, I'd always been adamant that there should never be passwords in HIBP although in my defence, the sentiment was that they should never appear next to the username they originally accompanied. But looking at passwords through the lens of how breach data can be used to do good things, a list of known compromised passwords disassociated from any form of PII made a lot of sense. So, in 2017, Pwned Passwords was born. You know what I was saying earlier about things escalating quickly? Yeah:

As if to make the point, I just checked the latest stats and last week we did 301.6M requests in a single day. 100% of those requests - and that's not a rounded number either, it's 100.0000000000% - were served from Cloudflare's cache 🤯

There's so much I love about this service. I love that it's free, there's no auth, it's entirely open source (both code and data), the FBI feeds data into it and perhaps most importantly, it has real impact on security. It's such a simple thing, but every time you see a headline such as "Big online website hit with credential stuffing attack", a significant portion of the accounts being taken over have passwords that could easily have been blocked.

The Paradox of Handling Data Breaches

On multiple occasions now, I've had conversations that can best be paraphrased as follows:

Random Internet Person: I'm going to report you to the FBI for having all that stolen data

Me: Maybe you should start by Googling "troy hunt fbi" first...

But I understand where they're coming from and the paradox I refer to is the perceived conflict between handling what is usually the output of a crime whilst simultaneously trying to perform a community good. It's the same discussion I've often had with people citing privacy laws in their corner of the world (often the EU and GDPR) as the reason why HIBP shouldn't exist: "but you're processing data without informed consent!", they'll claim. The issue of there being other legal bases for processing aside, nobody consents to being in a data breach! The natural progression of that conversation is that being in a data breach is a parallel discussion to HIBP then indexing it and making it searchable, which is something I've devoted many words to addressing in the past.

But for all the bluster the occasional random internet person can have (and honestly, I could count the number of annual instances of this on one hand), nothing has come of any complaints. And when I say "complaints", it's often nothing more than a polite conversation which may simply conclude with an acknowledgment of opposing views and that's it. There has been one exception in the entire decade of running this service where a complaint did come via a government privacy regulator, I responded to all the questions that were asked and that was the end of it.

People

When you have a pet project like HIBP was in the beginning, it's usually just you putting in the hours. That's fine, it's a hobby and you're scratching an itch, so what does it matter that there's nobody else involved? Like many similar passion projects, HIBP consumed a lot of hours from early on, everything from obviously building the service then sourcing data breaches, verifying and disclosing them, writing up descriptions and even editing every single one of those 700+ logos by hand to be just the right dimensions and file size. But in the beginning, if I'd just stopped one day, what would happen? Nothing. But today, a genuinely important part of the internet that a huge number of individuals, corporations and governments have built dependencies on would stop working if I lost interest.

The dependency on just me was partly behind the possible sale in 2019, but clearly that didn't eventuate. There was always the option to employ people and build it out like most people would a normal company, but every time I gave that consideration it just didn't stack up for a whole bunch of reasons. It was certainly feasible from the perspective of building some sort of valuable commercial entity, but in just the same way as that question about my perfect day in the office sucked the soul from my body, so did the prospect of being responsible for other people. Employment contracts. Salary negotiations. Performance reviews. Sick leave and annual leave and all sorts of other people issues from strangers I'd need to entrust with "my baby". So, bringing in more people was a really unattractive idea, with 2 exceptions:

In early 2021, my (soon to be at the time) wife Charlotte started working for HIBP.

Charlotte had spent the last 8 years working with people just like me; software nerds. As a project manager for the NDC conferences based out of Norway, she'd dealt with hundreds of speakers (including me on many occasions), and thousands of attendees at the best conference I've ever been a part of. Plus, she spent a great deal of time coordinating sponsors, corporate attendees and all sorts of other folks that live in the tech world HIBP inhabited. For Charlotte, even though she's not a technical person (her qualifications are in PR and entrepreneurial studies), this was very familiar territory.

So, for the last few years, Charlotte has done absolutely everything that she can to ensure that I can focus on the things that need my attention. She onboards new corporate subscribers, handles masses of tickets for API and domain subscribers and does all the accounting and tax work. And she does this tirelessly every single day at all sorts of hours whether we're at home or travelling. She is... amazing 🤩

Earlier this year, Stefán Jökull Sigurðarson started working for us part time writing code, cleaning up code, migrating code and, well, doing lots of different code things.

Just today I asked Stefán what I should write about him, thinking he'd give me some bullet points I'd massage and then incorporate into this blog post. Instead, I reckon what he wrote was so spot on that I'm just going to quote the entire thing here:

"Just" that having had my eye on the service since it was released and then developing one of the first big integrations with the PwnedPasswords v2 API in EVE, coinciding with us meeting for the first time at NDC Oslo in 2018 shortly after,  HIBP has managed to take me on this awesome journey where it has been a part of launching my public speaking career, contributing to OSS with Pwned Passwords, becoming an MVP and helped me meet a bunch of awesome people and allowed me to contribute to a better and hopefully safer internet. I'm very happy and honoured to a be a part of this project which is full of awesome challenges and interesting problems to deal with. Having meeting invites from the FBI in my inbox a few years after doing a few experimental rest calls to the Pwned Passwords API in early 2018 was definitely not something I was expecting 😅

What really resonated with me in Stefán's message is that for him, this isn't just a job, it's a passion. His journey is my journey in that we freely devoted our time to do something we love and it led to many wonderful things, including MVP roles and speaking at "Charlotte's" conference, NDC. Stefán is based in Iceland, but we've still had many opportunities to share beers together and establish a relationship that transcends merely writing code. I can't think of anyone better to do what he does today.

Breaches

731 breaches later, here we are. So, what stands out? Just going off the top of my head here:

Ashley Madison. Everyone knows the name so it needs no introduction, but that incident in 2015 had a major impact on HIBP in terms of use of the service, and also a major impact on me in terms of the engagements I had with impacted parties. My blog post on Here’s what Ashley Madison members have told me still feels harrowing to read.

Collection #1. This is the one that really contributed to my stress levels in early 2019 and had a profound impact on my decision to look at selling the service. Read about where those 773M records came from (still the largest breach in HIBP to date).

Rosebutt. Don't make a joke about it, don't make a joke about it, don't... aw man, thanks The Register! (link to an archive.org version as they seem to have thought better of their image choice later on...) The point is that even serious data breaches can have their moments of levity.

Shit Express. Sometimes, you just need a bit of hilarity in your data breach. Shit Express is literally a site to send other people pieces of that - anonymously - and they got breached, thus somewhat affecting their anonymity. The more serious point is that as I later wrote, claims of anonymity are often highly misleading.

Future

I often joke about my life being very much about getting up each morning, reading my emails and events from overnight and then just winging it from there. Of course there are the occasional scheduled things not to mention travel commitments, but for the most part it's very much just rolling with whatever is demanding attention on the day. This is also probably a significant part of why I don't really want to see this thing grow into a larger concern with more responsibilities, I just don't want to lose that freedom. Yet...

We're gradually moving in a direction where things become more formalised. 3 years ago, I did 100% of everything myself. 1 year ago, I did everything technical myself. 6 months ago, we had no ticketing system for support. But these are small, incremental steps forward and that's what I'd like to see continuing. I want HIBP to outlive me, I just don't want it to become a burden I'm beholden to in the process. I'd like to have more people involved but as you can see from above, that's been a very slow process with only those very close to me playing a role.

The only thing I have real certainty on at the moment is that there will be more breaches. I've commented many times recently that the scourge that is ransomware feels like it's really accelerated lately, I wonder how many of the people in the emails and documents and all sorts of other data that get dumped there ever learn of their exposure? It's a non-trivial exercise to index that (for all sorts of reasons), but it also seems like an increasingly worthy exercise. Who knows, let's see how I feel when I get up tomorrow morning 🙂

Finally, for this week's regular video, I'm going to make a birthday special and do it live with Charlotte. Please come and join us, I'm not entirely sure what we'll cover (I'll work it out on the morning!) but let's make a virtual 10th birthday party out of it 🎂

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

Allegedly, Acuity had a data breach. That's the context that accompanied a massive trove of data that was sent to me 2 years ago now. I looked into it, tried to attribute and verify it then put it in the "too hard basket" and moved onto more pressing issues. It was only this week as I desperately tried to make some space to process yet more data that I realised why I was short on space in the first place:

Ah, yeah - Acuity - that big blue 437GB blob. What follows is the process I went through trying to work out what on earth this thing is, the confusion surrounding the data, the shady characters dealing with it and ultimately, how it's now searchable in Have I Been Pwned (HIBP), which may be what brought you to this blog post in the first place.

One of the first things I do after receiving a data breach is to literally just Google it: acuity data breach. Which immediately yielded this top result from June:

Ah, so Acuity is a healthcare company. But wait - here's the next result:

That's not about healthcare, that's Acuity Brands. How many companies called "Acuity" that have been breached are there?! Let's see what references I have in my email:

Another one 🤦‍♂️ That "breach" could be circumstantial, so we'll call it a "maybe", but it's yet another Acuity with a question mark next to it. So how many "Acuity" companies are out there in total?! Just in the course of investigating this data, I came across a total of 6 of them that as far as I can tell, are completely unrelated:

  1. Acuity Healthcare (definitely breached): acuity.healthcare
  2. Acuity Brands (definitely breached): acuitybrands.com
  3. Acuity Scheduling (maybe breached): acuityscheduling.com
  4. Acuity Insurance: acuity.com
  5. Acuity "Innovative technical solutions for Federal agencies that support the National Security & Public Safety missions": myacuity.com
  6. Acuity Ads: acuityads.com (now redirects to illumin.com)

Ugh, great. We'll work through them and try to figure out where they fit into the picture in a moment, but first let's look at the actual data. We already know it's 437GB, but it's the breadth of column headings that's most stunning; here's all 414 of them:

Just by eyeballing these, it really doesn't feel like the sort of data that comes from a healthcare provider, a brands company or a scheduler. The other 3, however... Maybe.

Some more data points before going further:

  1. The file is named "ACUITY_MASTER_18062020.csv" (this is the date I've elected to stamp the breach with - 18 June 2020)
  2. There are 21,873,706 email addresses in the file
  3. Of those, "only" 14,055,729 are unique so there's some redundancy
  4. The data is cleansed and formatted in a fashion that definitely isn't reflective of how data is entered by end users

On that final point, here's an example of what I'm talking about:

The last names are the same, as are the salutations. The physical addresses are spot on accurate in their structure as are the phone numbers; there are no spaces, no dashes and no other artifacts typical of millions of different humans entering data. This is clean - too clean.

The "datasource" field is another interesting data point with the top 10 values being:

  1. Buy.com
  2. Popularliving.com
  3. studentsreview.com
  4. TAGGED.COM
  5. jamster.com
  6. Expedia.com
  7. cbsmarketwatch.com
  8. netflix.com
  9. selfwealthsystem.com
  10. gocollegedegree.com

Each of these entries appeared at least hundreds of thousands of times, if not millions. Does that mean that Netflix, for example, provided customer data to this list? Almost certainly no, but it does feel reminiscent of the Acxiom / Live Ramp misattribution post I wrote a year ago where I listed full counts of a similar column. One of the top values there was also "TAGGED.COM" (also all in uppercase), alongside several other values that also appeared in both sources.
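Counts like the total versus unique email addresses and the most common "datasource" values can be produced in a single streaming pass over the CSV. The sketch below is one way that might look, not how the analysis was actually done. The "datasource" column name comes from the data described above, whilst the "email" column name, the `summarise` function and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter
from typing import Iterable

def summarise(rows: Iterable[dict], email_col: str = "email",
              source_col: str = "datasource") -> tuple[int, int, Counter]:
    """Return (total emails, unique emails, Counter of datasource values)."""
    total = 0
    unique = set()       # tens of millions of addresses will need a few GB of RAM
    sources = Counter()  # values are counted as-is, so "TAGGED.COM" stays uppercase
    for row in rows:
        email = (row.get(email_col) or "").strip().lower()
        if email:
            total += 1
            unique.add(email)
        source = (row.get(source_col) or "").strip()
        if source:
            sources[source] += 1
    return total, len(unique), sources

# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "email,datasource\n"
    "alice@example.com,TAGGED.COM\n"
    "ALICE@example.com,Buy.com\n"
    "bob@example.org,TAGGED.COM\n"
)
total, unique_count, sources = summarise(csv.DictReader(sample))
print(total, unique_count, sources.most_common(10))
```

For the real file, the same function could be fed `csv.DictReader(open(path, newline="", encoding="utf-8", errors="replace"))` so the 437GB never has to fit in memory at once; only the set of unique addresses does.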

Back to attribution and a post on a popular hacking forum jumps out:

Many things here line up, for example the column names that are very unique to this data source, including "estimatedincomecode", "del_point_check_digit" and "secondaryaddresspresent". The attribution is to the insurance company named "Acuity", but is that accurate? Insurance companies collect a lot of data as it's relevant to how they run their business, but that data is highly unlikely to include fields such as:

  1. SpectatorSportsBasketball
  2. SewingKnittingNeedlework
  3. PresenceOfUpscaleRetailCard

That's much more in the "data enrichment" space where a company sells a massive data set so that it can expand the profile data of the purchaser's existing customer base. It's a legitimate, honest, legal business model. It's also indistinguishable from this:

Hey, it's 437GB! And the column names line up! And it's called Acuity! There's a slightly different column count to mine (and similar but different to the hacker forum post), and a slightly different email count, but the similarities remain striking. How I got to this resource is also interesting, having come via someone I was discussing the data with a couple of years ago:

The YouTube video is a walkthrough of a campaign management tool to send emails to customers. Could that indicate the data came from Acuity Ads (now Illumin)? No, not in and of itself; the walkthrough there isn't that dissimilar to other campaign tools I've used in the past. No matter how much I looked, I just couldn't find a solid lead back to Acuity Ads and anything even remotely related was merely circumstantial. It could be from them, but it could also be from many other places, and the mere fact that a near-identical corpus of data was sitting there on an outright spam site only makes the whole mystery that much deeper. There was just one more interesting data point in that email:

i myself am in that dataset and i've been getting 100x more phishing/scam calls, emails, and physical mail

Let me end this with a best guess: this feels like the same situation as the massive Master Deeds incident in South Africa in 2017. In that case, a legally operating data aggregator (I think you know how I feel about those by now...) sold personal information to a real estate business who then left it publicly exposed. I say it feels the same because it's just such a clean set of data and it's clearly very comprehensive in terms of the columns. It's exactly what I'd expect a data aggregator to prepare and sell to other businesses so they could identify which of their existing customers likes needlework.

In the past, publishing blog posts like this has helped identify an origin service and if that happens again here then I'll be sure to provide an update. For now, I've loaded it into HIBP and flagged it as a spam list which means it won't impact the size of anyone's domains and bump them into a different subscription level. If you do have any interesting insights on this data, please leave a comment below and with any luck, one of the Acuity entities out there will emerge as the source.

Note: just after loading the data, I ran the calcs on how many of the addresses were pre-existing in HIBP. This seems like a statistically significant number 😲

Presently sponsored by: Webinar: 'How to Defend Against the Evilginx2.' Kuba Gretzky (Evilginx2) & Marcin Szary (Secfense) show a tool that counters MFA bypass.

Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

I like to think of investigating data breaches as a sort of scientific search for truth. You start out with a theory (a set of data coming from an alleged source), but you don't have a vested interest in whether the claim is true or not, rather you follow the evidence and see where it leads. Verification that supports the alleged source is usually quite straightforward, but disproving a claim can be a rather time-consuming exercise, especially when a dataset contains fragments of truth mixed in with data that is anything but. Which is what we have here today.

To lead with the conclusion and save you reading all the details if you're not inclined, the dataset so many people flagged to me this week titled "Linkedin Database 2023 2.5 Millions" turned out to be a combination of publicly available LinkedIn profile data and 5.8M email addresses mostly fabricated from a combination of first and last name. It all began with this tweet:

All good lies are believable at face value; is it feasible a massive corpus of LinkedIn data is floating around? Well, they were proper breached in 2012 to the tune of 164M records (by which I mean that incident was genuinely internal data such as email addresses and passwords extracted out by a vulnerability), then they were massively scraped in 2021 with another 126M records going into Have I Been Pwned (HIBP). So, when you see a claim like the one above, it seems highly feasible at face value which is what many people take it at. But I'm a bit more suspicious than most people 🙂

First, the claim:

This one is similar to my twitter data scrapped [sic] but for linkedin plus 2023

Now, there's a whole debate about whether scraped data is breached data and indeed whether the definition of it even matters. With the rising prevalence of scraped data, this topic came up enough that I wrote a dedicated blog post about it a couple of years ago and concluded the following in terms of how we should define the term "breach":

A data breach occurs when information is obtained by an unauthorised party in a fashion in which it was not intended to be made available

Which makes scrapes like this alleged one a breach. If indeed it was accurate, LinkedIn data had been taken and redistributed in a way it was never intended to be by either the service itself or the individuals whose data was in this corpus. So, it's something to take seriously, and that warranted further investigation.

I scrolled through the 10M+ rows of data (many records spanned multiple rows due to line returns), and my eyes fell on a fellow Aussie who for the purposes of this exercise we'll call "EM", being the initials of her first and last name. Whilst the data I'm going to refer to is either public by design or fabricated, I don't want to use a real person as an example without their consent so let's just play it safe. Here's a fragment of EM's record:

There are 5 noteworthy parts of this that immediately caught my attention:

  1. There are 5 different email addresses here with the alias for each one represented in "[first name].[last name]@" form. These exist in a column titled "PROFILE_USERNAMES". (Incidentally, this is why the headline of 2.5M accounts expands out to 5.8M email addresses as there are often multiple addresses per account.)
  2. There's a LinkedIn profile ID in the form of "[first name]-[last name]-[random hexadecimal chars]" under a column titled "PROFILE_LINKEDIN_ID". That successfully loaded EM's legitimate profile at https://www.linkedin.com/in/[id]/
  3. The numeric value in the "PROFILE_LINKEDIN_MEMBER_ID" column matched with the value on EM's profile from the previous point.
  4. The 2 dates starting with "2020-" are in columns titled "PROFILE_FETCHED_AT" and "PROFILE_LINKEDIN_FETCHED_AT". I assume these are self-explanatory.
  5. EM's first and last name, precisely as it appears in each of her 5 email addresses.

On its own, this record would be unremarkable. It'd be entirely feasible - this could very well be legit - except when you keep looking through the remainder of the data. A pattern quickly emerged and I'm going to bold it here because it's the smoking gun that ultimately indicates that a bunch of this data is fake:

Every single record with multiple email addresses had exactly the same alias on completely unrelated domains and it was almost always in the form of "[first name].[last name]@".
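That pattern is easy to test for mechanically. Here's a minimal sketch, assuming each record has already been parsed into a first name, a last name and a list of email addresses; the function names are hypothetical, and names containing spaces or punctuation aren't handled.

```python
def shared_alias(emails):
    """Return the alias (the part before the @) if every address uses the same one, else None."""
    aliases = {email.rpartition("@")[0].lower() for email in emails}
    return aliases.pop() if len(aliases) == 1 else None


def looks_templated(first, last, emails):
    """True when a record has 2+ addresses and every alias is exactly "first.last"."""
    if len(emails) < 2:
        return False
    return shared_alias(emails) == f"{first}.{last}".lower()


# Hypothetical record: the same alias on two unrelated domains.
print(looks_templated("Jane", "Doe", ["jane.doe@example.com", "jane.doe@example.org"]))
```

A single record matching proves nothing, as plenty of real people have addresses in this form. It's the proportion of multi-address records that match which tells the story.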

Representing email addresses in this fashion is certainly common, but it's far from ubiquitous, and that's easy to demonstrate. For example, I have tons of emails from Pluralsight so I dug one out from my friend "CU":

There's no dot, rather a dash. Every single real Pluralsight email address I looked at had a dash rather than a dot, yet when I delved into the alleged LinkedIn data and dug out another sample Pluralsight address, here's what I found:

That's not LM's real address because it has a dot instead of a dash. Every. Single. One. Is. Fake.

Let's try this the other way around and load up the existing breached accounts in HIBP for the domain of one of EM's alleged email addresses and see how they're formed:

That's definitely not the same format as EM's address, not by a long shot. And time and time again, the same pattern of addresses in the corpus of data in the original tweet emerged, drawing me to what seems to be a pretty logical conclusion:

Each email address was fabricated by taking the actual domain of a company the individual legitimately worked at and then constructing the alias from their name.
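The comparison against addresses already known to be real (dash versus dot, in the Pluralsight case) can be sketched along these lines. This is an illustration under assumptions rather than the process actually followed: the function names are hypothetical, and reducing an alias to a coarse "shape" is just one simple way to compare formats.

```python
import re
from collections import Counter


def alias_shape(email):
    """Reduce an address to a coarse alias shape, e.g. 'jane-doe@x' becomes 'word-word'."""
    alias = email.split("@", 1)[0]
    return re.sub(r"[A-Za-z0-9]+", "word", alias)


def dominant_shape(known_emails):
    """Most common alias shape among addresses known to be real for one domain."""
    shapes = Counter(alias_shape(email) for email in known_emails)
    return shapes.most_common(1)[0][0] if shapes else None


def matches_domain_convention(candidate, known_emails):
    """True when the candidate's alias shape matches the domain's dominant shape."""
    return alias_shape(candidate) == dominant_shape(known_emails)


# Hypothetical known-good addresses for a domain that uses dashes.
known = ["amy-lee@example.com", "bob-ray@example.com", "cat@example.com"]
print(matches_domain_convention("dan.fox@example.com", known))
```

A mismatch on one address isn't conclusive, as companies change conventions and make exceptions, but a corpus where the addresses for a domain never match that domain's known convention is a strong signal.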

And these are legitimate companies too because every single LinkedIn profile I checked had all the cues of accurate information and each domain I checked in the corpus of data was indeed the correct one for the company they worked at. I imagine someone has effectively worked through the following logic:

  1. Get a list of LinkedIn profiles whether that be by ID or username or simply parsing them out of crawler results
  2. Scrape the profiles and pull down legitimate information about each individual, including their employment history
  3. Resolve the domain for each company they worked at and construct the email addresses
  4. Profit?
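Step 3 in particular is trivial, which is part of why this theory is so plausible. A minimal sketch of how addresses could be constructed this way follows; it's a guess at the approach, not the actual tooling used, and the function name and inputs are hypothetical.

```python
def fabricate_addresses(first, last, company_domains):
    """Build a "first.last@domain" address for each company domain (illustrative only)."""
    alias = f"{first}.{last}".lower().replace(" ", "")
    return [f"{alias}@{domain.lower()}" for domain in company_domains]


# One person with two employers yields two addresses sharing the same alias.
print(fabricate_addresses("Jane", "Doe", ["example.com", "example.org"]))
```

Note how the output naturally produces the same alias on completely unrelated domains, one address per company in the person's employment history.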

On that final point, what is the point? The data wasn't being sold in that original tweet, rather it was freely downloadable. But per the date on EM's profile, the data could have been obtained much earlier and previously monetised. And on that, the date wasn't constant across records, rather there was a broad range of them as recent as July last year and as old as... well, I stopped when the records got older than me. What is this?!

I suspect the answer may partly lie in the column headings which I've pasted here in their entirety:

"PROFILE_KEY", "PROFILE_USERNAMES", "PROFILE_SPENDESK_IDS", "PROFILE_LINKEDIN_PUBLIC_IDENTIFIER", "PROFILE_LINKEDIN_ID", "PROFILE_SALES_NAVIGATOR_ID", "PROFILE_LINKEDIN_MEMBER_ID", "PROFILE_SALESFORCE_IDS", "PROFILE_AUTOPILOT_IDS", "PROFILE_PIPL_IDS", "PROFILE_HUBSPOT_IDS", "PROFILE_HAS_LINKEDIN_SOURCE", "PROFILE_HAS_SALES_NAVIGATOR_SOURCE", "PROFILE_HAS_SALESFORCE_SOURCE", "PROFILE_HAS_SPENDESK_SOURCE", "PROFILE_HAS_ASGARD_SOURCE", "PROFILE_HAS_AUTOPILOT_SOURCE", "PROFILE_HAS_PIPL_SOURCE", "PROFILE_HAS_HUBSPOT_SOURCE", "PROFILE_FETCHED_AT", "PROFILE_LINKEDIN_FETCHED_AT", "PROFILE_SALES_NAVIGATOR_FETCHED_AT", "PROFILE_SALESFORCE_FETCHED_AT", "PROFILE_SPENDESK_FETCHED_AT", "PROFILE_ASGARD_FETCHED_AT", "PROFILE_AUTOPILOT_FETCHED_AT", "PROFILE_PIPL_FETCHED_AT", "PROFILE_HUBSPOT_FETCHED_AT", "PROFILE_LINKEDIN_IS_NOT_FOUND", "PROFILE_SALES_NAVIGATOR_IS_NOT_FOUND", "PROFILE_EMAILS", "PROFILE_PERSONAL_EMAILS", "PROFILE_PHONES", "PROFILE_FIRST_NAME", "PROFILE_LAST_NAME", "PROFILE_TEAM", "PROFILE_HIERARCHY", "PROFILE_PERSONA", "PROFILE_GENDER", "PROFILE_COUNTRY_CODE", "PROFILE_SUMMARY", "PROFILE_INDUSTRY_NAME", "PROFILE_BIRTH_YEAR", "PROFILE_MARVIN_SEARCHES", "PROFILE_POSITION_STARTED_AT", "PROFILE_POSITION_TITLE", "PROFILE_POSITION_LOCATION", "PROFILE_POSITION_DESCRIPTION", "PROFILE_COMPANY_NAME", "PROFILE_COMPANY_LINKEDIN_ID", "PROFILE_COMPANY_LINKEDIN_UNIVERSAL_NAME", "PROFILE_COMPANY_SALESFORCE_ID", "PROFILE_COMPANY_SPENDESK_ID", "PROFILE_COMPANY_HUBSPOT_ID", "PROFILE_SKILLS", "PROFILE_LANGUAGES", "PROFILE_SCHOOLS", "PROFILE_EXTERNAL_SEARCHES", "PROFILE_LINKEDIN_HEADLINE", "PROFILE_LINKEDIN_LOCATION", "PROFILE_SALESFORCE_CREATED_AT", "PROFILE_SALESFORCE_STATUS", "PROFILE_SALESFORCE_LAST_ACTIVITY_AT", "PROFILE_SALESFORCE_OWNER_CONTACT_ID", "PROFILE_SALESFORCE_OWNER_CONTACT_NAME", "PROFILE_SPENDESK_SIGNUP_AT", "PROFILE_SPENDESK_DELETED_AT", "PROFILE_SPENDESK_ROLES", "PROFILE_SPENDESK_AVERAGE_NPS_SCORE", "PROFILE_SPENDESK_NPS_SCORES_COUNT", 
"PROFILE_SPENDESK_FIRST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE_SENT_AT", "PROFILE_SPENDESK_PAYMENTS_COUNT", "PROFILE_SPENDESK_TOTAL_EUR_SPENT", "PROFILE_SPENDESK_ACTIVE_SUBSCRIPTIONS_COUNT", "PROFILE_SPENDESK_LAST_ACTIVITY_AT", "PROFILE_AUTOPILOT_MAIL_CLICKED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_CLICKED_AT", "PROFILE_AUTOPILOT_MAIL_OPENED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_OPENED_AT", "PROFILE_AUTOPILOT_MAIL_RECEIVED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_RECEIVED_AT", "PROFILE_AUTOPILOT_MAIL_UNSUBSCRIBED_AT", "PROFILE_AUTOPILOT_MAIL_REPLIED_AT", "PROFILE_AUTOPILOT_LISTS", "PROFILE_AUTOPILOT_SEGMENTS", "PROFILE_HUBSPOT_CFO_CONNECT_SLACK_MEMBER_STATUS", "PROFILE_HUBSPOT_IS_CFO_CONNECT_MEETUPS_MEMBER", "PROFILE_HUBSPOT_CFO_CONNECT_AREAS_OF_EXPERTISE", "PROFILE_HUBSPOT_CORPORATE_FINANCE_EXPERIENCE_YEARS_RANGE"

Check out some of those names: LinkedIn is obviously there, but so is Salesforce and Spendesk and Hubspot, among others. This reads more like an aggregation of multiple sources than it does data solely scraped from LinkedIn. My hope is that in posting this someone might pop up and say "I recognise those column headings, they're from..." Who knows.

So, here's where that leaves us: this data is a combination of information sourced from public LinkedIn profiles, fabricated email addresses and in part (anecdotally, based on simply eyeballing the data, this is a small part), the other sources in the column headings above. But the people are real, the companies are real, the domains are real and in many cases, the email addresses themselves are real. There are over 1.8k HIBP subscribers in the data set and these are folks who have double opted-in so they've successfully received an email to that address in the past. Further, when the data was loaded into HIBP there were nearly a million email addresses that were already in the system so evidently, they were addresses that had previously been in use. Which stands to reason because even if every address was constructed by an algorithm, the pattern is common enough that there'll be a bunch of hits.
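The "already in the system" figure is essentially a set intersection. Here's a minimal sketch of that kind of overlap calculation, assuming both sides are plain lists of addresses; it says nothing about how HIBP actually stores or compares data, and the function name is hypothetical.

```python
def overlap(new_addresses, existing_addresses):
    """Return (count, share) of new addresses that already exist, ignoring case and whitespace."""
    new = {address.strip().lower() for address in new_addresses}
    existing = {address.strip().lower() for address in existing_addresses}
    hits = new & existing
    share = len(hits) / len(new) if new else 0.0
    return len(hits), share


# Hypothetical example: one of the two new addresses has been seen before.
count, share = overlap(["A@example.com", "b@example.com "], ["a@example.com"])
print(f"{count} pre-existing ({share:.0%})")
```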

Because the conclusion is that there's a significant component of legitimate data in this corpus, I've loaded it into HIBP. But because there are also a significant number of fabricated email addresses in there, I've flagged it as a spam list which means the addresses won't impact the scale of anyone's paid subscription if they're monitoring domains. And whilst I know some people will suggest it shouldn't go in at all, time and time again when I've polled the public about similar incidents the overwhelming majority of people have said "we want to know about it then we'll make up our own minds what action needs to be taken". And in this case, even if you find an email address on your domain that doesn't actually exist, that person who either currently works at your company or previously did has still had their personal data dumped in this corpus. That's something most people will still want to know.

Lastly, one of the main reasons I decided to invest hours into this today is that I loathe disinformation and I hate people using that to then make statements that are completely off base. I'm looking at my Twitter feed now and see people angry at LinkedIn for this, blaming an insider due to recent layoffs there, accusing them of mishandling our data and so on and so forth. No, not this time, the evidence has led us somewhere completely different.