SOC2: The Screenshots Will Continue Until Security Improves

Fly.io-themed demonology-style sigils (image by Annie Ruygt)

Fly.io runs apps close to users by taking containers and upgrading them to full-fledged virtual machines running on our own hardware around the world. We’ll come right to the point: if you were waiting for us to be SOC2-compliant before giving us all your money, well, we’re SOC2 now, so take us for a spin and make your checks payable to Kurt.

If you’re off getting your app up and running on Fly.io and finding your checkbook, great! I won’t get in your way. The rest of you, though, I want to talk to you about what SOC2 is and how it works.

Spoiler: the SOC2 Starting Seven post held up pretty well.

SOC2 is the best-known infosec certification, and the only one routinely demanded by customers. I have complicated feelings about SOC2, which you will soon share. But also, a few years ago, I wrote a blog post about what startups need to do to gear up for SOC2. Having now project-managed Fly.io’s SOC2, I’d like to true that post up, since I’m officially a leading authority on the process.

SOC2 is worth talking about. It’s arcane in its particulars. Startups that would benefit from SOC2 are held back by the belief that it’s difficult and expensive to obtain. Consumers, meanwhile, split down the middle between cynics who’re certain it’s worthless and true-believers who think it sets the standard for how security should work.

Everybody would be better off if they stopped believing what they believe about SOC2, and started believing what I believe about SOC2.

I’m a Customer. What Should I Know?

Bottom-line: SOC2 is a weak positive indicator of security maturity, in the same ballpark of significance as a penetration test report (but less significant than multiple pentest reports).

  • SOC2 is an accounting-style, or “real”, audit. That means it confirms on-paper claims companies make about their security processes. They’re nothing at all like “security audits” or “penetration tests”, which are heavily adversarial, engineering-driven, and involve giving third parties free rein to find interesting problems. 
  • SOC2 is about the security of the company, not the company’s products. A SOC2 audit would tell you something about whether the customer support team could pop a shell on production machines; it wouldn’t tell you anything about whether an attacker could pop a shell with a SQL Injection vulnerability.
  • The guts of a SOC2 audit are a giant spreadsheet questionnaire (the “DRL”) and a battery of screenshots serving as evidence for the answers in the questionnaire.
  • The SOC2 DRL is high-level, abstract, and the product of accounting industry standards like the COSO framework; it understands things like “git” and “multi-factor authentication”, but nothing lower-level than that.
  • There are two kinds of SOC2 reports, done in sequence: the “Type I”, which takes a point-in-time snapshot of a company’s processes, and the “Type II”, which confirms over several months that the company consistently adhered to those processes. You can’t “flunk” a Type I audit. But it’s more annoying to pass a Type II, because you can’t travel back in time to close a gap you had last year.

I’m underselling SOC2. It assures some things pentests don’t:

  • consistent policies for who in the company gets access to what
  • that everyone in the company has to log in with 2FA
  • the capability, at least, to fire alerts when weird things show up in logs
  • severed employees reliably have their access terminated
  • that the company is not actually run by 4 raccoons in a trenchcoat (or, if it is, that a suitable policy has been written documenting and accepting the risk)

Depending on the kind of company you’re looking at, a SOC2 certification might be more or less meaningful. Intensely technical company? High-risk engineering? Look for the pentest. Huge number of employees? Get the SOC2 report.

So: if you’re clicking on SOC2 blog posts because you’re wondering how seriously you should take SOC2, there’s your answer. Go in peace.

The rest of you, buckle up.

What’s SOC2?

There’s a structure to the things you claim in SOC2: the AICPA’s “Trust Services Criteria” and something called “the COSO framework”. These are broken down into categories: security, availability, transaction integrity, confidentiality, and privacy. “Security” is mandatory, and is the only one that matters for most companies.

You might care about transaction integrity if you’re a Stripe, or confidentiality if you do e-Discovery for lawsuits, or privacy if you’re Equifax.

“DRL” is “Document Request List”

The ground truth of SOC2 is something called the DRL, which is a giant spreadsheet that your auditor customizes to your company after a scoping call. You can try to reason back every line item on the DRL to first accounting principles, but that’d be approximately as productive as trying to reason about contract law from first principles after paying a lawyer to review an agreement. Just take their word for it.

With me so far? SOC2. It’s a big spreadsheet an accounting firm gives you to fill out.

When Should You SOC2?

We waited as long as we felt we could.

Careful, now. “Getting SOC2-certified” isn’t the same as “doing the engineering work to get SOC2-certified”. Do the engineering now. As early as you can. The work, and particularly its up-front costs, scale with the size of your team.

The audit itself, though, doesn’t matter, except to answer the customer request “can I have your SOC2 report?”

So, “when should I SOC2?” is easy to answer. Do it when it’s more economical to suck it up and get the certification than it is to individually hand-hold customer prospects through your security process.

There’s a reason customers ask for SOC2 reports: it’s viral, like the GPL. The DRL strongly suggests you have a policy to “collect SOC2 reports from all your vendors”. Some SOC2 product vendors offer automated security questionnaires for companies to fill out, and they ask for SOC2 reports as well. Your customers ask because they themselves are SOC2, and the AICPA wants them to force you to join the coven.

That doesn’t mean you have to actually do it. If you can speak confidently about your security practice, you can probably get through anybody’s VendorSec process without a SOC2 report. Or you can pay an audit firm to make that problem go away.

It makes very little sense to get SOC2-certified before your customers demand it. You can get a long way without certification. If it helps, remember that you can probably make a big purchase order from that Fortune 500 customer contingent on getting your Type I in the next year.

What SOC2 Made Us Do

We started preparing for SOC2 more than a year before engaging an auditor, following the playbook from the “Starting 7” blog post. That worked, and I’m glad we did it that way. But to keep this simpler to read, I’m just going to write out all the steps we took as if they happened all at once.

Single Sign-On: Every newly-minted CSO I’ve ever asked has told me that SSO was one of the first 3 things they got worked out when they took the position. Put compliance aside, and it’s just obvious why you want a single place to control credentials (forcing phishing-proof MFA, for instance), access to applications, and offboarding. We moved everything we could to our Google SSO.

Inside our network, we also use a certificate-based SSH access control system (I’ll stop being coy, we use Teleport). To reach Teleport, you need to be on our VPN; to get to our VPN, you need Google SSO. Teleport, however, is authenticated separately, via Github’s SSO. So, for SSH, we have two authentication sources of truth, both of which need to work to grant access.

In addition to SSO, Teleport has the absolutely killer feature of transcript-level audit logs for every SSH session; with the right privilege level, you can watch other team members’ sessions, and you can go back in time and see what everyone did. This has the knock-on benefit of providing a transcript-level log of any REPL anyone has to drop into in prod.

There is, infamously, an “SSO tax” that companies pay to get access to the kinds of SaaS accounts that support SAML or OIDC. I have opinions about the SSO tax. It’s definitely a pain in our asses. If it’s early days for your company, you can skip SSO for SaaS vendors that don’t deal in sensitive information and just restrict who you give access to the app. But mostly, you should just suck it up and pay for the account that does SSO.

Protected Branches: I was surprised by how important this was to our auditors. If they had one clearly articulable concern about what might go wrong with our dev process, it was that some developer on our team might “go rogue” and install malware in our hypervisor build.

It’s easy to enable protected branches in Github. But all that does is make it hard for people to push commits directly to main, which people shouldn’t be doing anyways. To get the merit badge, we also had to document an approval process that ensured changes that hit production were reviewed by another developer.
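If you want that setting in code rather than clicked through the web UI, GitHub’s branch protection API takes a payload roughly like this (a sketch against `PUT /repos/{owner}/{repo}/branches/main/protection`; check GitHub’s API docs for the current field set):

```json
{
  "required_pull_request_reviews": {
    "required_approving_review_count": 1,
    "dismiss_stale_reviews": true
  },
  "enforce_admins": true,
  "required_status_checks": null,
  "restrictions": null
}
```

A screenshot of the resulting settings page is exactly the kind of evidence the DRL wants.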

This isn’t something we were doing prior to SOC2. We have components that are effectively teams-of-one; getting reviews prior to merging changes for those components would be a drag. But our auditors cared a lot about unsupervised PRs hitting production.

We asked peers who had done their own SOC2 and stole their answer: post-facto reviews. We do regular reviews on large components, like the Rust fly-proxy that powers our Anycast network and the Go flyd that drives Fly machines. But smaller projects like our private DNS server, and out-of-process changes like urgent bug fixes, can get merged unreviewed (by a subset of authorized developers). We run a Github bot that flags these PRs automatically and hold a weekly meeting where we review the PRs.
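The rule our bot applies is simple enough to fit in a few lines. Here’s a minimal sketch in Python (the field names are hypothetical, not our bot’s actual schema): a merged PR with no approving review from someone other than its author gets queued for the weekly meeting.

```python
# Sketch of the post-facto review rule: flag any merged PR that lacks
# an approving review from someone other than its author.
# Field names here are hypothetical illustrations.

def needs_post_facto_review(pr: dict) -> bool:
    """Return True if this merged PR should be queued for weekly review."""
    if not pr.get("merged"):
        return False  # only merged changes can hit production
    approvals = {
        r["reviewer"]
        for r in pr.get("reviews", [])
        if r.get("state") == "approved"
    }
    # Self-approval doesn't count; we want a second set of eyes.
    approvals.discard(pr["author"])
    return not approvals

def weekly_queue(prs: list[dict]) -> list[int]:
    """PR numbers to walk through in the weekly review meeting."""
    return [pr["number"] for pr in prs if needs_post_facto_review(pr)]
```

The point isn’t the code; it’s that the flagging is automatic, so the weekly meeting has a complete list instead of relying on anyone’s memory.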

Centralized Logging: A big chunk of the SOC2 DRL is about monitoring systems for problems. Your auditors will gently nudge you towards centralizing this monitoring, but you’ll want to do that anyways, because every logging system you have is one you’ll have to document, screenshot, and write policy about.

We got off easy here, because logging is a feature of our platform; we run very large ElasticSearch and VictoriaMetrics clusters, fed from Vector and Telegraf, and we’re generally a single ElasticSearch query away from any event happening anywhere in our infrastructure.
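For a sense of what “a single ElasticSearch query away” means in practice, here’s the shape of such a query (the index fields and values are made up for the example, not our actual schema):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "app": "fly-proxy" } },
        { "match": { "message": "tls handshake error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 50
}
```

One query, one screenshot, one DRL line item down.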

One thing SOC2 did force us to do was pick a ticketing system, which is something we’d done our best to avoid for several years. We send alerts to Slack channels and PagerDuty, and then have documented processes for ticketing them.

Another thing that surprised me was how much SOC2 mileage we got out of HelpScout. HelpScout is where our support mails (and security@ mails) go to, and while I’m not a HelpScout superfan, it is a perfectly cromulent system of record for a bunch of different kinds of events SOC2 cares about, like internal and external reports of security concerns.

CloudTrail: We barely use anything in AWS other than storage. We compete with AWS! We run our own hardware! But SOC2 audit firms have spent the last 10 years certifying AWS-backed SAAS companies, and have added a whole bunch of AWS-specific line-items to their DRLs. We’re now much better at indexing and alerting on CloudTrail than we were before we did SOC2. It’s too bad that’s not more useful to our security practice.
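“Indexing and alerting on CloudTrail” mostly means picking out the handful of event types you actually care about from the JSON deliveries. A minimal sketch (the event names are real CloudTrail ones; which ones to alert on is an illustrative choice, not our actual rules):

```python
import json

# Example control-plane events worth alerting on; the selection is a
# made-up illustration, not our actual alert configuration.
SUSPICIOUS_EVENTS = {"DeleteTrail", "StopLogging", "PutBucketPolicy"}

def alerts_from_cloudtrail(raw: str) -> list[str]:
    """Scan one CloudTrail log delivery (JSON) and return alert lines."""
    records = json.loads(raw).get("Records", [])
    alerts = []
    for rec in records:
        name = rec.get("eventName", "")
        user = rec.get("userIdentity", {}).get("arn", "unknown")
        # Flag dangerous control-plane calls and console logins without MFA.
        if name in SUSPICIOUS_EVENTS:
            alerts.append(f"{name} by {user}")
        elif name == "ConsoleLogin":
            mfa = rec.get("additionalEventData", {}).get("MFAUsed", "No")
            if mfa != "Yes":
                alerts.append(f"ConsoleLogin without MFA by {user}")
    return alerts
```

Feed the output to whatever already pages you; there’s no need for a dedicated product here.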

Infrastructure-as-Code: Your auditor will probably know what Terraform and CloudFormation are, and they will want you to be using them. Your job will be documenting how your own deployment system (the bring-up for new machines in your environment) is similar to Terraform. Sure, whatever.

An annoyance I did not see coming from previous experience was host inventory. Inventory is trivial if you’re an AWS company, because AWS gives you an API and a CLI tool to dump it. We run our own hardware, and while we have several different systems of record for our fleet of machines, they’re idiosyncratic and don’t document well; we ended up taking screenshots of SQL queries, which wasn’t as smooth as just taking a screenshot of our Tailscale ACLs or Google SSO settings.

MDM and Endpoint Management: Here’s a surprise: in the year of our lord 2022, doing endpoint security in your SOC2 has fallen out of fashion. We were all geared up to document our endpoint strategy, but it turned out we didn’t have to.

I should have some snarky bit of “insight” to share about this, but I don’t, and mostly all I can tell you is that you can probably cross this off your checklist of big projects you’ll need to get done simply to get a SOC2 certification. You should do the work anyways, on its own merits.

SOC2’s company control standards are firmly rooted in the accounting scandals of 2001, and it shows.

Boring Company Stuff: I’d been mercifully insulated from this aspect of SOC2 in my former role as engineering support for SOC2s, but had no such luck this time. I knew there was a lot of annoying company documentation involved in doing SOC2. I won’t put you to sleep with many of the details; if you’re the kind of company that should get a SOC2, the company and management stuff isn’t going to be an obstacle.

We had three “boring company stuff” surprises that stick out:

First, we needed a formal org chart posted where employees could find it. We’re not a “titles and management” kind of company (we’re tiny), so this was a bother. But, whatever, now we have an org chart. Exciting!

Next, our auditors wanted to see evidence of annual performance reviews. We don’t do annual performance reviews (we’re a continuous feedback, routine scheduled 1-on-1 culture). But if you’re not doing annual performance reviews, the AICPA can’t give assurances that an employee who exfiltrated our production database to Pastebin would be terminated. So now we have pro-forma annual reviews.

This kind of SOC2 thing falls under the category of “things you need to carefully explain to your team so they don’t think you’ve suddenly decided to start stack ranking everyone”.  

Finally, background checks.

Background checks are performative and intrusive. Ask around for horror stories about how they flag candidates for not being able to source the right high school transcripts. Also, for us, they’re occasionally illegal: we have employees around the world, including several in European jurisdictions that won’t allow us to background check.

This is the only issue we ended up having to seriously back-and-forth with our auditors about. We held the line on refusing to do background checks, and ultimately got out of it after tracking down another company our auditors had worked with, finding out what they did instead of background checks, stealing their process, and telling our auditors “do what you did for them”. This worked fine.

Policies: You’re going to write a bunch of policies. It’s up to you how seriously you’re going to take this. I can tell you from firsthand experience that most companies phone this in and just Google [${POLICY} template], and adopt something from the first search engine result page.

2019 Thomas would have done the same thing. But actually SOC2’ing a company I have a stake in put me in a sort of CorpSec state of mind. You read a template incident response or data classification policy. You start thinking about why those policies exist. Then you think about what could go wrong if there was a major misunderstanding about those policies. Long story short, we wrote our own.

This part of the process was drastically simplified by the work Ryan McGeehan has published, for free, on his “Starting Up Security” site. We have, for instance, an Information Security Policy. It’s all in our own words (and ours quotes grugq). But it’s based almost heading-for-heading on Ryan’s startup policy, which is brilliant.

One thing about writing policies inspired by Ryan’s examples is that it liberates you to write things that make sense and read clearly and don’t contain the words “whereas” or “designee”. Ryan hasn’t published a Data Classification policy. But our Data Classification policy was easy to write, in a single draft, just by using Ryan’s voice.

If you’re concerned about what you’re up against here, we ended up writing: (1) an Information Security Policy, which everyone on our team has to sign, (2) a Data Classification Policy that spells out how to decide which things can go in Slack or Email and which things you have to talk to management before transmitting at all, (3) a Document Retention policy (OK, this one I just sourced directly from our lawyers), (4) a Change Management policy, (5) a Risk Assessment policy, which says that sometime this year we will build a Risk Assessment spreadsheet explaining how we’ll handle a zombie apocalypse (you think I’m joking), (6) a Vulnerability Management policy that roughly explains how to run nmap, (7) an Access Request policy that tells people which Slack channel to join to ask for access to stuff, (8) a Vendor Management policy that propagates the SOC2 virus to all our vendors, (9) an Incident Response policy, which we cribbed from Ryan, (10) a Business Continuity plan that says that Jerome is in charge if Kurt is ever arrested for robbing a bank, and (11) an employee handbook.

I want to say more about our Access Management policy, which I shoplifted, at least in spirit, from a former client of mine that now works with us at Fly.io, but this is getting long, so you should just bug me about it online.

We use Slab as our company wiki / intranet. It’s great. Slab surprised me midway through the audit with a feature for “verifying” pages, which might be the highest ROI feature/effort ratio I’ve come across: I click a button and Slab adds a green banner to a page saying that it’s “verified” for the next several months. That’s it, that’s the feature. Several DRL line items are about “recertifying” policies annually, and this gave us the screenshots we needed for that. Well done, Slab! Betcha didn’t even realize you implemented policy recertification.

What We Didn’t Let SOC2 Make Us Do

Ordered from most to least surprising (that is, there was no way we were going to do the stuff at the bottom of the list).

  1. Install endpoint software. See above.
  2. Run a vulnerability scanner. Our attack surface is overwhelmingly software we wrote ourselves, and we’re not cool enough yet to have Nessus checks for our code. We took a note from Cloudflare and, for compliance-driven scanning, just automated nmap.
  3. Actually collect SOC2 reports from all of our vendors, or document why we didn’t. We’ll have to get that done soon-ish, but it wasn’t a blocker for the certification. 
  4. Clear any of the random Prototype Injection vulnerability alerts Dependabot generates for any project we have that uses Javascript.
  5. Install any agent software on any server anywhere. I’m terrified of agents. Also: of work. The work we allowed ourselves to do for SOC2 was, uh, carefully curated. Software to generate extra line items for us to remediate? Not helpful for our process.
  6. Run antivirus on our servers. We did have to explain why that didn’t make any sense, but it boiled down to documenting our deployment and CI/CD systems.
  7. Run any other kind of security product, be it IDS or WAF or SEM or DAST or ASM or XDR or CASB or anything with “mesh” in the name.
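On item 2 above: “automated nmap” can be as simple as a scheduled scan whose open ports get diffed against a known-good baseline, with anything new ticketed. A sketch (the parsing assumes nmap’s grepable `-oG` output; the baseline handling is illustrative):

```python
import re

def open_ports(grepable: str) -> dict[str, set[int]]:
    """Parse nmap -oG (grepable) output into {host: {open ports}}."""
    result: dict[str, set[int]] = {}
    for line in grepable.splitlines():
        m = re.match(r"Host: (\S+).*Ports: (.*)", line)
        if not m:
            continue
        host, ports = m.groups()
        for entry in ports.split(","):
            # Each entry looks like "22/open/tcp//ssh///".
            num, state = entry.strip().split("/")[:2]
            if state == "open":
                result.setdefault(host, set()).add(int(num))
    return result

def diff_against_baseline(scan: dict, baseline: dict) -> dict:
    """Ports open now that aren't in the approved baseline: ticket these."""
    return {
        host: ports - baseline.get(host, set())
        for host, ports in scan.items()
        if ports - baseline.get(host, set())
    }
```

Run it from cron, pipe the diff into the ticketing process, screenshot the ticket: that’s a compliance scanning program.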

The Audit Itself

If you talk to people who’ve done SOC2 before, you’ll hear a lot of joking about screenshots. They’re not joking.

The whole SOC2 audit is essentially a series of screenshots. This is jarring to people who have had “security audits” done by consulting firms, in which teams of technical experts jump into your codebase and try to outguess your developers and uncover flaws in their reasoning. Nothing like that happens in a SOC2 audit, because a SOC2 audit is an actual audit.

Instead, DRL line items are going to ask for evidence supporting claims you make, like “our production repositories on Github have protected branches enabled so that random people can’t commit directly to main”. That evidence will almost always take the form of one or more screenshots of some management interface with some feature enabled.

And that’s basically it? We divided the evidence collection stage of the audit up into a series of calls over the course of a week, each of which ate maybe twenty minutes of our time, most of which was us sharing a screen and saying “that checkbox over there, you should screenshot that”. I was keyed up for this before the calls started, prepared to be on my A-game for navigating tricky calls, and, nope, it was about as chill a process as you could imagine.

So, We’re Secure Now, And You Could Be Too

This was a lot of words, but SOC2 gives a lot of people a lot of feels, and I wish someone had written something like this down before I started doing SOC2 stuff.

The most important thing I can say about actually getting certified is to keep your goals modest. I’ve confirmed this over and over again with friends at other companies: the claims you make in your Type I will bind on your Type II; claims you don’t make in your Type I won’t. It stands to reason then that one of your Type I goals should be helping your future self clear a Type II.

I was a little concerned going into this that the quality of our SOC2 report (and our claims) would be an important factor for customers. And, maybe it will be. We got good auditors! I like them a lot! It wasn’t a paint-by-numbers exercise! But in talking to a couple dozen security people at other companies, my takeaway is that for the most part, having the SOC2 report is what matters, not so much what the SOC2 report says.

I can’t not mention this, even though our auditors might see it and preemptively refuse to do this for us.

At least one peer, at a highly clueful, highly security-sensitive firm, described to us a vendor that had given them not one, not two, but five consecutive Type I reports. It is possible to synthesize excited bromide in an argon matrix! You can skip all the real work in SOC2!

I’ve spent the last several weeks trying to convince Kurt that we’re not going to do this. You’ll know how I fared sometime next year.

We’ll have more to say about pentesting soon.

We do a lot of security work that SOC2 doesn’t care about; in fact, SOC2 misses most of our security model. We build software in memory-safe languages, work to minimize trust between components, try to find simple access control models that are easy to reason about, and then get our code pentested by professionals.

SOC2 doesn’t do a good job of communicating any of that stuff. And that’s fine; it doesn’t have to. We can write our own security report to explain what we do and how our security practice is structured, which is something I think more companies should do; I’d rather read one of those than a SOC2 report.

And for all my cynicism, SOC2 dragged us into some process improvements that I’m glad we’ve got nailed down. It helped to have clueful auditors, and a clear eye about what we were trying to accomplish with the process, if you get my drift.

I expected “entire new cloud provider” to be a complicated case for SOC2 audits. But the whole thing went pretty smoothly (and the un-smooth parts would have hit us no matter what we were building). If it was easy for us, it’ll probably be easier for you. Just don’t do it until you have to.