Why you should join us

Solstice is redefining how life sciences organizations commercialize their therapeutics. We are building a commercial engine that allows pharmaceutical marketers to launch campaigns at 100x the speed.

Rapid growth: Over the past year, we have worked with some of the world’s top life sciences manufacturers, including Pfizer and AstraZeneca, and we are working with more than 50 pharma brands.

Frontiers of technology: We're building applications and rapidly iterating on technical approaches at the frontiers of AI, experimenting with LLMs and new frameworks and incorporating them into our product to expand what is possible in pharmaceutical marketing.

Top-tier investors: We've raised from investors like Transformation Capital, Twelve Below, Virtue, and the founders of Datavant, Commure, and Paradigm to supercharge our growth and build an elite team of engineers and operators.

Building anything great requires commitment and dedication. We're looking for someone who has the fire to prove others wrong.

About the role

We're hiring our first dedicated QA lead to own quality for the AI that powers Solstice. Our platform generates regulated pharmaceutical marketing content for the brands we work with, so when the output is wrong, say an unsupported claim or a missing safety disclosure, it becomes a real compliance problem and not just a bug to file.

What makes this hard is that the system is probabilistic. The same prompt can return different answers, "correct" is often a judgment call, and the exact-match assertions that traditional QA relies on don't apply. We need someone who can measure quality anyway and build the evals that catch regressions before a client does, so the rest of the team can keep moving quickly.

This role is about more than the models, though. Just as much of what keeps customers happy is ordinary product reliability: that the app does what it should, and that a frontend tweak or a new feature doesn't quietly break something in production. You'll own that side too, with a solid end-to-end suite in Playwright and the hands-on manual testing that catches what automation misses.

This is a senior, hands-on engineering job. Most of your time goes to writing code and building test infrastructure, but you'll also dig into manual testing whenever that's the fastest way to find a problem. Your work will reach across the whole product, from the backend services to the frontend customers use every day.

What you'll be working on

Build our evaluation systems. Because we can't check an output against a single correct answer, you'll design the evals that score quality instead and decide, with evidence, what is good enough to ship.
Make model and prompt changes safe. We swap models and rewrite prompts constantly. Your tooling should flag a drop in quality, a jump in cost, or a latency regression before a customer runs into it.
Test the agents for the ways they actually fail. Agents drift off their goal, loop on the same tool call, pick the wrong tool, or get hijacked by a malicious instruction buried in a document we ingest. Those are the cases you'll design for.
Protect the compliance-critical paths. The checks that keep an unsupported claim or a missing disclosure out of a finished asset are the ones that matter most, and you'll own how we test them, including verifying claims against approved source material.
Own end-to-end testing across the app. Build and maintain a Playwright suite that exercises the real user flows, from login through content creation and review, so a frontend or API change can't quietly break something a customer depends on.
Run hands-on manual and exploratory QA. Automation misses things, especially on new features and messy UI states. You'll test releases by hand, dig for the edge cases, and be the last set of eyes before we ship.
Get CI/CD quality gates in place. Today nothing runs automatically when someone opens a pull request: no tests, no linting, no type checks. Building that is yours.
Use production as a test bed. We already trace and monitor what the system does once it's live. You'll turn those signals into drift detection and into new regression tests whenever something slips past us.
Harden the background jobs. A lot of our work runs in long pipelines, so they need to survive retries, timeouts, and worker crashes without dropping or duplicating work.
Set the testing bar. As our first QA hire, you'll define what good testing looks like here and help the rest of the team write code that's easy to trust.

What we're looking for

This is an engineering role first. You should be comfortable building testing and evaluation tools in code, and equally comfortable rolling up your sleeves for hands-on manual testing when that's what the situation needs.

Must-haves

Strong Python, and real experience building test infrastructure and getting it to run automatically in CI/CD.
Strong end-to-end and UI test automation, especially with Playwright.
A genuine manual and exploratory QA discipline. You can test a feature by hand, find the edge cases, and own release sign-off.
Experience testing non-deterministic, ML, or LLM-based systems, or the appetite to build that capability from scratch.
Comfort with evaluation methods: golden datasets, LLM-as-judge (rubric, pairwise, reference-based), and calibrating those judges against human or expert labels.
A statistical way of thinking about quality (variance, pass@k, regression detection) instead of simple pass/fail.
An instinct for error analysis: reading traces, grouping failures by theme, and turning the ones that matter into permanent tests.
Independent thinking and a strong sense of ownership. We can teach specifics, but you should be able to make calls on your own and build a quality function from nothing.
Clear communication. You can explain technical risk to non-technical people, and you both give and take direct feedback.
A serious work ethic. Diamonds were not made overnight.

Strongly preferred

Hands-on experience with agent or LLM frameworks (LangChain, LangGraph) and a feel for how agentic systems break.
Experience with eval and LLM-observability tools (LangSmith, Langfuse, Arize Phoenix, Braintrust, RAGAS, Promptfoo, OpenAI Evals, or similar).
Comfort in a modern frontend stack (TypeScript and React) so you can write meaningful UI tests and reproduce bugs quickly.
Adversarial or red-team testing, including catching outputs that work but cross a safety or regulatory line.
Backend experience with async Python services and task queues.

Bonus points

Experience in a regulated industry such as pharma, healthcare, or finance.
MLOps or LLMOps experience, including defining quality SLOs.

Benefits

Health, dental, and vision insurance
Ground-floor equity opportunity
401(k) with match
Visa sponsorship (O-1, H-1B, TN) available
Professional growth stipend
Relocation support
Work in a high-velocity, high-impact environment

Competitive NYC Compensation

This role offers a highly competitive NYC salary in the $160,000 to $300,000 range, calibrated to reflect experience, seniority, and the level of ownership you'll take on as part of the early team. We benchmark against top-tier startups and established tech companies to ensure our compensation is both fair and compelling, and we maintain flexibility at the upper end of the range for exceptional candidates who can meaningfully accelerate our product and engineering roadmap.

Member of Technical Staff (QA Engineer - Agentic Systems)
