Nephew - Jobs

AI Engineer at Nephew

💰 $100,000 - $180,000 🌍 New York, New York 📅 05/27/2026

Job Description

### About the Project

Nephew is the chief of staff small business owners can't otherwise afford. It
lives in SMS, handles the inbox, manages bookings, follows up on leads,
reconciles invoices, sends reminders, and files things in the right place. The
customer is a plumber, a dentist, a vet, a hairdresser: someone running a real
business between jobs, who texts lowercase and short. We're building the agent
that does the work they keep meaning to get to.

A small team, a wide surface, and a customer base that's pulling us forward
faster than we can build for them.

### What You'll Actually Do

**Build the agent runtime:** The part of Nephew that turns "follow up with the
three quotes from last week" into real messages that land in real inboxes,
and checks its own work before anything goes out. The agent has memory, acts
on its own initiative when the moment calls for it, and stays inside the
audience model whether an owner messaged first or not. You'll own how all of
that fits together.

**Author skills:** Skills are the unit of composable behavior, how we teach
Nephew to do a new shape of work. Booking flows, invoice chasing, lead triage,
daily reflection, whatever the next vertical needs. You'll write them, refine
them, and kill the ones that don't earn their tokens.

**Build the eval and observability layer:** This is where we catch regressions
before owners feel them. Build the datasets, score what matters, watch the
production traces, and close the loop from "this looked wrong" back into the
skill that produced it. If something drifts, you want to know before the owner
does.

**Ship integrations:** Connect Nephew to the tools small businesses actually
use. The work is auth flows, webhook plumbing, mapping someone else's schema
into ours, and handling the failures real APIs hand you. The proof point is an
integration that works against a live owner's calendar on the first try, not
against a recorded fixture or a demo account.

**Everything else:** A small team covers a lot of ground. SMS is the surface
now; web, email, and voice are queued behind it. Plenty of the work that
matters most this quarter isn't in this list.

### The Bar

Something you wrote goes live most days. A skill you authored is the reason an
owner stopped manually chasing invoices. When Nephew sends something weird at
7am on a Saturday, you can trace it end to end (runtime, skill, memory,
integration) and fix it before lunch. The founders are writing less code
because you've taken over surfaces they used to own.

Agentic coding has to already be how you work. We're not going to teach you
the workflow on the side; the people we're hiring built parts of it.

### Day-1 Reality

Day 1: laptop open, repo cloned, system prompt in one tab, production traces
in another. You read a real owner conversation from yesterday: the actual SMS
thread, the actual memory writes, the actual skill that fired. You find
something that's off. A reply that was technically correct but landed wrong. A
skill that loaded when it shouldn't have. A memory write that buried something
the agent needed.

Week 1: you ship the fix as a PR, it merges, it goes live, and an owner gets
the better version the next time they text. You'll have talked to a founder
about what to build next. You'll own a surface by the end of the week.

### Who You Are

An agent harness has your fingerprints on it. Maybe you wrote one, maybe you
tore a working one apart and rebuilt the parts that didn't fit. Either way,
you can sit down and explain which trade-offs you made and what you'd do
differently now.

You've gotten an agent to check its own work instead of hallucinating success.
The flavor doesn't matter: evals, replay, self-critique, post-hoc
verification, browser confirmation, whatever closed the loop. What matters is
that you've fought that fight and won it more than once.

The tools you reach for include ones you built yourself. Custom commands,
internal CLIs, MCP servers, harnesses tuned to your own brain. You don't take
the default toolchain as final.

**Taste for the audience:** You can read a draft SMS and tell whether a
plumber between jobs would understand it. You don't add a "decompose goal"
flow, because the owner doesn't think that way. You feel the audience model in
your bones, not as a rule you check against.

**Comfortable reading production:** You can sit in a trace, follow a long
thread of skill calls and memory writes, and find the place where the agent
went off the rails. Production data is where you do your best thinking, not a
place you visit when something's broken.

**Systems instincts:** You see where something will fail before it does, and
you make the trade-off between robust and fast on purpose, not by accident.

**Frontier-aware:** You've used the current generation of models, harnesses,
and orchestration patterns enough to have a working theory of which ones to
reach for and when. The theory updates when the frontier moves.

**Range:** A morning on the agent runtime, an afternoon on a new skill, an
evening on an eval or an integration. There aren't enough of us for
specialists.

**Speed at altitude:** Same day to production is the norm. We don't sit on
things. We also don't ship slop: the bar is high and the cycle is short, both
at once.

**Curious about how this stuff actually works:** You've read papers and
someone else's prompt files for the same reason you read code: to learn from
the people pushing on the same problem.

**Remote-first:** Wherever you're sharpest. We default to async, ship in
public channels, and pull the team together in person a few times a year.

### Why This Role Is Different

You and the founders, no one in between. The conversation about what to build
next happens between the people who'll build it. There's no PM layer to
translate, no ticket queue to live inside.

What you're building is the product itself. Nephew isn't an AI feature on top
of some older thing; the agent is what the owner pays for. So a skill that
lands cleaner, a memory write that recalls the right thing later, an
integration that doesn't drop the ball: those move retention, not some metric
two steps removed from it.

The customer talks back, fast and unedited. Small business owners don't soften
feedback. If Nephew sent something off, you'll hear about it, often in
lowercase, often within the hour. That's the fastest signal you'll ever build
against.

Mornings to afternoons, not quarters to quarters. You write a skill before
lunch, an owner uses it after, and you know whether it worked by the time you
log off.

### Even Better If

You've spent serious time on AI engineering already: agents in production,
harnesses you understand from the inside, eval infrastructure you've actually
used to catch something.

You've shipped something to a non-technical audience and felt the difference.
Consumer, SMB, anything where the user doesn't read your error messages.

You've built memory or state systems for agents and have opinions about what
works at scale: how to keep what matters, surface it at the right moment, and
avoid the failure modes that show up six months in.

Previous founder or early-stage builder. You've worn every hat and liked it.

Open source contributions to the AI tooling we live in.

### Tech

Python for the agent and the skills. LangSmith for evals and observability.
Pipedream for integrations. The runtime itself is TypeScript and lives in its
own repo. You don't need to know all of it coming in. You need to learn fast.

### How We Work

A handful of people, a lot of trust, almost no process. Whoever owns the
surface decides. Your first PR ships in your first week. There's no recurring
meeting where we align on alignment; we build, watch what happens, and adjust.

The agent's behavior is version-controlled. If it governs what Nephew does,
it's a file you can edit, review, and merge, not a setting you toggle in
someone else's UI.

Ownership here means a piece of the product, not a column on a board. The
thing you own is something owners actually depend on between jobs. If it
breaks, you're the one fixing it. If it lands, it's clear whose work made that
happen.

### Why Nephew

Small business owners have been waiting for software that meets them where
they are. Most of what's been built for them assumes they think like founders.
They don't. We do. The thing they actually need is a coworker who can text, and
we're early enough that the person who shows up next gets to decide a lot of
what that coworker becomes.