merged · closed beta · 2026

Technical screening — without interviews.

Instead of Leetcode — one calibrated task in a real repo. The candidate opens a pull request. The system scores it automatically: tests, diff shape, commit quality, responses to review.

for HR managers

~2 min · PR scoring time
87% · rubric accuracy
0 hrs · of senior time
pull request · #42 · open
@@ src/billing/invoice.ts @@
   const amount = base * qty;
−  const tax = amount * 0.2;
+  const tax = calcTax(amount, country);

+  // edge case: UA VAT exemption
+  if (country === 'UA' && isExempt(plan)) {
+    return amount;
+  }
   return amount + tax;
CI passed · 87/87
Diff focus · 3 files, +24 −4
Rubric (LLM judge) · 4.6 / 5.0
senior · legacy-invoice · PASS

Problem

Technical interviews are broken. Everyone knows it, and everyone keeps doing them.

Leetcode measures Leetcode prep. System design — the skill of drawing boxes. Behavioral — the skill of telling STAR stories. None of them show how a person actually works day to day.

And in 2026 even that illusion of signal is gone: Copilot and Cursor close a typical task in 10 minutes. Your seniors run dozens of screening calls a month and dread how much time this stage burns.

Method · Cost
Leetcode screen · 2–4 hrs / candidate
System design interview · 1–2 hrs / candidate
Behavioral (STAR) · 1 hr / candidate
merged PR screen · ~2 min, automatic

* Cost estimate — screening one candidate, including engineer time

How it works

Four steps. Zero hours of engineering time.

  1. Recruiter picks a task · 30 sec to set up

     From the bank — matched to the candidate's level (Junior / Middle / Senior) and stack. No call, no whiteboard. 30 seconds in the panel.

  2. Candidate opens a pull request · 45–120 min of candidate time

     Gets a private repo with real context. AI is allowed — tasks are designed so it's necessary but not sufficient.

  3. System scores it automatically · ~2 min after submit

     CI tests, diff focus, commit quality, responses to auto-review. An LLM judge reads the entire PR against a structured rubric.

  4. Recruiter sees a ranked report · ready instantly

     Rubric scores, a link to the PR, strengths and weaknesses. After that — just a final interview with the team for culture fit.
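To make the last step concrete, here is a minimal sketch of what a ranked report could look like. The shape and field names (`CandidateReport`, `rubricScore`, and so on) are illustrative assumptions, not the product's actual API.

```typescript
// Hypothetical shape of the report a recruiter sees after step 4.
// Field names are illustrative assumptions, not the product's API.
interface CandidateReport {
  candidate: string;
  prUrl: string;                          // link to the scored pull request
  level: "junior" | "middle" | "senior";
  rubricScore: number;                    // overall 0–5, e.g. 4.6
  ciPassed: boolean;                      // deterministic CI gate
  strengths: string[];
  weaknesses: string[];
}

// Ranking as a plain sort: CI failures sink to the bottom,
// then the rubric score decides the order.
function rank(reports: CandidateReport[]): CandidateReport[] {
  return [...reports].sort(
    (a, b) =>
      Number(b.ciPassed) - Number(a.ciPassed) || b.rubricScore - a.rubricScore
  );
}
```

Keeping the report a plain, sortable record is what lets the recruiter compare candidates without reading every PR.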

Levels

The task is calibrated to the level.

The product's moat is task design. We don't fight AI — we design tasks so that without understanding the system, AI is just a typewriter. Every task is hand-calibrated on live candidates.

Junior · 45 min

Add a feature to a clean repo

A small project with its own conventions. You have to read the README, avoid breaking anything else, and write a test. Cursor can handle this — we're filtering out people who can't.

Key signals

  • Reads instructions · 30%
  • Doesn't break existing code · 40%
  • Writes a test · 30%

Expected score · 2.0–3.5 / 5.0
Middle · 90 min

Reproduce a bug and fix it

A larger repo, vaguely phrased task: "users complain that Y behaves oddly in case Z." AI doesn't know what to fix — you have to localize the root cause.

Key signals

  • Decomposition · 35%
  • Choice of fix layer · 35%
  • Rationale in PR · 30%

Expected score · 3.0–4.5 / 5.0
Senior · 120 min

Legacy with architectural debt

Task: "ship this feature so that in six months it can be extended to W without a rewrite." A design doc in the PR is required — AI will write the code, but it won't make the decisions for you.

Key signals

  • Trade-offs · 40%
  • Extensibility · 35%
  • Rationale quality · 25%

Expected score · 3.5–5.0 / 5.0
NOTE

AI is allowed and expected. A "blind" Claude solution scores 30–40/100: tests fail on edge cases, the PR description is empty, commits are one big blob. The rubric measures understanding of the system, not the fact that code got written.

What we actually measure

"Green CI" is only 20% of the signal.

The other 80% is what an LLM judge does better than a human interviewer when it has a structured rubric: reads the whole PR, the description, the commit history, and the responses to comments. No fatigue, no bias.

Rubric weights · 100% total

Automated CI · 45%

  • Tests pass · 20%
    Deterministic, no sampling error
  • Diff focus and size · 15%
    Minimal changes, no unrelated edits
  • Commit quality · 10%
    Atomic commits, Conventional Commits

LLM judge · 55%

  • Rationale in the PR description · 20%
    Whether the "why" is explained, not just the "what"
  • Task decomposition · 20%
    Train of thought, solution steps
  • Trade-offs and architecture · 15%
    Alternatives considered, choice justified

* Weights are configurable at the task template level
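As a rough illustration of how these weights could combine, here is a sketch in TypeScript. The signal names and the 0–5 scale per signal are assumptions based on the table above, not the actual scoring code.

```typescript
// Hypothetical signal names; the weights mirror the rubric table above.
interface RubricSignals {
  ciTests: number;        // tests pass
  diffFocus: number;      // minimal, focused diff
  commitQuality: number;  // atomic, Conventional Commits
  rationale: number;      // the "why" in the PR description
  decomposition: number;  // train of thought, solution steps
  tradeoffs: number;      // alternatives considered, choice justified
}

// Per-signal weights from the rubric; they sum to 1.0.
const WEIGHTS: Record<keyof RubricSignals, number> = {
  ciTests: 0.20,
  diffFocus: 0.15,
  commitQuality: 0.10,
  rationale: 0.20,
  decomposition: 0.20,
  tradeoffs: 0.15,
};

// Weighted average of per-signal scores, each on a 0–5 scale,
// so the total stays on the same 0–5 scale as the report.
function rubricScore(signals: RubricSignals): number {
  return (Object.keys(WEIGHTS) as (keyof RubricSignals)[]).reduce(
    (sum, key) => sum + WEIGHTS[key] * signals[key],
    0
  );
}
```

Because the weights sum to 1.0, a candidate who maxes every signal lands exactly at 5.0, and changing a template's weights reshapes the score without touching the scale.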

Who it's for

The developer is the user. The hiring manager is the buyer.

Each role gets what matters: developers get a fair async evaluation without live coding; managers get their team's time back and a sharper final round.

Developers

0 · live coding on camera
~2 min · from submit to report
  • You're not coding under stress for 45 minutes — you take the task and work at your own pace.
  • AI assistants aren't banned — they're expected. Your usual Cursor / Claude / Copilot setup just works.
  • You see the rubric up front: what's being scored, which signals matter. No vibe check.

Engineering leaders

40+ hrs · returned to the team each month
1 · meeting instead of 4–6 rounds
  • Your team doesn't burn 40 hours a month on screening calls.
  • The final meeting is about the person, not technical basics.
  • You see how the candidate thinks and justifies decisions — not just what they wrote.

Honest objections

Five things you're thinking right now.

01 · AI-resistance
What if the candidate just hands the task to Claude?
Let them. Tasks are designed so a "blind" AI solution scores 30–40 out of 100: tests fail on edge cases, the PR description is empty, commits are one big blob, responses to auto-review are generic. We don't measure "wrote code" — we measure "understood the system."
02 · Security
Tasks will leak onto the internet and into model training data.
Every task is parameterized: one template yields dozens of variants with different seeds, names, and emphases in the requirements. A public solution to one specific version won't pass another. Plus — an option for private tasks on your own code on the enterprise plan.
03 · Format
A senior won't do a three-hour take-home.
Agreed. For seniors — a 45-minute paired session: the candidate shares their screen, solves alongside AI, and the system records telemetry (how long they thought, what they searched, how much they rewrote). It fits on a calendar and measures more than a classic interview.
04 · Process
This doesn't replace the final interview with the team, does it?
We're not removing the interview. We're removing the stage where you burn 40 engineering hours filtering out people who simply can't code. Culture fit and "do I want to work with them for five years" remain a live conversation.
05 · Competitors
How is this better than HackerRank / CodeSignal / Codility?
They measure Leetcode — in 2026. We measure real work: a PR into a real repo with real context, scored by an LLM judge against a rubric. A different product category — work-sample assessment for the AI era.
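The task parameterization described in the Security answer (one template, many seeded variants) can be sketched roughly like this. The template fields and the mulberry32 PRNG are illustrative assumptions, not the product's implementation.

```typescript
// Illustrative sketch only: fields and PRNG choice are assumptions.
interface TaskTemplate {
  domains: string[];   // e.g. billing, inventory, notifications
  entities: string[];  // names substituted into the repo
  twists: string[];    // shifted emphasis in the requirements
}

// Tiny deterministic PRNG (mulberry32): the same seed always
// produces the same sequence, so every variant is reproducible.
function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One template + one seed → one concrete task variant.
function variant(tpl: TaskTemplate, seed: number) {
  const rnd = mulberry32(seed);
  const pick = (xs: string[]) => xs[Math.floor(rnd() * xs.length)];
  return {
    domain: pick(tpl.domains),
    entity: pick(tpl.entities),
    twist: pick(tpl.twists),
  };
}
```

The point of determinism is auditability: a recruiter can regenerate exactly the variant a candidate saw, while a leaked solution to seed 17 says nothing about seed 18.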

Blog

Essays on hiring in the AI era.

Screening practice, LLM-judge rubrics, open reports from the closed beta, guides for recruiters and candidates — no marketing, no "book a demo."

All articles

Demo

We'll show it on your stack.

15 minutes. You tell us who you're hiring. We'll show what screening in merged would look like instead of calls. Sample tasks for your stack — in your inbox the same day.

  • Closed beta, Ukraine market
  • No prepayment, no contract
  • Response within 24 hours

Alternative

Don't want to fill out a form? Email us directly: [email protected]


We don't share data with third parties and we don't send spam. Unsubscribe in one click, any time.