Research Scientist Interview Guide
Overview
This guide provides a practical framework to prepare for Research Scientist roles in academia, industry research labs (FAANG, OpenAI, DeepMind, Anthropic, NVIDIA), and national labs. It blends personal experience with hiring-manager expectations and common evaluation rubrics.
GitHub repository: Code
Data Science Notes: Data Science Intro
Statistical Learning Notes: Statistical Analysis
What interviewers evaluate
- Research impact: novelty, citations/adoption, reproducibility, clarity of problem–method–evidence chain.
- Technical depth: math/ML fundamentals, experimental rigor, ablation thinking, error analysis.
- Systems sense: how ideas become products—data, infra, metrics, reliability, safety & ethics.
- Execution: scope → plan → iterate → deliver (papers, open-source, patents, internal wins).
- Communication & collaboration: explain complex work to varied audiences; cross-discipline work.
- Culture/LP (leadership principles) fit: ownership, bias for action, frugality, customer/impact obsession.
Be explicit about your contribution. For each project: problem → gap → idea → method → evidence → limitations → next steps → impact.
Interview process at a glance
Typical stages
1) Recruiter + HM intro → 2) Tech/Research screens (coding, ML/math, paper deep dive) →
3) Onsite: research talk, systems/experimentation design, cross-functional, behavioral/bar raiser →
4) Debrief → offer.
Preparation timeline (6–8 weeks)
Weeks 1–2: Foundations & portfolio
- Curate 2–3 flagship projects; write 1-page project briefs (problem, novelty, 3 results, open questions).
- Refresh ML math: gradients, likelihoods, bias–variance, generalization, off-policy vs on-policy RL.
- DS&A 20–30 mins/day (arrays, hash maps, trees, graphs, DP; medium level).
- Draft talk outline; collect figures; start a reproducible repo.
Weeks 3–4: Research talk + deep dives
- Build slides (30/45/60 min versions). Timebox: Motivation 10% → Method 35% → Evidence 40% → Limits + Roadmap 15%.
- Prepare ablation stories and negative results; design a live error analysis demo if feasible.
- Mock talks with labmates; iterate twice.
Weeks 5–6: Systems & coding polish
- 5 case studies: online inference, data pipelines, eval at scale, safety/guardrails, monitoring.
- Practice 6–8 coding problems in 60-min sessions; review idioms (two-pointers, heap, BFS/DFS, topo sort).
- Draft answers for 8–10 behavioral prompts using STAR(L).
Week 7+: Company-specific tuning
- Read team papers/repos; align your roadmap slide to their charter.
- Prepare 8–12 questions to ask (below).
- Dry run the full onsite (talk + 3 interviews + behavioral) in a single sitting.
Research portfolio deep dive
For each project, be ready to answer:
- Gap: What did prior SOTA not address? Why now?
- Assumptions: Distributional, structural, or operational assumptions; how were they validated?
- Method: Key design choices (loss, architecture, training regime, priors/constraints).
- Evidence: Metrics that matter (with CIs); strongest ablation; hardest failure case.
- Impact: External adoption, dataset/code release, internal KPI movement, patents.
- Next: What would you do with 3 months & a small team?
Artifacts checklist
- 10–12 figure slide deck (vector PDFs)
- 1-page PDF overview
- GitHub README with quickstart
- Repro seed + script
Technical machine learning knowledge (what to refresh)
- RL: policy gradient theorem; advantage estimation; PPO/TRPO constraints; off-policy (DQN/TD3/SAC); safe RL & constraints; exploration vs exploitation; eval instability & seeding (see the GAE sketch after this list).
- Deep learning: optimization (warmup, cosine decay, AdamW), regularization (dropout, mixup, label-smoothing), attention/transformers, LoRA/parameter-efficient finetuning.
- Statistics & probabilistic modeling: MLE/MAP; conjugacy; posterior predictive; calibration (ECE), uncertainty (epistemic vs aleatoric); A/B testing pitfalls.
- Generative models: diffusion schedule & guidance, VAEs ELBO, GAN stability.
- LLMs: instruction tuning, RAG retrieval quality, eval (exact match, nDCG, win-rates), toxicity & safety filters, hallucination mitigation.
- Vision/multimodal: contrastive learning, detection/segmentation metrics (mAP, IoU), data augmentations.
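The RL bullet above references advantage estimation; here is a minimal sketch of Generalized Advantage Estimation (GAE) for a single trajectory, assuming a bootstrap value is appended and ignoring episode-boundary masks. The function name and default hyperparameters are illustrative, not a specific library API.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (no done-mask handling).

    rewards: shape (T,) per-step rewards
    values:  shape (T + 1,) value estimates, last entry is the bootstrap V(s_T)
    Returns (advantages, returns), each shape (T,).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # lambda trades bias (small lam) against variance (lam -> 1)
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns

# Toy usage: 5 steps of reward 1 with zero value estimates and a zero bootstrap.
adv, ret = gae_advantages(np.ones(5), np.zeros(6))
```

Being able to derive the TD residual and explain the lambda bias/variance trade-off is a common follow-up to the PPO question later in this guide.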
ML systems & experimentation design
5-step template
1. Clarify goal (user KPI ↔ technical metric; online vs offline).
2. Data (sources, labeling strategy, noise, sampling, privacy).
3. Model & infra (baseline → candidate → serving path; latency, cost, reliability).
4. Evaluation (offline metrics + counterfactual replays + online guardrails; slicing, sketched below).
5. Risk & safety (bias, misuse, red-teaming, rollback plan, observability).
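Step 4 calls out slicing; below is a minimal sliced-evaluation sketch in pandas. The column names (`y_true`, `y_score`, `segment`) and the ROC-AUC metric are assumptions for illustration, not a prescribed schema.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def sliced_auc(df: pd.DataFrame, slice_col: str = "segment") -> pd.DataFrame:
    """Compute ROC-AUC per slice so regressions on small segments aren't hidden
    by a healthy aggregate metric. Assumes columns y_true and y_score."""
    rows = []
    for name, grp in df.groupby(slice_col):
        # Skip degenerate slices that contain only one class.
        if grp["y_true"].nunique() < 2:
            continue
        rows.append({"slice": name, "n": len(grp),
                     "auc": roc_auc_score(grp["y_true"], grp["y_score"])})
    return pd.DataFrame(rows).sort_values("auc")

# Usage: sliced_auc(eval_df, slice_col="region") before signing off on the aggregate number.
```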
Example prompt (outline answer)
“Design a near real-time anomaly detector for a power grid substation.”
- KPI: reduce outage MTTR by 20%; constraints: <200 ms latency, 99.9% uptime.
- Data: PMU/SCADA streams; label via weak supervision + operator tags.
- Baseline: statistical thresholds; Model: streaming autoencoder + EWMA residuals (scoring sketch below).
- Eval: ROC-AUC offline, time-to-detect, false alarms/day; shadow deploy → phased rollout.
- Safety: fail-open; human-in-the-loop; drift detector; incident playbook.
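A minimal sketch of the EWMA-residual scoring idea from the outline above: a reconstruction model (here a placeholder) yields residuals, an EWMA smooths them, and a threshold flags anomalies. The alpha and k values are illustrative, not tuned.

```python
import numpy as np

def ewma_anomaly_scores(residuals, alpha=0.1, k=4.0):
    """Smooth per-timestep reconstruction residuals with an EWMA and flag points
    above mean + k * std of the smoothed series (offline threshold for simplicity;
    an online version would use running statistics)."""
    smoothed = np.zeros_like(residuals, dtype=float)
    smoothed[0] = residuals[0]
    for t in range(1, len(residuals)):
        smoothed[t] = alpha * residuals[t] + (1 - alpha) * smoothed[t - 1]
    mu, sigma = smoothed.mean(), smoothed.std() + 1e-8
    flags = smoothed > mu + k * sigma
    return smoothed, flags

# Usage with a stand-in "autoencoder": residual = |x - reconstruction|.
x = np.random.randn(1000)
recon = np.zeros_like(x)                  # placeholder reconstruction
smoothed, flags = ewma_anomaly_scores(np.abs(x - recon))
```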
Coding & algorithmic skills
- Aim for clean, correct, then optimal. Speak invariants, test cases, and complexity out loud.
- Patterns to practice: two-pointers, sliding window, monotonic stack, BFS/DFS, topological sort, binary search on answer, Dijkstra/Union-Find, prefix sums, DP on sequences/trees.
- ML-adjacent coding: vectorized NumPy, PyTorch modules/forward pass, dataloaders, batching, mixed precision, sanity checks.
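For the ML-adjacent coding bullet, here is a minimal PyTorch module plus the kind of sanity checks worth narrating out loud (output shapes, loss at initialization, gradient flow); the dimensions and names are arbitrary.

```python
import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    """Tiny MLP: (batch, in_dim) -> logits of shape (batch, n_classes)."""
    def __init__(self, in_dim: int = 32, hidden: int = 64, n_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Sanity checks to say out loud: shapes, loss at init near log(n_classes), gradients flow.
model = MLPClassifier()
x, y = torch.randn(8, 32), torch.randint(0, 3, (8,))
logits = model(x)
assert logits.shape == (8, 3)
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
assert all(p.grad is not None for p in model.parameters())
print(f"init loss {loss.item():.2f} vs log(3) ~= 1.10")
```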
Research talk: structure & slides
Slide budget (45-min talk + Q&A)
- Title & takeaway (1), Motivation (2), Problem/Gap (2), Method (5), Results (6), Ablations (3), Error analysis (2), Limits (1), Roadmap/fit (2).
Dos
- One idea per slide; consistent color for your method; readable axes; include n and CI.
- Put the thesis of each slide in the title: “Physics constraints cut violations by 38% at same cost.”
Don’ts
- Crowded plots, cherry-picked examples, unanchored qualitative claims, tiny captions.
Behavioral: STAR + research flavors
Use STAR(L): Situation, Task, Action, Result, Learning.
Prepare 6 stories: conflict, failure, leadership without authority, speed vs quality, mentoring, cross-team project.
Example prompt
“Tell me about a time your experiment invalidated a roadmap item.”
- S/T: critical Q3 milestone hinged on SOTA surpassing baseline.
- A: pre-registered analysis; ran holdout; flagged negative lift; proposed minimal viable alternative.
- R: saved ~6 wks eng time; reallocated to data quality; shipped smaller win.
- L: add “early stop” gates; improved pre-mortem checklist.
Recommended resources
- Papers: recent NeurIPS/ICLR/ICML tracks relevant to the team; read 2–3 team papers.
- Books: Designing Machine Learning Systems (Huyen), Deep Learning (Goodfellow), ESL (HTF), Probabilistic ML (Barber/Murphy).
- Practice: LeetCode medium sets; pair-program ML design prompts; mock talks.
Example technical & research questions
- Summarize the core contribution of your latest paper in two sentences.
- Which ablation most strongly supports your claim? Which one failed and why?
- How would you adapt your method under 10× less data? Under severe shift?
- What is your evaluation blind spot today? How would you close it?
- Explain epistemic vs. aleatoric uncertainty with a concrete modeling choice.
- For PPO, where does instability come from and how do you diagnose it?
- Design a reliable RAG system for safety-critical queries—retrieval, scoring, guardrails, and evaluation.
“Ask the interviewer”
Use the set that best fits the team. If time is short, ask the Top 3 in each block.
Hiring Manager (core, any research team)
- Top 3
- What research bets matter most in the next 6–12 months, and how will you measure success?
- What makes a “hell-yes” hire here after 90 days? What work would signal that?
- How do ideas transition from a paper/prototype to production or a public result?
- How is impact recognized—publications, product metrics, patents, internal adoption?
- What are examples of projects that didn’t land? Why, and what changed afterward?
- Where are the biggest data/infra bottlenecks that a new scientist can unlock?
Research Scientists (peer scientists)
- Top 3
- Which canonical datasets/eval harnesses are used for your area? How are baselines enforced?
- What’s a recent ablation or negative result that changed your roadmap?
- How do you share/replicate experiments (internal tooling, seeds, result store)?
- How are collaborations formed across teams? Any “platform” teams I should align with?
- What’s the cadence for paper reviews, reading groups, and internal talks?
Engineers / MLEs (production & infra)
- Top 3
- What are the serving constraints (latency, throughput, cost) and reliability SLOs?
- What’s the path from notebook → feature store → online eval → rollout/rollback?
- How do you monitor drift and failures in the wild? What’s the on-call/ownership model?
- What’s the CI/CD story for models (gating tests, shadow, canary, A/B infra)?
- What would you change in our current stack if you could?
PM / Cross-functional (applied impact)
- Top 3
- Which decision or workflow actually changes if this model improves by X%?
- What is the single metric you’d show leadership to justify continued investment?
- What risks (safety, bias, misuse) keep you up at night for this application?
- Where does data come from and how does quality/coverage limit the roadmap?
Recruiter / Compensation
- Level targeting and calibration—what evidence best demonstrates readiness for this level?
- Publication & open-source policy (authors, timing, preprints), conference travel norms.
- Visa/relocation timeline; expected hire start window and interview re-try policy.
Domain-specific examples
If the team is RL / Control (energy, robotics, autonomy)
- What control horizon & loop latency are assumed (e.g., 200 ms, 1 s)? Any hard real-time constraints?
- How are safety constraints enforced (e.g., physics-informed losses, shielded RL, reachability)?
- What are the reference environments (e.g., CityLearn/PowerGym, IEEE 13/34/123 bus, HIL)?
- How do sim-to-real gaps show up, and how do you mitigate them (domain randomization, calibration)?
- Which offline evaluation and counterfactual replay methods are trusted before online trials?
- What’s the bar for replacing a heuristic/OPF with an RL policy (guardrails, rollback, audits)?
If the team is LLM / RAG / GenAI
- Retrieval stack: index type, chunking, rerankers, eval (nDCG, recall@k, answer faithfulness).
- Hallucination budget & safety: filters/guardrails, red-teaming, and incident handling.
- Finetuning strategy: SFT/LoRA vs. prompt-only; distillation plans; model/versioning policy.
- What constitutes “win” in offline eval vs. human eval? How do you resolve conflicts?
- Data governance: PII/PHI handling, dedup, license compliance, auto-eval for drift.
If the team is Vision / Multimodal
- Canonical datasets & metrics (COCO mAP, IoU, retrieval R@k); how are domain shifts handled?
- Labeling strategy & quality control; synthetic data or augmentation policies.
- Deployment constraints: throughput on edge vs. server; quantization/compilation toolchain.
- Failure modes that matter most (false positives/negatives, OOD, adversarial artifacts).
If you only have time for 3 questions (universal)
- Strategy: What is the one result you’d want me to deliver in 6 months that proves this hire was a “yes”?
- Execution: What is the critical bottleneck (data, infra, evaluation) preventing faster progress today?
- Fit: Which strengths would make me the complement to the current team’s skills?
Checklists
Day-before technical interview
- Talk readiness: 30/45/60-min versions; 1-sentence thesis; clearly mark your contributions.
- Ablations & limits: 1 strongest ablation you can defend; 1 negative result and what it taught you.
- Paper deep dives: be ready to derive the key equation/algorithm; compare to 2 baselines with numbers.
- Math/ML refresh: RL (policy gradient, GAE, PPO/TRPO intuition), DL (optimizers, regularization), stats (MLE vs MAP, bias–variance, eval metrics).
- Coding drills: 2 timed mediums using core patterns (two-pointers, BFS/DFS, heap, topo sort, DP); practice test-first pseudocode.
- Systems prompts: outline 3 cases (data → model → eval → guardrails → rollout) you can walk through crisply.
- Q&A bank: 10 answers you can deliver fast (assumptions, failure modes, robustness, scalability, safety).
Day-of interview
- Open strong: restate problem + constraints; define the success metric before proposing solutions.
- Reasoning first: outline 2–3 approaches; justify trade-offs; state target time/space complexity.
- Coding round: implement incrementally; speak invariants and edge cases; analyze complexity after passing tests.
- Research talk: show the central figure; defend an ablation; admit a limitation and the next experiment.
- Design/experimentation: baseline → candidate → eval plan; define slices; propose risks & rollback.
- Close each round: 30–60s recap with decision, trade-offs, and “next step” you’d run.
Offer, Debrief & Negotiation
1) Right after the loop
- Keep a short brag doc: 5–8 bullets tying your work to measurable impact (citations, benchmarks, improvements).
- Send thank-you notes; ask the recruiter for decision timeline and whether the team needs any follow-ups (extra slides, code pointers).
2) Debrief (if you don’t get detailed feedback)
- Ask for 3 bullets: (i) strengths that stood out, (ii) top concern, (iii) what would change the decision next time.
- If there’s a miscalibration (e.g., level/scope), propose a targeted follow-up (short tech screen or focused deep dive).
3) Offer review (when it comes)
- Break it down: base + bonus + equity/RSUs (grant, vesting, refreshers) + sign-on + extras (compute budget, conference travel, publication policy, patent bonus).
- Clarify: level, title, team mandate, location policy, start date, performance review cycles, and conference/OSS policies.
4) Negotiation (impact-first)
- Anchor with scope & impact (what you can deliver in 6–12 months), plus comparables (peer offers or market data for the same level/geo).
- Prioritize asks (pick 2–3): sign-on / equity / level / research budget / conference travel.
- Sample script:
“Given the scope (X) and the impact I’m positioned to deliver (Y), I’m targeting total comp of Z at level L. If level is fixed, increasing equity by A or adding a B sign-on would bridge the gap. I’d also value C (e.g., conference travel commitment).”
5) Timelines
- If you need time: “I’m very excited. To make a well-considered decision, could we set a reply date of [date]? I want to complete one pending loop and compare scopes fairly.”
- If there’s an exploding deadline, ask for a short extension in exchange for a firm decision date and clear intent.
6) If rejected
- Request specific growth areas and ask whether a re-interview window (e.g., 6 months) with a targeted bar (e.g., systems/experiments) is possible.
Mentorship
If you’d like feedback on your talk, paper deep dive, or a full mock onsite, reach me at cs.kundann@gmail.com.