AI for Biodiversity ↔︎ Biodiversity for AI
AI → Biodiversity
Earth faces unprecedented biodiversity crisis
<10% of ~20M species scientifically named
Explosion of multimodal data:
🧬 DNA barcoding
📸 Imaging
🎤 Bioacoustics
Shift from lab to “in-the-wild”
Foundation models across modalities
Biodiversity → AI
Complex, hierarchical data structures
Long-tailed species distribution
Fine-grained recognition challenges
Out-of-distribution performance
Continual learning requirements
Let me start with why I think the intersection of AI and biodiversity is such an important and interesting area to study for someone with my background. I come from deep learning and neural networks - I’ve done extensive work on vision and time series models, and more recently I’ve been getting into language models, particularly DNA sequence models.
Earth’s biodiversity faces an unprecedented crisis, with scientists warning of a sixth mass extinction. This decline threatens ecosystem health and human well-being, yet our ability to intervene is limited by vast knowledge gaps. Less than 10% of an estimated 20 million multicellular species have been scientifically named. What’s particularly exciting is that there’s an explosion of multimodal data being generated - DNA barcoding, imaging, and bioacoustics - shifting from lab-based to “in-the-wild” observations. We’re seeing the development of foundation models across the vision modality, the DNA modality, and the language modality, and it’s at this intersection where my team has been working.
But here’s what makes this research area especially compelling: while AI promises to mitigate the biodiversity crisis, biodiversity science reciprocally offers a sandbox for tackling fundamental AI and machine learning problems at scale. The complex, hierarchical nature of biodiversity data presents challenges for core ML research. The long-tailed distribution of species abundance provides a real-world context for learning from imbalanced datasets. Taxonomic classification challenges our ability to distinguish subtle differences in images or acoustic data. Ecological systems’ dynamic nature demands innovations in out-of-distribution performance and continual learning. This bidirectional relationship means we’re not just applying AI to solve biodiversity problems - we’re advancing the fundamental science of AI itself.
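To make one of these challenges concrete, here is a minimal sketch of the class-balanced re-weighting recipe of Cui et al. (2019), one standard way to keep a long-tailed species distribution from collapsing training onto the most common classes. The species counts are hypothetical and this is only an illustration of the general technique, not the training setup of any particular biodiversity model.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class loss weights from the 'effective number of samples'
    (Cui et al., 2019). Rare classes get larger weights so the long
    tail of seldom-seen species still contributes to training."""
    samples_per_class = np.asarray(samples_per_class, dtype=float)
    effective_num = 1.0 - np.power(beta, samples_per_class)
    weights = (1.0 - beta) / effective_num
    # Normalise so the weights sum to the number of classes.
    return weights / weights.sum() * len(samples_per_class)

# Hypothetical long-tailed counts: one abundant species, two rare ones.
print(class_balanced_weights([10_000, 50, 3]))
```

The rarest class ends up with by far the largest weight, which is exactly the behaviour you want when a handful of common insects would otherwise dominate the loss.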
5 million specimens
Images + DNA barcodes + taxonomic labels
Challenge: Incomplete taxonomic annotations
Many labels only to family/genus level
Expert annotation is difficult
Now let me move on to actually putting the data together for machine learning and building models for biodiversity monitoring and species discovery. I’ll talk about the development of the BIOSCAN-5M dataset and its impressive scale - we’re talking about 5 million specimens with paired images, DNA barcodes, and taxonomic labels.
One important challenge I want to highlight is that taxonomic labels don’t always go to the final level of taxonomic precision, which is the species level. Sometimes, because it’s such a difficult problem to annotate this data, even experts can only classify to family or genus level for much of the dataset. Annotator uncertainty is a fundamental challenge in our field.
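One common way to cope with labels that stop at family or genus is to predict each taxonomic rank separately and simply skip the loss at ranks an expert could not resolve. Here is a minimal PyTorch sketch of that idea; the rank names, tensor shapes, and the convention of marking unknown ranks with -1 are assumptions for illustration, not the actual BIOSCAN-5M training code.

```python
import torch
import torch.nn.functional as F

RANKS = ["order", "family", "genus", "species"]

def partial_rank_loss(logits_per_rank, labels_per_rank, unknown=-1):
    """Cross-entropy averaged over taxonomic ranks, ignoring specimens
    whose annotation stops above a given rank (label == unknown)."""
    losses = []
    for rank in RANKS:
        logits = logits_per_rank[rank]   # (batch, n_classes_at_rank)
        labels = labels_per_rank[rank]   # (batch,), `unknown` where unlabeled
        mask = labels != unknown
        if mask.any():                   # skip ranks with no labels in this batch
            losses.append(F.cross_entropy(logits[mask], labels[mask]))
    return torch.stack(losses).mean()
```

The point is simply that a specimen labelled only to family still provides a useful training signal at the order and family ranks, instead of being thrown away.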
Gradual Disempowerment: A Different Kind of Risk
“We might all find ourselves struggling to hold on to money, influence, even relevance. This new world could be more friendly and humane in many ways, while it lasts… But humans would be a drag on growth.”
— David Duvenaud, The Guardian, “Better at everything: how AI could make human beings irrelevant”
I want to bring your attention to another recent article on the question of gradual disempowerment. David Duvenaud from U of T and his collaborators point out that while a lot of attention has been placed on apocalyptic AI scenarios - humans suddenly losing power to AI systems - there could be much less sinister but equally disruptive scenarios.
We could gradually lose our ability to compete with AI and effectively become irrelevant. This connects directly to what the METR study shows us - as AI systems become capable of handling longer and more complex tasks, they’re moving into territory that was previously the exclusive domain of human professionals. It will be hard to justify spending twice as much for human work that’s only half as good.
Skills Being Automated
🔍 Web Search & Literature Review
LLM-powered search replacing Google
Synthesis in seconds vs hours
Multiple sources browsed automatically
💻 Software Development
GitHub Copilot, Cursor, Claude Code
AI agent: 660 commits, top contributor
From autocomplete to autonomous development
💡 Data Analysis & Ideation
Creative tasks being automated
Research ideation affected
No longer “human-only” territory
When you think about our jobs as bioinformaticians, several skills that are part of our general skill set are quickly being automated, if they haven’t been automated already. Let me call out the main areas:
LLM-powered search tools are rapidly replacing traditional Google searches, delivering superior results by browsing multiple sources and synthesizing information in seconds rather than hours.
Many of you know about the rapid advancement of AI coding tools like GitHub Copilot, Cursor, and agentic platforms like Claude Code.
Even tasks that were firmly in the realm of creative thinking, including data analysis and proposing novel research ideas, are being affected. We’re not being spared from automation in any area.
Agentic Coding: The Paradigm Shift
Autocomplete (GitHub Copilot) → seconds of autonomy
Pair Programming (Cursor) → minutes of autonomy
Agentic Development (Claude Code) → hours of autonomy
And the result?
“Catnip for programmers” - Armin Ronacher
Now let me move on to software development. You’re students in the bioinformatics or biotechnology programs, and based on my observations of last year’s posters, I know almost all of you are writing software. Many of you are probably getting exposure to agentic software development.
What we’re seeing is a fundamental shift in how we write code. Armin Ronacher, the creator of the Python Flask framework, describes this as moving away from AI simply auto-completing your thoughts to something much more powerful - real-time collaboration between human and AI. You’re working together with an agent that can break down complex tasks, execute them step by step, and work autonomously for extended periods while you oversee and guide the process.
A friend called this “catnip for programmers,” and Ronacher says it really feels like that - incredibly energizing and addictive in a way that traditional coding tools aren’t.
The Future: Nerd Managers
Workflow Evolution
Open GitHub issue
Agent works autonomously
Returns with PR
Review and merge
VIDEO
Future Vision
“Developers will be empowered to keep work queues full in large fleets of coding agents” - Steve Yegge
There’s an interesting recent interview between Jack Clark from Anthropic and Tyler Cowen that I recommend, where Jack Clark describes a future of “nerd managers” - people who are effectively managers of AI systems, babysitting agents that are spinning away and responding to them when they come to you, almost like managing a team of human interns. Ethan Mollick has remarked that organizations will need to update their org charts to reflect human and machine collaboration.
In this video from All Hands AI, Graham Neubig, who is also a professor at CMU, demos his own development workflow, which is largely GitHub-based rather than IDE-based. He’s effectively opening issues, describing bugs or features he needs help with in plain language. The agent takes that over and works away while he goes and opens up another issue for another agent. He can interact with an agent if it gets stuck, but the idea is that the agent returns with a pull request in GitHub with its completed work, which he can then review and merge. He’s typically overseeing 3-5 agents at a time.
What skills are needed for agentic development?
Graham Neubig from All Hands AI suggests these key skills:
🏗️🥧 Strong architecture skills and taste
💬 Communication
📖 Code reading
🔄 Multitasking
I like that Neubig is documenting his workflow. But I like even more his comments about what skills are really needed for agentic development. He talks about strong architecture skills and taste, communication, code reading, and multitasking. Of these skills, taste is really special, but I think it’s also the most nebulous.
Taste is essentially the ability to discriminate between good and bad outputs, to recognize quality and appropriateness in AI-generated content. It’s a fundamentally human skill that becomes crucial when you’re managing AI systems rather than doing all the work yourself.
Taste and RLHF
Supervised instruction tuning
Training signal = “produce exactly what a human would have written.”
Model copies the full human output distribution, including occasional mistakes and mediocre phrasing. Ceiling ≈ human average.
RLHF
Training signal = “move towards whichever candidate the human preferred.”
Model samples its own answers and a human simply picks the better one. Humans don’t have to create perfection, only recognise it.
Humans are far stronger critics than creators. Preference-grading lets them steer the model away from the left tail (bad answers) and pull the whole distribution rightward.
Over many iterations, the reward model keeps nudging the policy towards the best-judged answers, eventually surpassing the median human.
And there’s a really cool connection between taste and reinforcement learning from human feedback (RLHF) training, the “secret sauce” of LLMs.
Thomas Scialom, who led Llama 2 and Llama 3 post-training at Meta, provides compelling insight into why this discrimination ability is so powerful.
How can it be that chatbots don’t just match the physician distribution on quality or empathy, but actually go beyond it?
This is not due to supervised instruction tuning, which trains the model to match the humans’ distribution. It’s due to RLHF.
RLHF works because people are better at saying which answer is better than at writing the perfect answer themselves. By constantly rewarding the model’s best candidate and punishing the worst, you squeeze out the bad tail and push the whole distribution past typical human performance.
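To connect this to the actual training signal, here is a minimal sketch of the pairwise (Bradley-Terry) objective typically used to fit the reward model behind RLHF. The tensor names and numbers are hypothetical, and this covers only the reward-modelling step, not the full RLHF pipeline.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: push the score of the human-preferred
    answer above the score of the rejected answer. The human never has to
    write the perfect answer, only to recognise the better of two."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical scalar reward scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_model_loss(chosen, rejected))
```

The loss only ever asks the human (via the preference data) which answer is better - exactly the discrimination skill, or taste, we’ve been talking about.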
Developing Taste in Graduate School
Practice Deliberate Comparison
Generate multiple AI outputs for the same task and practice choosing which is better, articulating your reasoning
Develop familiarity with different frontier models and learn what each excels at
Platform selection itself is an exercise in taste
Engage in Structured Peer Collaboration
Seek out reviewing opportunities for conferences and journals to practice evaluating others’ work
Participate in code review with lab mates or open source projects to develop technical judgment
Don’t work in isolation - co-author papers with both senior and junior students, getting comfortable with giving and receiving feedback
Seek Active Mentorship
Ask experienced researchers to walk through their decision-making when they assess quality
Submit your own quality judgments to mentors for validation and refinement
No “Taste 101” course - learn by doing
Taste isn’t traditionally taught directly - it’s gained implicitly. There’s no course in developing your taste. I notice this with students coming into master’s or PhD programs. Often when they’re collecting references and producing citations in early paper drafts or doing literature reviews, they’re not really discriminating among the quality of publication venues - where those works are published, or the authors or groups behind those papers.
So how are you supposed to develop this skill? Here are some concrete approaches:
Practice deliberate comparison and platform selection. When working with AI tools, generate multiple outputs for the same task and practice choosing which is better, articulating your reasoning. Develop familiarity with different AI platforms - Claude, ChatGPT, Gemini - and learn what each excels at. This platform selection itself is an exercise in developing taste.
Engage in structured peer collaboration. Seek out reviewing opportunities for conferences and journals. If that’s not possible, you can organize paper review sessions where everyone reads the same paper, writes individual reviews, then discusses differences in judgment. Participate in code review with lab mates or in open source projects. Don’t work in isolation - co-author papers with both senior and junior students, getting comfortable with giving and receiving feedback.
And if you have the opportunity to work with experienced researchers directly, ask them to walk through their quality assessments with you. Have them give you feedback on your own quality judgments.
The AI Execution Gap
Study details:
Prior studies found settings where LLM-generated research ideas were judged as more novel than human-expert ideas
This study: 43 researchers, 100+ hours each
Randomly assigned to execute AI-generated or human-expert ideas
And there’s even more good news. Recent research from Stanford reveals something called the AI execution gap, which shows where humans maintain their edge. While many studies have found that LLM-generated research ideas seem more novel than human ideas at first glance, this study went further - they actually had experts implement the ideas.
Here’s what they did: 43 NLP researchers each spent over 100 hours implementing either AI-generated or human-expert ideas, then wrote papers documenting their results. When these papers were blindly reviewed, something interesting happened.
Although LLM-generated ideas may seem more novel at the ideation stage, they suffer a larger drop in novelty, excitement, effectiveness, and overall score after execution, effectively closing (and in some metrics reversing) the gap between AI and human ideas.
Are Junior Developers in Trouble?
MYTH: Junior developers are doomed
REALITY: You’re best positioned to succeed
✓ Quick to adopt AI-driven workflows
✓ Treat AI tools as on-the-job training
✓ Unburdened by legacy tooling
“Junior devs are vibing. They get it. The world is changing, and you have to adapt. So they adapt!”
“It’s not AI’s job to prove it’s better than you. It’s your job to get better using AI.” - Steve Yegge, “Revenge of the Junior Developer”
There’s been a lot of negative talk about career prospects for young professionals entering fields adjacent to bioinformatics and biotechnology. Are you all doomed? Is AI going to eliminate entry-level positions before you even get started?
I think this narrative is not only wrong, but backwards . Let me explain why I believe you’re actually in the best position to thrive in this new landscape.
Here I lean on an essay by Steve Yegge from Sourcegraph called “Revenge of the Junior Developer”. His key argument is that it’s actually the junior developers who are positioned well for success in the AI era because they’re adaptable, dynamic, and embracing these new agentic platforms.
It’s actually the legacy developers, who tend to be comfortable in their careers and tooling, who face the most disruption from AI systems.
Two Simple Rules
“AI can be used to avoid learning, and AI can be used to assist learning”
“It’s ok to ask AI to do things you already know how to do, but don’t ask AI to do things that you don’t know how to do”
Summing this all up, my colleague Andrew Hamilton-Wright, who many of you know as the co-coordinator of the MBINF research project course, put it to me very nicely recently: “AI can be used to avoid learning, and AI can be used to assist learning.”
This distinction is so important that OpenAI just released “study mode” in ChatGPT, specifically designed to help students work through problems step by step instead of just getting answers. Rather than providing solutions outright, it uses Socratic questioning and guided prompts to help you build understanding. As one student put it during testing, it’s like having “a live, 24/7, all-knowing office hours.”
How do you use AI to assist with learning rather than avoid it? Another eloquent way of capturing this was articulated by one of my postdocs, Scott Lowe: “Just don’t ask AI to do things that you don’t understand.”
Because if you get AI to do something and you don’t have any clue how it was done, you’ve completely avoided a learning opportunity.