David Koleczek

I am a Senior Applied Scientist in Microsoft’s Office of the CTO, advising executives on the future of AI, its progress, and its implications, grounded in science. I build demos, write technical content, and contribute to open source. Previously I worked on M365 Copilot, Microsoft Loop, and was a member of MAIDAP.

Prior to Microsoft, I was a data scientist at MassMutual and at ISO New England. At ISO New England, I created forecasts to reliably predict energy demand for New England, which, to my knowledge, are still used daily (yes, I’m proud of that).


2022 -
Microsoft

Senior Applied Scientist

I started at Microsoft as an Applied Scientist in the MAIDAP program, where I helped lead efforts to open-source a "Guided Conversations" agent framework as a demo in Semantic Kernel, create Loop Copilot, advance Copilot technology for Microsoft Federal, and automate cloud incident root cause analysis for Azure. Then, as a member of the AI team for Microsoft Loop, I helped develop features for Copilot Pages, such as editing pages from Copilot Chat. I am now in my current role in the Office of the CTO.

2020 - 2022
MassMutual

Data Scientist

I was a Data Scientist at MassMutual as a member of the Data Science Development Program, working on various projects in Investment & Finance and Cybersecurity & Fraud. I also completed my Master's in CS during this time.

2020 - 2022
UMass Amherst Manning College of Information and Computer Sciences

M.S. Computer Science

2017 - 2020
ISO New England

Data Science Intern

I was an intern on the Day-Ahead Forecasting and Related Markets team at ISO New England. I worked on a variety of projects, including a machine learning system to forecast day-ahead energy demand. As of August 2021, it is being used as part of daily electric grid operations for the entire New England electric grid!

2016 - 2020
UMass Amherst Manning College of Information and Computer Sciences

B.S. Computer Science


Portfolio

Open Source

InteropRouter

InteropRouter is designed to seamlessly interoperate between the most common AI providers at a high level of quality. It uses the OpenAI Responses API types as a common denominator for inputs and outputs, allowing you to switch between providers with minimal code changes.

Open Source

Microsoft Eval Recipes

Eval Recipes is a library dedicated to making it easier to keep up with the state of the art in evaluating AI agents. It currently has two main components: a benchmarking harness for evaluating CLI agents (GitHub Copilot CLI, Claude Code, etc.) on real-world tasks via containers, and an online evaluation framework for LLM chat assistants. The common thread between these components is the concept of recipes, which are a mix of code and LLM calls that achieve a desired tradeoff between flexibility and quality.

Open Source

Microsoft Amplifier

Amplifier brings AI assistance to your command line with a modular, extensible architecture. My contributions include evaluation and building out core provider modules.

Game Dev

Trash Dash

A game where you knock pieces of trash into the air and collect them in your dump truck. Avoid the obstacles and get a high score! Built for Ludum Dare 58.

Open Source

Semantic Workbench Document Assistant

The Document Assistant is an AI assistant in Microsoft's Semantic Workbench focused on being easy to use for everyone. Its core feature is reliable document creation and editing, grounded in all of your context across files and the conversation.

Open Source

Chat Context Toolkit

The Chat Context Toolkit is a Python library, currently part of Microsoft's Semantic Workbench, designed to efficiently manage context for most AI agents. Read more on LinkedIn.

The Chat Context Toolkit provides three core, modular components:

Message History Management: Applies context engineering techniques to ensure that messages fit within a token budget.

Archive: A task for archiving and processing chunks of the message history that may no longer fit within a token budget to ensure older data can still be considered.

Virtual Filesystem: Creates a common abstraction for LLMs to read, edit, and explore files coming from a variety of disparate sources.
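To make the token budget idea concrete, here is a minimal sketch in plain Python. It is not the Chat Context Toolkit's API: the `Message` class, `estimate_tokens`, and `fit_to_budget` are hypothetical names made up for illustration, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer. It keeps the newest messages that fit within the budget and returns the older overflow separately, which is the kind of data an archiving step could then process.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical chat message type used only for this sketch."""
    role: str
    content: str


def estimate_tokens(message: Message) -> int:
    """Rough assumption: about 4 characters per token, at least 1 token."""
    return max(1, len(message.content) // 4)


def fit_to_budget(
    messages: list[Message], token_budget: int
) -> tuple[list[Message], list[Message]]:
    """Split messages into (kept, overflow).

    Walks from the newest message to the oldest and keeps messages until
    the next one would exceed the budget. Everything older is overflow.
    """
    used = 0
    cut = len(messages)  # index of the oldest message that is kept
    for index in range(len(messages) - 1, -1, -1):
        cost = estimate_tokens(messages[index])
        if used + cost > token_budget:
            break
        used += cost
        cut = index
    return messages[cut:], messages[:cut]


history = [
    Message("user", "a" * 40),       # about 10 tokens
    Message("assistant", "b" * 80),  # about 20 tokens
    Message("user", "c" * 20),       # about 5 tokens
]
kept, overflow = fit_to_budget(history, token_budget=25)
```

With a budget of 25 estimated tokens, the two newest messages are kept and the oldest one becomes overflow. A real implementation would likely use the model's tokenizer and smarter strategies than dropping whole messages.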

Project

TinkerTasker

TinkerTasker is an open-source, local-first CLI agent similar to Claude Code and Codex. The project let me focus on teaching important AI technology like the Model Context Protocol (MCP) while still having unique aspects of its own: it is simple and modular, which makes it fully hackable, and I developed it to run completely locally without any APIs at all.

Open Source

not-again-ai

not-again-ai is a collection of various building blocks that come up over and over again when developing AI products. The key goals of this package are to have simple, yet flexible interfaces and to minimize dependencies.

Game Dev

Diggity Diggity Diggity (Dash)

DIGGITY DIGGITY DIGGITY IT'S TIME TO GO RACING IN THE DEPTHS OF MOLE HILLS!!!

You are a mole who is in D.O.W.N.S.P.E.E.D. (Digging Operators with Notable Speed Pioneering Earth Excavation & Depth). Your goal is to win the championship and take home the cup! Built for Ludum Dare 57.

Project

ReDoodle

ReDoodle is a "daily" web puzzle game where you are given a starting image, and your goal is to transform it into a goal image through a series of prompts.

An example demonstrating how prompt formatting impacts the performance of the GPT-3.5-turbo-16k-0613 model, based on our experiments on multiple-choice questions related to international law from the MMLU benchmark.
Publication

Does Prompt Formatting Have Any Impact on LLM Performance?

Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, Sadid Hasan

In the realm of Large Language Models (LLMs), prompt optimization is crucial for model performance. Although previous research has explored aspects like rephrasing prompt contexts, using various prompting techniques (like in-context learning and chain-of-thought), and ordering few-shot examples, our understanding of LLM sensitivity to prompt templates remains limited. Therefore, this paper examines the impact of different prompt templates on LLM performance. We formatted the same contexts into various human-readable templates, including plain text, Markdown, JSON, and YAML, and evaluated their impact across tasks like natural language reasoning, code generation, and translation using OpenAI's GPT models. Experiments show that GPT-3.5-turbo's performance varies by up to 40% in a code translation task depending on the prompt template, while larger models like GPT-4 are more robust to these variations. Our analysis highlights the need to reconsider the use of fixed prompt templates, as different formats can significantly affect model performance.

Open Source

Guided Conversations

Guided Conversations is a framework in Semantic Kernel for building AI agents that lead goal-driven conversations with defined constraints, where the agent initiates dialogue, follows a structured conversation flow, exercises judgment to stay on track, and generates artifacts like notes and forms throughout the interaction. Common use cases include teaching scenarios, customer service interactions, and any situation where a "creator" defines conversation goals and collects information semi-autonomously through an AI assistant.

Game Dev

Societies Stranding

Societies around the galaxy are running out of space. You are the last hope to delay the stranding. Deliver pods from overpopulated planets (denoted with a red icon) to growing planets (denoted with blue icons). Get a high score before all the planets run out of space! Built for Ludum Dare 54.

Game Dev

Courier Crusaders

Enter a realm of magical deliveries in this fantasy RPG management game. Assemble your elite team of couriers and strive to thrive in the cutthroat world of delivery services. Built for Ludum Dare 53.

Game Dev

Barn Busters

Barn Busters is a physics-based tower defense game inspired by Fall Guys, built for Ludum Dare 52. It placed in the top 20 for both innovation and fun, and in the top 10% overall out of over 1,000 submissions.

Publication

UMass PCL at SemEval-2022 Task 4: Pre-trained Language Model Ensembles for Detecting Patronizing and Condescending Language

David Koleczek, Alex Scarlatos, Siddha Karakare, Preshma Linet Pereira

The 16th International Workshop on Semantic Evaluation (SemEval-2022)

Patronizing and condescending language (PCL) is everywhere, but rarely is the focus on its use by media towards vulnerable communities. Accurately detecting PCL of this form is a difficult task due to limited labeled data and how subtle it can be. In this paper, we describe our system for detecting such language which was submitted to SemEval 2022 Task 4: Patronizing and Condescending Language Detection. Our approach uses an ensemble of pre-trained language models, data augmentation, and optimizing the threshold for detection. Experimental results on the evaluation dataset released by the competition hosts show that our work is reliably able to detect PCL, achieving an F1 score of 55.47% on the binary classification task and a macro F1 score of 36.25% on the fine-grained, multi-label detection task.
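As a rough illustration of two ideas named in the abstract, ensembling and threshold optimization, here is a small self-contained sketch. It is not the paper's code: averaging per-model probabilities is just one common way to ensemble, the function names are hypothetical, the numbers are made up, and tuning the threshold on a held-out set is an assumption.

```python
def ensemble_average(per_model_probs: list[list[float]]) -> list[float]:
    """Average the positive-class probability across models, per example."""
    return [sum(column) / len(column) for column in zip(*per_model_probs)]


def f1_score(labels: list[int], predictions: list[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(
    probabilities: list[float], labels: list[int], candidates: list[float]
) -> tuple[float, float]:
    """Return the (threshold, F1) pair with the highest F1.

    Ties go to the first candidate that reaches the best score.
    """
    best = (candidates[0], -1.0)
    for threshold in candidates:
        predictions = [1 if p >= threshold else 0 for p in probabilities]
        score = f1_score(labels, predictions)
        if score > best[1]:
            best = (threshold, score)
    return best


# Made-up probabilities from two hypothetical models on four examples.
model_probs = [
    [0.9, 0.2, 0.4, 0.1],
    [0.7, 0.4, 0.6, 0.3],
]
dev_labels = [1, 0, 1, 0]

averaged = ensemble_average(model_probs)
candidates = [i / 10 for i in range(1, 10)]
threshold, score = best_threshold(averaged, dev_labels, candidates)
```

Tuning the threshold rather than fixing it at 0.5 tends to matter most when positive examples are rare, since the F1-optimal cutoff often sits away from 0.5.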

Publication

On Optimizing Interventions in Shared Autonomy

Weihao Tan, David Koleczek, Siddhant Pradhan, Nicholas Perello, Vivek Chettiar, Nan Ma, Aaslesha Rajaram, Vishal Rohra, Soundar Srinivasan, H M Sajjad Hossain, Yash Chandak

Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022)

Shared autonomy refers to approaches for enabling an autonomous agent to collaborate with a human with the aim of improving human performance. However, besides improving performance, it may often also be beneficial that the agent concurrently accounts for preserving the user's experience or satisfaction of collaboration. We propose two model-free reinforcement learning methods that can account for both hard and soft constraints on the number of interventions. We show that not only does our method outperform the existing baseline, but also eliminates the need to manually tune a black-box hyperparameter for controlling the level of assistance. Code available at: https://github.com/DavidKoleczek/human_marl

Publication

Intervention Aware Shared Autonomy

Weihao Tan, David Koleczek, Siddhant Pradhan, Nicholas Perello, Vivek Chettiar, Nan Ma, Aaslesha Rajaram, Vishal Rohra, Soundar Srinivasan, H M Sajjad Hossain, Yash Chandak

HumanAI workshop @ Thirty-eighth International Conference on Machine Learning (ICML 2021)

Shared autonomy refers to approaches for enabling an autonomous agent to collaborate with a human with the aim of improving human performance. However, besides improving performance, it may often be beneficial that the agent concurrently accounts for preserving the user's experience or satisfaction of collaboration. We propose two model-free reinforcement learning methods that can account for both hard and soft constraints on the number of interventions.

Project

mlfeed.tech

An NLP-powered web application to automatically curate tweets from the machine learning community on Twitter. The content was also reposted on Twitter.