
SETlib

Centralizing 17 years of course content to cut worksheet preparation time by 81%

Role
Product Design Intern
Team
3 Designers · 5 Engineers · Project Manager
Timeline
June 2025 – January 2026
Tools
Figma · React

100,000+ files organized by date — but faculty think in topics. I redesigned how CS facilitators find and assemble course materials across two major releases.

Facilitators at UW Tacoma build weekly CS worksheets from 17 years of archived problems — but everything was filed by quarter and week, not by concept. This mismatch cost 3–5 hours per week, and every facilitator I interviewed admitted they'd given up on searches entirely.

I owned the facilitator experience end-to-end. V1 was a defensive MVP built around an unreliable AI parser. When production revealed critical limitations, I used those findings to shape engineering requirements for a new parser, then redesigned the key workflow in V2.

Facilitators spent 3–5 hours per week hunting for problems because 17 years of content was organized by date, not by how anyone actually teaches.

Facilitators are undergraduate TAs paid 10 hours per week, with six hours outside the classroom split between meetings and creating unique weekly worksheets. To build these, they pull from 17 years of archived problems stored across scattered Google Drives, organized chronologically by quarter and week.

But facilitators think in concepts: "I need a binary tree problem" or "I need something on recursion." The system's organization didn't match anyone's mental model, so finding the right problems meant manually browsing folders until something looked right, or giving up and rewriting from scratch.

Current Facilitator Workflow (3–5 Hours Weekly)

1. Open Google Drive: navigate to the course archive.
2. Browse by quarter: Fall 2023 → Winter 2022 → Spring 2021...
3. Open a folder, scan files: read filenames, guess at content.
4. Wrong topic (friction): the file doesn't match what they need.
5. Go back, try another folder (friction): repeat across quarters and weeks, looping back to step 2.
6. Give up searching (failure): the cost of searching exceeds the cost of rewriting.
7. Rewrite from scratch (failure): duplicate work that already exists somewhere in the archive.

The AI parser meant to automate content migration was unreliable, and that shaped every design decision.

The project had an AI-powered LaTeX parser responsible for extracting and structuring archived content. But it could only handle LaTeX files, struggled with complex math notation and embedded images, and failed unpredictably. This was the system's defining technical constraint.

The two primary user groups also had no communication loop. Facilitators created worksheets while professors oversaw curriculum quality, but there was no approval process, feedback mechanism, or shared visibility into what was being created.

Every facilitator I interviewed had given up searching for content. The system wasn't slow; it was failing them entirely.

I interviewed 6 facilitators and 2 professors, focusing on where time was lost and what prevented them from finding the right problems.

Users weren't just inefficient. They were abandoning searches because the time spent finding a problem exceeded the cost of rewriting one from scratch. This reframed the problem from "make search faster" to "make search trustworthy enough that people use it at all."

I also found that facilitators and professors had fundamentally different needs. Facilitators needed speed. Professors needed oversight and the ability to approve content. They needed the same system with different priorities.

I built a React prototype and integrated it with the parser API to understand exactly where the system would break.

Engineering had shared that the parser struggled with complex formatting, but I needed concrete data to design effective failure states. Rather than wait for the full implementation, I built a functional frontend in React and connected it to the backend API to stress-test the parser with real content before committing to design decisions.

Basic LaTeX parsed reliably. Complex notation, math diagrams, and anything with embedded structure broke consistently. Most importantly, failures were unpredictable: two nearly identical problems could produce completely different outputs.
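
Mechanically, the stress test was just a loop: feed each labeled archive sample to the parser, record which ones fail, then group the failures by content type. A minimal sketch in TypeScript; the `/api/parse` endpoint, payload, and response shape are hypothetical stand-ins, not SETlib's actual API:

```typescript
// Minimal stress-test harness. Endpoint and payload shape are
// illustrative stand-ins, not the real parser API.
type ParseResult = { ok: boolean; error?: string };
type Parser = (source: string) => Promise<ParseResult>;

// Calls the (hypothetical) parser endpoint for one source file.
const apiParser: Parser = async (source) => {
  const res = await fetch("/api/parse", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ source }),
  });
  return res.ok ? { ok: true } : { ok: false, error: `HTTP ${res.status}` };
};

// Runs a labeled batch through a parser and returns the labels that
// failed, so failure patterns can be tallied by content type afterwards.
async function stressTest(
  samples: { label: string; source: string }[],
  parse: Parser = apiParser
): Promise<string[]> {
  const failures: string[] = [];
  for (const s of samples) {
    const result = await parse(s.source);
    if (!result.ok) failures.push(s.label);
  }
  return failures;
}
```

Keeping the parser injectable makes it easy to replay the same labeled batch against a stub or against the live endpoint.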

React prototype integrated with parser API (V1)

The parser failed unpredictably, so V1 was built to expose problems rather than hide them.

Every decision in V1 traces back to one question: when the system gets something wrong, how does the user know, and what can they do about it?

The facilitator-facing experience (search, assembly, review, dashboards) shipped as the MVP, while the parsing and content insertion tools ran as a separate internal interface. This let us deliver value to facilitators immediately while engineering continued developing the parser in parallel.

Side-by-side validation gave users full transparency into what the parser got right and wrong.

Every uploaded problem required manual review before it could enter the system. I designed a split-view that placed the original source file next to the parsed output so users could instantly spot where content was misinterpreted. Errors were flagged inline rather than hidden behind generic warnings, and manual correction tools let users fix issues directly.

This was a deliberate trade-off: mandatory review slowed down every upload, even successful ones. But with the parser failing this frequently, trust mattered more than speed.

Side-by-side validation (V1)

Search was reorganized around concepts using patterns facilitators already understood.

The taxonomy shifted from chronological filing to topic-based browsing. Instead of navigating folders labeled "Fall 2019, Week 7," facilitators could filter by concepts like Data Structures, Algorithms, or Recursion. I validated these categories against past curriculum and lecture sequences to make sure the taxonomy matched real mental models.

Difficulty filtering (Easy, Medium, Hard) was inspired by LeetCode, a platform CS students already use to judge problem difficulty. The overall interaction followed an e-commerce pattern: browse, filter, add to cart. Users don't need to learn a new system. They're shopping for problems.
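
The browse-and-filter interaction reduces to a small predicate over topic and difficulty metadata. A sketch, assuming a hypothetical `Problem` shape rather than SETlib's actual schema:

```typescript
// Hypothetical shape of an archived problem after migration; field
// names are illustrative, not the production schema.
type Difficulty = "Easy" | "Medium" | "Hard";

interface Problem {
  id: string;
  title: string;
  topics: string[]; // e.g. ["Data Structures", "Recursion"]
  difficulty: Difficulty;
}

// E-commerce-style narrowing: filter by concept first, then by
// difficulty. Empty filter lists match everything.
function filterProblems(
  problems: Problem[],
  topics: string[] = [],
  difficulties: Difficulty[] = []
): Problem[] {
  return problems.filter(
    (p) =>
      (topics.length === 0 || topics.some((t) => p.topics.includes(t))) &&
      (difficulties.length === 0 || difficulties.includes(p.difficulty))
  );
}
```

Because a problem can carry several topic tags, the same archived file surfaces under every concept it teaches, which is what chronological foldering could never do.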

Problem database with filters (V1)

Assembly and review mirrored real facilitator behavior and closed the communication gap with professors.

Once problems were selected, facilitators moved into assembly where they could reorder and finalize their worksheet. A summary panel provided real-time feedback on difficulty balance and problem count, because facilitators told me they didn't want surprises at the review stage.
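
The real-time summary is a running tally over the currently selected problems. A sketch, using an illustrative problem shape rather than the actual data model:

```typescript
// Sketch of the summary panel's tally; the shape is illustrative.
type Difficulty = "Easy" | "Medium" | "Hard";

interface WorksheetSummary {
  total: number;
  byDifficulty: Record<Difficulty, number>;
}

// Recomputed on every add/remove/reorder so facilitators see the
// difficulty balance before they ever reach the review step.
function summarize(problems: { difficulty: Difficulty }[]): WorksheetSummary {
  const byDifficulty: Record<Difficulty, number> = { Easy: 0, Medium: 0, Hard: 0 };
  for (const p of problems) byDifficulty[p.difficulty] += 1;
  return { total: problems.length, byDifficulty };
}
```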

The final review step was intentionally simple: a clean preview of the worksheet as it would appear when published, with a reviewer notes field so facilitators could provide context to professors directly alongside their submission. This eliminated back-and-forth that previously happened over email or not at all.

Assembly view (V1)
Review and submit (V1)

The dashboard surfaced what matters most for each role.

Facilitators saw pending approvals at the top as an accountability loop, with recent worksheets below for quick reference and reuse. Professors saw all pending approvals across sections with submitter names and timestamps, giving them oversight without requiring facilitators to push updates.

Facilitator dashboard (V1)
Professor dashboard (V1)

Prep time dropped from 4.2 hours to 0.8 hours, but production exposed a critical limitation.

We validated V1 with 8 facilitators over 4 weeks using identical tasks. The results confirmed the core concept worked: an 81% reduction in prep time, with facilitators completing worksheets in a single sitting instead of spreading work across multiple days.

Facilitators told us prep finally felt predictable. Professors said seeing all pending submissions in one place removed years of back-and-forth. But once V1 went into production for fall quarter, we saw something our controlled tests hadn't revealed.

81%
Prep Time Reduction
From 4.2 hours to 0.8 hours per worksheet
Projected Annual Savings
Reduced labor costs from streamlined preparation
8
Facilitators Validated
4-week controlled study with identical tasks

90% of parser failures traced back to embedded images, the most common content type in CS problems.

Our prototype testing used small, controlled, text-only samples. Production was different. Real CS problems are full of tree diagrams, graph visualizations, screenshots, and annotated figures. The parser couldn't interpret any of them.

V1's core promise broke down for exactly the content facilitators relied on most. Mandatory review became the default experience rather than the safety net we designed it as. Facilitators started ignoring the review step or asking to bypass parsing altogether. One faculty member put it directly: "Can I just insert problems manually?"

The defensive UX was working as designed. But the constraint it was designed around turned out to be far more severe than initial testing suggested. UX alone couldn't fix this.

My production findings directly shaped what engineering built next.

Engineering had been developing a hybrid LLM parser in parallel, and it reached readiness around the same time these issues surfaced. Our teams were in regular communication, and the failure patterns I surfaced from production helped inform the new parser's priorities.

I brought three inputs to that conversation: the 90% image failure data, the pattern of which content types broke versus succeeded, and user feedback showing mandatory review was creating friction even on successful parses. These helped engineering prioritize image interpretation, multi-format support, and granular diagnostics rather than just improving LaTeX accuracy.

V1 proved the workflow concept. Production proved the parser needed to evolve. The data I gathered from real usage became the bridge between the two.

A hybrid LLM parser changed what was possible, so I redesigned the workflow around trust instead of caution.

Engineering's new parser combined rule-based parsing with AI-driven interpretation. It handled PDFs, Word files, and mixed formats, interpreted embedded images, and surfaced specific diagnostics about what it struggled with and why.

This fundamentally changed the design problem. V1 asked "how do we help users recover from failure?" V2 asked "how do we help users know when they don't need to intervene at all?"

I designed the confidence scoring framework that engineering built the parser's output around.

The new parser could assess its own accuracy but had no way to communicate that to users. I defined the threshold framework that translated parser performance into user-facing guidance.

High-confidence parses surfaced a score and a clear success state, with "trust and move on" as the default. Medium-confidence parses flagged specific issues while showing what succeeded. Low-confidence parses triggered a full diagnostic view with guided next steps.

Engineering built the parser's API to surface scores and diagnostics against this framework. The threshold definitions, diagnostic categories, and mapping between accuracy ranges and user actions all originated from the design side.
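
Reduced to code, the framework is a mapping from a raw accuracy score to a tier and a default user action. A sketch with illustrative cutoffs; the calibrated production thresholds and action names are not reproduced here:

```typescript
// Sketch of the confidence-tier mapping. The 0.9 / 0.6 cutoffs and
// action names are illustrative, not the calibrated production values.
type Confidence = "high" | "medium" | "low";

interface ParseGuidance {
  level: Confidence;
  defaultAction: "trust-and-continue" | "review-flagged-sections" | "open-diagnostics";
}

function classifyParse(score: number): ParseGuidance {
  if (score >= 0.9) return { level: "high", defaultAction: "trust-and-continue" };
  if (score >= 0.6) return { level: "medium", defaultAction: "review-flagged-sections" };
  return { level: "low", defaultAction: "open-diagnostics" };
}
```

Biasing the top cutoff upward trades some unnecessary medium-tier reviews for the guarantee that "high confidence" is almost never wrong.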

High-confidence parse with confidence score and preview (V2)

When the parser struggled, V2 showed exactly why and gave users clear paths forward.

Low-confidence parses surfaced a diagnostic breakdown with plain-language explanations ("LaTeX equations not fully preserved" or "parser could not confidently determine problem difficulty") rather than generic error messages. Recommended actions gave users concrete next steps: edit directly, try a different file format, or switch to manual insertion.

Low-confidence parse with diagnostics and recommended actions (V2)

Manual insertion became a first-class workflow, not just a fallback.

This came directly from V1 feedback. I designed a complete manual entry flow: type problem content, attach images, set metadata, and see a live preview. Positioning this as a deliberate workflow rather than an emergency exit was important. Some content will always be easier to enter manually than to parse, and giving users that option without making it feel like a failure state respected their judgment.

Manual problem entry with live preview (V2)

V2's trade-off: the confidence scores had to be trustworthy for optional editing to work.

If a high-confidence score ever let a badly parsed problem through, users would stop trusting the system and revert to reviewing everything, recreating V1's friction without V1's safety net. I set the bar for "high confidence" high enough that users who skipped review would rarely encounter errors, even if that meant more uploads landed in the medium-confidence tier than strictly necessary.

V2 eliminated the friction that held V1 back.

V1 proved the core concept: topic-based search and structured workflows cut prep time by 81%. But mandatory review slowed adoption when the parser couldn't keep up. V2 replaced blanket caution with calibrated trust.

Fewer Mandatory Reviews
While maintaining 96% content accuracy
Trust-by-Default Adoption
Users completed uploads without manual edits
Facilitator Adoption
Professors mandated the switch within one quarter

V1 saw inconsistent usage during fall quarter because facilitators didn't trust the parser. After V2 shipped for winter quarter, professors mandated the switch. The system that faculty once worked around became the one they required.

When your backend is unreliable, transparency becomes the product.

V1 taught me that users will tolerate imperfect systems if they can see what's happening and fix what's wrong. V2 taught me the inverse: when the system becomes reliable, the fastest thing you can do is get out of the user's way. Trust isn't a feature you add. It's a relationship between system capability and interface honesty that has to be recalibrated every time the technology changes.

Getting close to engineering made me a better designer.

Building the React prototype, stress-testing the parser, and defining the confidence scoring framework all required understanding the system at a technical level. That proximity changed which questions I asked, which constraints I pushed back on, and which ones I designed around. The most impactful decisions in this project came from understanding the backend well enough to know what was possible and what was worth advocating for.

What I'd do differently.

The React prototype was the right instinct, but starting it earlier would have caught the image failure pattern before launch rather than after. And the engineering collaboration that drove V2's success would have been even more effective as the default working mode from day one rather than something I built toward over time.