Data Mining, Cleansing, and Alignment
MSK Extract
Since the time of Florence Nightingale, scientists have grappled with unstructured data. The nature of research, often siloed due to funding constraints, ensures teams work in isolation. It’s only after publication that data sharing is considered, leaving datasets unaligned and often unusable without significant effort.
MSK Extract simplifies and elevates the process of data alignment, particularly for researchers unaware of the need to clean and normalize their data. Previously, researchers relied on terminologists and developers to process their datasets—a process that could take months. MSK Extract reduces this timeline to mere hours. Once data is ingested and aligned, ownership can be defined, contributors can be governed, and data integrity is safeguarded against loss or corruption.
I played a pivotal role in transforming MSK Extract, driving it from a 92% abandonment rate to 100% adoption. Here are some of my key contributions:
John Philip’s original wires in Balsamiq
MSK Extract original application
1. Research and Discovery
The original product was designed ad hoc for one primary user: Molly, a terminologist. It met her needs, but as word spread among other researchers, Molly was overwhelmed with inquiries. This prompted Extract to pivot toward a self-service model. I began by interviewing colleagues who were submitting data to Molly, uncovering their perceptions of Extract and its potential. These interviews yielded a “top ten” list of frustrations and desired functionality, which shaped the product’s trajectory and guided its evolution.
2. From Pipeline to Project
During research readouts, I identified recurring but unspoken frustrations:
• Corrupted data
• Data files too large to process
• Lost or unbacked-up data
• Data localized on individual workstations
• Forked datasets caused by duplicate updates
These issues stemmed from the traditional pipeline structures common in research organizations. I proposed abandoning the pipeline model in favor of a cloud-based, project-oriented structure. By retiring the ad-hoc feature-driven approach and emphasizing componentization, we could expose core functionalities through APIs. This concept was well-received by the architecture team, initiating our migration to the cloud.
Individual pipeline apps were bundled together in Extract
3. Inclusive Design
Early designs frustrated Molly, who struggled to locate primary functionality, despite positive feedback from other researchers. During a conference, Molly revealed she couldn’t see a link on a presenter’s slide. The realization that Molly is color-blind struck me—a “smack in the head” moment. Early designs relied heavily on blue tones, which Molly couldn’t distinguish. I adapted the interface to prioritize primary actions over color, creating a “Molly Theme” tailored for her. I even designed an Orca logo for Molly, inspired by her love for orcas. (Don’t tell FedEx—I may have borrowed inspiration from their design!)
4. Metrics and Analytics
Analytics are vital for any modern software. With 15 years of experience in data analytics, I naturally embedded “Analytics Inside” into the system. When MSK adopted HEAP as its analytics platform, I encountered a challenge: the application, built with an efficient version of ANT Design, had overly generalized markup, making data extraction difficult. Drawing on my front-end skills, I designed a migration path for the primarily back-end development team. Using a mix of data attributes and wrappers, I ensured the design system could support analytics without risk of divergence. Presenting this solution impressed the developers and reinforced the importance of designing systems that consider both form and function.
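The data-attribute-and-wrapper idea can be sketched roughly as follows. This is a minimal illustration, not the code used in MSK Extract: the attribute names (`data-analytics-id`, `data-analytics-action`) and the helpers `analyticsAttrs`, `wrapForAnalytics`, and `selectorFor` are hypothetical, and it assumes the analytics tool can target elements through CSS attribute selectors.

```typescript
// Illustrative sketch only: all names here are hypothetical, not MSK Extract's code.

type AnalyticsProps = {
  id: string;      // stable identifier for the element, assumed to be a simple slug
  action?: string; // optional verb describing the interaction, e.g. "click"
};

// Build data-* attributes that stay stable even when the design system's
// generated class names or markup change.
function analyticsAttrs(props: AnalyticsProps): Array<[string, string]> {
  const attrs: Array<[string, string]> = [["data-analytics-id", props.id]];
  if (props.action !== undefined) {
    attrs.push(["data-analytics-action", props.action]);
  }
  return attrs;
}

// Escape characters that would break a double-quoted HTML attribute value.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Wrap existing component markup in a neutral <span> that carries the attributes,
// so the component itself does not need to be modified.
function wrapForAnalytics(innerHtml: string, props: AnalyticsProps): string {
  const attrs = analyticsAttrs(props)
    .map(([name, value]) => `${name}="${escapeAttr(value)}"`)
    .join(" ");
  return `<span ${attrs}>${innerHtml}</span>`;
}

// CSS selector an analytics tool could use to find the wrapped element.
function selectorFor(id: string): string {
  return `[data-analytics-id="${id}"]`;
}
```

For example, `wrapForAnalytics("<button>Create</button>", { id: "project-create", action: "click" })` yields a span that `selectorFor("project-create")` matches, regardless of how the button inside is styled or restructured.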
Take a prototype for a drive!
Research, Design, and Prototypes
As with most of my projects at MSK, we went through a research-heavy “Design Think” process to understand not only the work to be done but also how that work should best be done.
These alignment artifacts are the result of iterative testing, from interviews to high-end prototypes. The insight gleaned from taking the time to research effectively is immeasurable. How else would we have known that
- different research silos use different terms for the same operation?
- Or that everyone knew that they wanted to use Extract, but no one knew exactly what it was?
- Or that individuals would often abandon projects because they didn’t know how to prepare their data?
- Or that nobody really understood why MSK Extract had a “Shop”? What could a data application sell?
Here is a link to a high-fidelity prototype I created for user testing. Most of it is probably broken now, which is appropriate for a public portfolio, but you can get an idea of what we were doing.
Fragments for a History of Extract Design
Below are somewhat random alignment artifacts from the past three years.
Figma Derivatives and Thought Exercises
Designing the Project Checklist
Designing Data Mapping
Self-service research data curation: One major pain point for researchers at MSK is that the high-value data points are in unstructured text in the doctors’ notes. Many researchers consequently curate the data manually for their research projects into spreadsheets. Over the past few years, my team and I have been working on solutions to capture this curation and make it a reusable data asset. John interviewed many researchers and started building prototypes that we could test easily.
With John and his design team, we were able to build a few workflows into a product called MSK Extract, available now to all researchers at MSK. MSK Extract allows researchers to build a curation database without knowing what SQL or ER diagrams are; load spreadsheet data into their custom database; map their data elements to MSK’s standard terminology; and share their data back to MSK when they are ready to have it reused. John’s designs have made it intuitive for researchers; some onboard without using the available training videos or documentation. Over the past three years, more than 300 databases have been created using MSK Extract.
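The step of mapping spreadsheet data elements to a standard terminology can be illustrated with a small sketch. Everything here is an assumption made for illustration: the vocabulary in `STANDARD_TERMS`, the synonyms, and the function `mapHeaders` are hypothetical and do not reflect MSK’s actual terminology or how MSK Extract performs the mapping.

```typescript
// Illustrative sketch only: the vocabulary and function names are hypothetical.

// Each standard term with the synonyms different research groups might use for it.
const STANDARD_TERMS: Array<{ term: string; synonyms: string[] }> = [
  { term: "patient_id", synonyms: ["mrn", "patient id", "pt id"] },
  { term: "diagnosis_date", synonyms: ["dx date", "date of diagnosis"] },
];

// Normalize a header for comparison: trim, lower-case, and collapse
// runs of whitespace, underscores, and hyphens into single spaces.
function normalize(header: string): string {
  return header.trim().toLowerCase().replace(/[\s_\-]+/g, " ");
}

type MappingResult = {
  mapped: Array<{ header: string; term: string }>; // headers matched to a standard term
  unmapped: string[];                              // headers left for human review
};

function mapHeaders(headers: string[]): MappingResult {
  // Index every normalized synonym (and the term itself) by its standard term.
  const lookup = new Map<string, string>();
  for (const entry of STANDARD_TERMS) {
    lookup.set(normalize(entry.term), entry.term);
    for (const synonym of entry.synonyms) {
      lookup.set(normalize(synonym), entry.term);
    }
  }

  const result: MappingResult = { mapped: [], unmapped: [] };
  for (const header of headers) {
    const term = lookup.get(normalize(header));
    if (term !== undefined) {
      result.mapped.push({ header: header, term: term });
    } else {
      result.unmapped.push(header);
    }
  }
  return result;
}
```

Calling `mapHeaders(["MRN", "Dx_Date", "Tumor Size"])` maps the first two headers and leaves "Tumor Size" in `unmapped`. Keeping unmatched headers in a separate list, rather than guessing, leaves the final decision with the researcher.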
Conclusion
Reflecting on three years of work with MSK Extract is both rewarding and exhausting. It has been a journey of learning, innovation, and collaboration, embodying the principles of design thinking and pushing the boundaries of what’s possible in oncology research.