PyHealth Datasets

21 Clinical Datasets

A unified API for loading healthcare data across electronic health records, physiological signals, medical imaging, genomics, and clinical text — all ready for machine learning.

Actively growing: new datasets are added regularly.

Browse Datasets (21)

MIMIC3Dataset EHR

MIMIC-III Critical Care database with 40K+ ICU patients, covering diagnoses, procedures, medications, lab results, and free-text notes.

from pyhealth.datasets import MIMIC3Dataset
dataset = MIMIC3Dataset(
    root="data/mimic3/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD"],
)
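
Before constructing a loader it can help to confirm that the requested tables are actually on disk. The helper below is a minimal sketch, not part of PyHealth: the name missing_table_files is hypothetical, and it assumes each table is stored as <TABLE>.csv or <TABLE>.csv.gz directly under root.

```python
from pathlib import Path


def missing_table_files(root, tables, suffixes=(".csv", ".csv.gz")):
    """Return the table names that have no matching file under root.

    Hypothetical helper for illustration; assumes files are named
    <TABLE><suffix> and sit directly inside the root directory.
    """
    root_path = Path(root)
    missing = []
    for table in tables:
        # A table counts as present if a file with any accepted suffix exists.
        found = any((root_path / f"{table}{suffix}").is_file() for suffix in suffixes)
        if not found:
            missing.append(table)
    return missing


# Lists whichever of the two tables are absent from the local directory.
print(missing_table_files("data/mimic3/", ["DIAGNOSES_ICD", "PROCEDURES_ICD"]))
```

An empty list means every requested table was found, so the constructor call above should at least be able to locate its inputs.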
MIMIC4Dataset EHR

MIMIC-IV with 300K+ patients spanning hospital and ICU encounters, including structured EHR tables and clinical notes.

from pyhealth.datasets import MIMIC4Dataset
dataset = MIMIC4Dataset(
    root="data/mimic4/",
    tables=["diagnoses_icd", "labevents"],
)
eICUDataset EHR

Multi-center eICU Collaborative Research Database with 200K+ ICU stays from 208 hospitals across the US.

from pyhealth.datasets import eICUDataset
dataset = eICUDataset(
    root="data/eicu/",
    tables=["diagnosis", "medication"],
)
OMOPDataset EHR

OMOP Common Data Model loader — a standardized format for observational health data used widely across academic medical centers.

from pyhealth.datasets import OMOPDataset
dataset = OMOPDataset(
    root="data/omop/",
    tables=["condition_occurrence"],
)
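
Because OMOP tables share standard column names, a few lines of plain Python can preview the kind of per-patient grouping a loader has to perform. This is an illustration only, not PyHealth's implementation. It assumes a CSV export with the OMOP CDM columns person_id, condition_start_date, and condition_concept_id; the helper name and the sample rows (including the concept IDs) are made up.

```python
import csv
import io
from collections import defaultdict


def conditions_by_person(csv_text):
    """Group condition_occurrence rows by person, ordered by start date.

    Hypothetical helper; expects OMOP-style column names in the header.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        event = (row["condition_start_date"], row["condition_concept_id"])
        grouped[row["person_id"]].append(event)
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return {person: sorted(events) for person, events in grouped.items()}


# Made-up sample rows; the concept IDs are placeholders.
sample = """person_id,condition_concept_id,condition_start_date
1,1002,2020-03-01
1,1001,2019-11-15
2,1003,2021-06-30
"""
timeline = conditions_by_person(sample)
print(timeline["1"])  # [('2019-11-15', '1001'), ('2020-03-01', '1002')]
```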
MIMICExtractDataset EHR

MIMIC-Extract pre-processed cohort with time-series vitals and lab values for 34K ICU patients, ready for time-series modeling.

from pyhealth.datasets import MIMICExtractDataset
dataset = MIMICExtractDataset(
    root="data/mimic-extract/",
)
EHRShotDataset EHR

EHRShot few-shot benchmark derived from EHRSFM with 15 clinical prediction tasks for evaluating foundation models on EHR data.

from pyhealth.datasets import EHRShotDataset
dataset = EHRShotDataset(
    root="data/ehrshot/",
)
Support2Dataset EHR

SUPPORT2 study dataset for survival prediction with demographics, vitals, labs, comorbidities, and 2-month / 6-month survival labels.

from pyhealth.datasets import Support2Dataset
dataset = Support2Dataset(
    root="data/support2/",
)
CardiologyDataset Biosignal

Multi-label ECG dataset for cardiac arrhythmia detection with 12-lead signals annotated across multiple cardiac conditions.

from pyhealth.datasets import CardiologyDataset
dataset = CardiologyDataset(
    root="data/cardiology/",
)
ISRUCDataset Biosignal

ISRUC-Sleep dataset with polysomnography recordings (EEG, EOG, EMG) from healthy subjects and patients with sleep disorders.

from pyhealth.datasets import ISRUCDataset
dataset = ISRUCDataset(
    root="data/isruc/",
)
DREAMTDataset Biosignal

DREAMT study dataset with multi-night wrist-worn wearable sensor data for sleep staging in a free-living environment.

from pyhealth.datasets import DREAMTDataset
dataset = DREAMTDataset(
    root="data/dreamt/",
)
SHHSDataset Biosignal

Sleep Heart Health Study with overnight polysomnography from 5,800+ participants for sleep staging and cardiovascular risk research.

from pyhealth.datasets import SHHSDataset
dataset = SHHSDataset(
    root="data/shhs/",
)
SleepEDFDataset Biosignal

SleepEDF-78/Cassette dataset with full-night EEG recordings from 78 healthy subjects, widely used for sleep staging benchmarks.

from pyhealth.datasets import SleepEDFDataset
dataset = SleepEDFDataset(
    root="data/sleepedf/",
)
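
Sleep staging is usually scored on fixed-length windows, commonly 30 seconds each. The sketch below shows that windowing step on a plain Python list. It is illustrative only and does not reproduce PyHealth's own preprocessing; the function name is hypothetical, and the 100 Hz sampling rate is an assumption that should be replaced with the rate of your recordings.

```python
def split_into_epochs(signal, sampling_rate_hz, epoch_seconds=30):
    """Split a 1-D sequence of samples into non-overlapping, fixed-length epochs.

    Hypothetical helper; a trailing partial epoch is dropped.
    """
    samples_per_epoch = int(sampling_rate_hz * epoch_seconds)
    if samples_per_epoch <= 0:
        raise ValueError("epoch length must cover at least one sample")
    n_epochs = len(signal) // samples_per_epoch
    return [
        signal[i * samples_per_epoch:(i + 1) * samples_per_epoch]
        for i in range(n_epochs)
    ]


# 95 seconds of a synthetic channel at an assumed 100 Hz: three full epochs fit.
recording = [0.0] * (95 * 100)
epochs = split_into_epochs(recording, sampling_rate_hz=100)
print(len(epochs), len(epochs[0]))  # 3 3000
```

Each epoch would then be paired with one sleep-stage label for training.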
BMDHSDataset Biosignal

BMDHS heart sound dataset with phonocardiogram recordings for cardiac valve disease classification and auscultation research.

from pyhealth.datasets import BMDHSDataset
dataset = BMDHSDataset(
    root="data/bmdhs/",
)
TUABDataset Biosignal

Temple University EEG Abnormal (TUAB) corpus with clinical EEG recordings labeled as normal or abnormal by board-certified neurologists.

from pyhealth.datasets import TUABDataset
dataset = TUABDataset(
    root="data/tuab/",
)
TUEVDataset Biosignal

Temple University EEG Events (TUEV) corpus annotated with six EEG event types including spike-wave complexes and artifacts.

from pyhealth.datasets import TUEVDataset
dataset = TUEVDataset(
    root="data/tuev/",
)
COVID19CXRDataset Image

COVID-19 chest X-ray dataset for binary classification of COVID-19 images against normal and pneumonia images, pooled from multiple sources.

from pyhealth.datasets import COVID19CXRDataset
dataset = COVID19CXRDataset(
    root="data/covid19cxr/",
)
ChestXray14Dataset Image

NIH ChestX-ray14 with 112K frontal chest X-rays from 30K patients, annotated with 14 disease labels for multi-label classification.

from pyhealth.datasets import ChestXray14Dataset
dataset = ChestXray14Dataset(
    root="data/chestxray14/",
)
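
Multi-label classification means each image can carry several of the 14 findings at once, so the target is a multi-hot vector rather than a single class. The following sketch shows one way to build that vector. It assumes findings arrive as a pipe-separated string (for example "Cardiomegaly|Effusion", with "No Finding" for clean images), as in the NIH metadata file; the function name is hypothetical and this is not PyHealth's own encoding code.

```python
CHESTXRAY14_LABELS = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def encode_findings(finding_labels, classes=CHESTXRAY14_LABELS):
    """Convert a pipe-separated findings string into a multi-hot 0/1 list.

    Hypothetical helper; "No Finding" maps to an all-zero vector.
    """
    present = {name.strip() for name in finding_labels.split("|")}
    unknown = present - set(classes) - {"No Finding"}
    if unknown:
        raise ValueError(f"unrecognized labels: {sorted(unknown)}")
    return [1 if name in present else 0 for name in classes]


target = encode_findings("Cardiomegaly|Effusion")
print(len(target), sum(target))  # 14 2
```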
ClinVarDataset Genomics

NCBI ClinVar database of genomic variants and their clinical significance, used for variant pathogenicity classification.

from pyhealth.datasets import ClinVarDataset
dataset = ClinVarDataset(
    root="data/clinvar/",
)
COSMICDataset Genomics

Catalogue of Somatic Mutations in Cancer (COSMIC) for predicting the functional impact of somatic mutations in cancer.

from pyhealth.datasets import COSMICDataset
dataset = COSMICDataset(
    root="data/cosmic/",
)
TCGAPRADDataset Genomics

TCGA Prostate Adenocarcinoma (PRAD) multi-omics dataset for cancer survival prediction using RNA-seq, CNV, and mutation profiles.

from pyhealth.datasets import TCGAPRADDataset
dataset = TCGAPRADDataset(
    root="data/tcga/",
)
MedicalTranscriptionsDataset Text

Kaggle medical transcriptions dataset with 5K+ clinical notes across 40 specialties for specialty classification and NLP tasks.

from pyhealth.datasets import MedicalTranscriptionsDataset
dataset = MedicalTranscriptionsDataset(
    root="data/medical_transcriptions/",
)

Need a custom dataset?

PyHealth makes it easy to add your own data source by subclassing BaseDataset. Your custom dataset then works with all PyHealth tasks, models, and the trainer, with no extra wiring needed.
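
The general shape of the subclassing pattern can be sketched in plain Python. Everything here is a stand-in: the BaseDataset class defined in the snippet is a toy written for this illustration, not PyHealth's BaseDataset, whose real constructor arguments and hooks are not reproduced. MyClinicDataset, visits.csv, and its patient_id and code columns are likewise hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


class BaseDataset:
    """Toy stand-in for illustration only; NOT PyHealth's BaseDataset."""

    def __init__(self, root):
        self.root = Path(root)
        # The base class owns the workflow; subclasses supply the parsing.
        self.patients = self.load_patients()

    def load_patients(self):
        raise NotImplementedError("subclasses decide how to read their files")


class MyClinicDataset(BaseDataset):
    """Hypothetical custom dataset reading visits.csv (patient_id, code)."""

    def load_patients(self):
        patients = defaultdict(list)
        with open(self.root / "visits.csv", newline="") as handle:
            for row in csv.DictReader(handle):
                patients[row["patient_id"]].append(row["code"])
        return dict(patients)
```

With a visits.csv file in place, MyClinicDataset("data/my_clinic/").patients would map each patient ID to its list of codes. The subclass only describes how to read its files and inherits everything else from the base class.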
