PyHealth Datasets
21 Clinical Datasets
A unified API for loading healthcare data across electronic health records, physiological signals, medical imaging, genomics, and clinical text — all ready for machine learning.
MIMIC-III Critical Care database with 40K+ ICU stays covering diagnoses, procedures, medications, lab results, and free-text notes.
from pyhealth.datasets import MIMIC3Dataset

dataset = MIMIC3Dataset(
    root="data/mimic3/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD"],
)
MIMIC-IV with 300K+ patients spanning hospital and ICU encounters, including structured EHR tables and clinical notes.
from pyhealth.datasets import MIMIC4Dataset

dataset = MIMIC4Dataset(
    root="data/mimic4/",
    tables=["diagnoses_icd", "labevents"],
)
Multi-center eICU Collaborative Research Database with 200K+ ICU stays from 208 hospitals across the US.
from pyhealth.datasets import eICUDataset

dataset = eICUDataset(
    root="data/eicu/",
    tables=["diagnosis", "medication"],
)
OMOP Common Data Model loader — a standardized format for observational health data used widely across academic medical centers.
from pyhealth.datasets import OMOPDataset

dataset = OMOPDataset(
    root="data/omop/",
    tables=["condition_occurrence"],
)
MIMIC-Extract pre-processed cohort with time-series vitals and lab values for 34K ICU patients, ready for time-series modeling.
from pyhealth.datasets import MIMICExtractDataset

dataset = MIMICExtractDataset(
    root="data/mimic-extract/",
)
EHRShot few-shot benchmark derived from EHRSFM with 15 clinical prediction tasks for evaluating foundation models on EHR data.
from pyhealth.datasets import EHRShotDataset

dataset = EHRShotDataset(
    root="data/ehrshot/",
)
SUPPORT2 study dataset for survival prediction with demographics, vitals, labs, comorbidities, and 2-month / 6-month survival labels.
from pyhealth.datasets import Support2Dataset

dataset = Support2Dataset(
    root="data/support2/",
)
Multi-label ECG dataset for cardiac arrhythmia detection with 12-lead signals annotated across multiple cardiac conditions.
from pyhealth.datasets import CardiologyDataset

dataset = CardiologyDataset(
    root="data/cardiology/",
)
ISRUC-Sleep dataset with polysomnography recordings (EEG, EOG, EMG) from healthy subjects and patients with sleep disorders.
from pyhealth.datasets import ISRUCDataset

dataset = ISRUCDataset(
    root="data/isruc/",
)
DREAMT study dataset with multi-night wrist-worn EEG data for sleep staging in a free-living environment.
from pyhealth.datasets import DREAMTDataset

dataset = DREAMTDataset(
    root="data/dreamt/",
)
Sleep Heart Health Study with overnight polysomnography from 5,800+ participants for sleep staging and cardiovascular risk research.
from pyhealth.datasets import SHHSDataset

dataset = SHHSDataset(
    root="data/shhs/",
)
SleepEDF-78/Cassette dataset with full-night EEG recordings from 78 healthy subjects, widely used for sleep staging benchmarks.
from pyhealth.datasets import SleepEDFDataset

dataset = SleepEDFDataset(
    root="data/sleepedf/",
)
BMDHS heart sound dataset with phonocardiogram recordings for cardiac valve disease classification and auscultation research.
from pyhealth.datasets import BMDHSDataset

dataset = BMDHSDataset(
    root="data/bmdhs/",
)
Temple University EEG Abnormal (TUAB) corpus with clinical EEG recordings labeled as normal or abnormal by board-certified neurologists.
from pyhealth.datasets import TUABDataset

dataset = TUABDataset(
    root="data/tuab/",
)
Temple University EEG Events (TUEV) corpus annotated with six EEG event types including spike-wave complexes and artifacts.
from pyhealth.datasets import TUEVDataset

dataset = TUEVDataset(
    root="data/tuev/",
)
COVID-19 chest X-ray dataset for binary classification of COVID-19 vs. normal and pneumonia images across multiple sources.
from pyhealth.datasets import COVID19CXRDataset

dataset = COVID19CXRDataset(
    root="data/covid19cxr/",
)
NIH ChestX-ray14 with 112K frontal chest X-rays from 30K patients, annotated with 14 disease labels for multi-label classification.
from pyhealth.datasets import ChestXray14Dataset

dataset = ChestXray14Dataset(
    root="data/chestxray14/",
)
NCBI ClinVar database of genomic variants and their clinical significance, used for variant pathogenicity classification.
from pyhealth.datasets import ClinVarDataset

dataset = ClinVarDataset(
    root="data/clinvar/",
)
Catalogue of Somatic Mutations in Cancer (COSMIC) for predicting the functional impact of somatic mutations in cancer.
from pyhealth.datasets import COSMICDataset

dataset = COSMICDataset(
    root="data/cosmic/",
)
TCGA Prostate Adenocarcinoma (PRAD) multi-omics dataset for cancer survival prediction using RNA-seq, CNV, and mutation profiles.
from pyhealth.datasets import TCGAPRADDataset

dataset = TCGAPRADDataset(
    root="data/tcga/",
)
Kaggle medical transcriptions dataset with 5K+ clinical notes across 40 specialties for specialty classification and NLP tasks.
from pyhealth.datasets import MedicalTranscriptionsDataset

dataset = MedicalTranscriptionsDataset(
    root="data/medical_transcriptions/",
)
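The loaders listed above share one constructor shape: a `root` directory and, for table-based sources, a `tables` list. As a rough sketch under that assumption, a small wrapper could check the data directory before handing off to the named class. `load_dataset` is a hypothetical helper, not part of PyHealth, and individual loaders may accept or require arguments that this page does not show.

```python
import importlib
from pathlib import Path


def load_dataset(class_name, root, tables=None):
    """Hypothetical helper (not part of PyHealth): check root, then build the loader."""
    # Fail early with a clear message if the data directory is missing.
    if not Path(root).is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")

    # Import lazily so the path check runs even where PyHealth is not installed.
    datasets_module = importlib.import_module("pyhealth.datasets")
    dataset_cls = getattr(datasets_module, class_name)

    # Assumption: loaders take `root`, plus `tables` for table-based sources,
    # exactly as in the snippets above.
    kwargs = {"root": root}
    if tables is not None:
        kwargs["tables"] = list(tables)
    return dataset_cls(**kwargs)
```

For instance, `load_dataset("MIMIC3Dataset", "data/mimic3/", tables=["DIAGNOSES_ICD", "PROCEDURES_ICD"])` would mirror the MIMIC-III snippet, raising `FileNotFoundError` first if `data/mimic3/` does not exist.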
Need a custom dataset?
PyHealth makes it easy to add your own data source by subclassing BaseDataset. Your custom dataset immediately gets access to all PyHealth tasks, models, and the trainer, with no extra wiring needed.
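The subclassing pattern might look roughly like the sketch below. This page does not show the `BaseDataset` interface, so the block defines a local stand-in class, `BaseDatasetStandIn`, that only mirrors the `root` and `tables` arguments used in the loader examples. It is not PyHealth's class, and the real base class may require additional arguments or methods. `MyClinicDataset` and the `visits` and `labs` table names are hypothetical.

```python
class BaseDatasetStandIn:
    """Local stand-in for illustration only; NOT PyHealth's BaseDataset."""

    def __init__(self, root, tables):
        self.root = root
        self.tables = list(tables)


class MyClinicDataset(BaseDatasetStandIn):
    """Hypothetical custom dataset that supplies a default table list."""

    DEFAULT_TABLES = ["visits", "labs"]  # hypothetical table names

    def __init__(self, root, tables=None):
        # Fall back to the default tables when the caller does not choose any.
        super().__init__(root=root, tables=tables or self.DEFAULT_TABLES)


dataset = MyClinicDataset(root="data/my_clinic/")
print(dataset.tables)
```

In a real project the subclass would inherit from PyHealth's `BaseDataset` rather than the stand-in, following whatever constructor signature the API reference documents.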