Development of a novel transcriptomic measure of aging: Transcriptomic Mortality-risk Age (TraMA)

Eric T. Klopack1, Gokul Seshadri2, Thalida Em Arpawong1, Steve Cole3, Bharat Thyagarajan2, Eileen M. Crimmins1

1Leonard Davis School of Gerontology, University of Southern California, Los Angeles, CA 90089, USA
2Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA
3David Geffen School of Medicine, University of California, Los Angeles, CA 90089, USA

Correspondence to: Eric T. Klopack; email: klopack@usc.edu

Keywords: biological aging, transcriptomics, mortality, accelerated aging, machine learning

Received: February 6, 2025    Accepted: June 2, 2025    Published: June 13, 2025

Copyright: © 2025 Klopack et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

ABSTRACT

Increasingly, research suggests that aging is a coordinated multi-system decline in functioning that occurs at multiple biological levels. We developed and validated a transcriptomic (RNA-based) aging measure, which we call Transcriptomic Mortality-risk Age (TraMA), by applying elastic net Cox regression to RNAseq data from the 2016 Health and Retirement Study to predict 4-year mortality hazard. In a holdout test sample, TraMA was associated with earlier mortality, more chronic conditions, poorer cognitive functioning, and more limitations in activities of daily living. TraMA was also externally validated in the Long Life Family Study and several publicly available datasets. Results suggest that TraMA is a robust, portable RNAseq-based aging measure that is comparable to, but independent of, past biological aging measures (e.g., GrimAge). TraMA is likely to be of particular value to researchers interested in understanding the biological processes underlying health and aging, and for social, psychological, epidemiological, and demographic studies of health and aging.

INTRODUCTION

A growing body of research suggests that aging is a coordinated multi-system decline in functioning that occurs at multiple biological levels (e.g., DNA damage accumulation, cellular aging and senescence, chronic disease morbidity, physical disability) [1, 2]. A major goal of geroscience research is to develop biomarkers of this aging process using minimally invasive methods in humans, as these markers are highly useful in evaluating interventions, understanding social inequalities in health and aging, and researching causes and consequences of accelerated aging in humans [3, 4].

Biomarkers of aging have been developed using combinations of clinical biomarkers [5, 6], DNA methylation (DNAm) [3, 7–10], inflammatory markers [11], telomere length [12], metabolomics [13, 14], and proteomics [15, 16]. These tools have been extremely useful for understanding how social and environmental exposures affect health and aging [17–23], the long-term impact of early life adversity [24–27], how timing of exposure matters for health [28–30], among other important advances.

RNA gene expression may be a particularly valuable tool, as RNA expression is more directly related to genes and gene functioning than DNAm is, and may therefore be more easily interpretable [31]. DNAm largely describes which genes may or may not be transcribed, whereas transcriptomics more directly measures active gene expression [32]. Additionally, research suggests that RNA changes may occur more rapidly than DNAm changes and may capture short-term and long-term responses not captured in DNAm [33]. Thus, RNA- and DNAm-based aging measures may be complementary in studying aging processes.

Previous transcriptomic (RNA-based) aging measures [34] were generally developed using array data (rather than RNA sequencing, which predominates in newer studies), utilized small, specialty samples, or were estimated in tissue other than blood [34–36]. Indeed, a recent review noted the limitations of existing transcriptomic aging measures and the large number of unknowns about their reproducibility and ability to capture health and mortality risk [31].

At the same time, there has been a proliferation of research utilizing next-generation high throughput RNA sequencing (RNAseq), and several large population-based surveys (e.g., the Health and Retirement Study (HRS), the National Longitudinal Study of Adolescent to Adult Health (Add Health), Midlife in the United States (MIDUS), the Northern Ireland Cohort for the Longitudinal Study of Ageing (NICOLA)) are collecting large RNAseq samples that will be able to address questions about the causes and consequences of transcriptomic aging at the population level.

For these analyses to yield useful generalizable findings, a reliable and portable summary measure of accelerated transcriptomic aging is needed. We developed such a measure here using the 2016 HRS Venous Blood Study (VBS), a nationally representative sample of nearly 4000 US adults aged 50 and older. We utilized elastic net penalized regression to estimate a transcriptomic prediction measure of 4-year mortality risk—Transcriptomic Mortality-risk Age (TraMA)—using more than 10,000 gene transcripts in a training sub-sample. We evaluated this measure in a hold-out testing sub-sample of the HRS, in an external dataset (the Long Life Family Study; LLFS), and in several other publicly available datasets. Our plan of analysis for this study is shown in Figure 1A.
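For readers unfamiliar with the method, the standard form of an elastic net penalized Cox model is given below. This is the general textbook objective, not a statement of the exact settings used for TraMA: the mixing parameter $\alpha$, the penalty strength $\lambda$, and the scaling of the likelihood term vary across software and are not specified in this section. Here $x_i$ is the vector of predictors (transcript levels and age) for participant $i$, $\delta_i$ indicates an observed death, and $R(t_i)$ is the set of participants still at risk at time $t_i$ (ties are ignored for simplicity).

```latex
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \left[ x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right) \right],
\qquad
\hat{\beta} = \arg\max_{\beta}
  \left\{ \ell(\beta) - \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right] \right\}
```

The $\ell_1$ term drives many coefficients exactly to zero, which is how a model trained on thousands of transcripts can end with a short list of selected genes, while the $\ell_2$ term stabilizes estimates among correlated transcripts.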

RESULTS

Training TraMA

Because we are interested in developing a measure that is accurate and portable to other human datasets, we restricted the set of genes used for training to coding genes with relatively high expression in human venous blood...

Figure 1. (A) Plan of analysis for the current study. (B) Nested regression results from the HRS testing data...
Figure 2. (A) Regressions of smoker status, cigarette pack years...
Figure 3. (A) Regression of standardized TraMA on cancer diagnosis...

TABLES

Table 1A. Weighted descriptive statistics for the HRS training and testing data.

        Training data           Testing data
        Mean/Proportion  SD     Mean/Proportion  SD
Age     68.62            9.21   68.66            9.09
...
    

Table 2. Genes (and age) and their coefficients in the TraMA score.

Gene     Ensembl ID        Coefficient
ZNF44    ENSG00000197857   −0.2630
CRYBG3   ENSG00000080200   −0.1964
...
METTL9   ENSG00000197006   0.5401
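As an illustration of how coefficients like those in Table 2 are applied, the sketch below computes a weighted sum of expression values. It is a minimal sketch under stated assumptions: only the three coefficients visible in the truncated table are used, the expression values are made up, the function name `linear_predictor` is hypothetical, and the handling of unavailable genes (skipping them and reporting which were skipped) is an assumption. Any rescaling of this sum into age units is not covered here.

```python
# Coefficients copied from the visible rows of Table 2 (the table is truncated,
# so this is NOT the full TraMA gene set).
COEFFICIENTS = {
    "ZNF44": -0.2630,
    "CRYBG3": -0.1964,
    "METTL9": 0.5401,
}


def linear_predictor(expression, coefficients):
    """Return (weighted sum, list of genes missing from `expression`).

    expression:   {gene symbol: expression value for one sample}
    coefficients: {gene symbol: model coefficient}

    Assumption: genes absent from the sample are skipped rather than imputed.
    """
    missing = [gene for gene in coefficients if gene not in expression]
    score = sum(
        coef * expression[gene]
        for gene, coef in coefficients.items()
        if gene in expression
    )
    return score, missing


# Hypothetical expression values for a single sample.
sample = {"ZNF44": 1.0, "CRYBG3": 2.0, "METTL9": 3.0}
score, missing = linear_predictor(sample, COEFFICIENTS)
```

Returning the list of missing genes alongside the score makes it easy to report gene coverage when the measure is applied to a dataset that lacks some transcripts.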
    

DISCUSSION

Large, population-based studies of aging are collecting omic-level biological data...

This RNA-based measure has the capacity to contribute to this highly active and quickly evolving literature. Associations between TraMA and health outcomes were robust and consistent in the HRS testing sample, the LLFS, and other validation samples. Thus, this measure appears to be a useful, portable indicator of the aging process. It appears to explain a large, unique portion of aging-related health outcomes and is associated with health risks in expected directions. Our results show its utility both in large, population-based samples and in smaller clinical, specialty, and community-based samples. We thus believe this measure can be a useful tool for researchers interested in understanding the aging process in humans.

METHODS

Cohorts

The Health and Retirement Study (HRS) is an ongoing panel study of older adults, begun in 1992, that is designed to be representative of older US adults when weighted. As part of 2016 data collection, venous blood was collected from a subsample of the HRS. From about 4000 participants, 2.5 ml of blood was collected in PAXgene tubes. Total RNA extraction was performed with the QIAcube semi-automated method using the PAXgene Blood miRNA Kit. Assays used 200-500 ng of RNA for each sample. All RNA species were extracted and stored for future use. RNA was extracted from only half a PAXgene tube to ensure RNA storage in a variety of formats. Ribosomal RNA and globin reduction was performed using the TruSeq stranded Total Library Prep Gold kit - Ribozero Gold kit. RNAseq was performed on a NovaSeq (Illumina Inc.) using 50 bp paired-end reads. All samples were sequenced to a minimum depth of 20 M reads. RNAseq was successfully performed on 3685 participants. The HRS pipeline closely mirrors the TOPMed/GTEx RNA-Seq analysis pipeline with minor modifications. More information about RNAseq pipelines is available elsewhere [49], including the HRS website (https://hrs.isr.umich.edu/about).

The Long Life Family Study (LLFS) is a longitudinal sample of nearly 5000 participants from 539 families that were selected because of their exceptional longevity. There have been three waves of data collected 6-8 years apart. The first and second waves of data included blood collection. We use data from the first wave to align with HRS. More information is available at the LLFS website https://longlifefamilystudy.com/.

RNA sequencing for Visit 1 was performed using RNA extracted from PAXgene™ Blood RNA tubes, processed with the Qiagen PreAnalytiX PAXgene Blood miRNA Kit. Library preparation, quality control, and sequencing were carried out by the Division of Computation and Data Sciences at Washington University, using the nf-core/rnaseq 3.14.0 pipeline for read alignment, duplicate marking, and transcript quantification. Genes with low expression (fewer than 4 counts per million in at least 98.5% of samples) and those with significant intergenic overlap were filtered out. This resulted in a final dataset of 1,810 samples and 16,418 genes. For this study, we used these LLFS RNAseq data, with the filtered raw counts converted to a Log2CPM (log2 counts per million) scale for further analysis.
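The low-expression filter and Log2CPM conversion described above can be sketched as follows. This is a minimal illustration, not the LLFS pipeline itself: the function names are hypothetical, the pseudocount added before taking logs is an assumption (the text does not state one), and library sizes are recomputed from whichever counts are passed in.

```python
import math


def counts_to_cpm(counts):
    """Convert raw counts to counts per million within each sample.

    counts: {gene: [raw count for each sample]}
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * row[j] / lib_sizes[j] for j in range(n_samples)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=4.0, low_fraction=0.985):
    """Drop genes with CPM below `min_cpm` in at least `low_fraction` of samples."""
    cpm = counts_to_cpm(counts)
    kept = {}
    for gene, values in cpm.items():
        share_low = sum(v < min_cpm for v in values) / len(values)
        if share_low < low_fraction:
            kept[gene] = counts[gene]
    return kept


def log2_cpm(counts, pseudocount=1.0):
    """Log2-transform CPM values; the pseudocount (an assumption) avoids log2(0)."""
    cpm = counts_to_cpm(counts)
    return {gene: [math.log2(v + pseudocount) for v in values]
            for gene, values in cpm.items()}


# Toy data: three hypothetical genes measured in two samples.
raw = {"GENE_A": [100, 200], "GENE_B": [0, 0], "GENE_C": [999900, 999800]}
filtered = filter_low_expression(raw)
logged = log2_cpm(filtered)
```

In this toy example GENE_B is never expressed, so it falls below the CPM threshold in every sample and is removed before the log transform.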

The publicly available COPDGene study (GSE171730) includes 454 current and 767 former smokers, comprising non-Hispanic White and African American men and women between the ages of 47 and 86 in the US. RNAseq was performed on whole blood using the Illumina HiSeq 2000 platform. More information is available on the COPDGene website (https://copdgene.org/). Information about current smoker status, pack years, forced expiratory volume in one second (FEV1) predicted percentage, race, sex/gender, and batch is available.

GSE123658 is a sample of 43 healthy donors and 39 type 1 diabetes patients between ages 19 and 73. RNAseq was assessed in whole blood using Illumina NextSeq 500 or HiSeq 4000 platforms. Information about diabetes status, age, and sex/gender is available.

The Mount Sinai Crohn’s and Colitis Registry (GSE186507) includes 821 inflammatory bowel disease (IBD) patients and 209 healthy controls aged 19 to 82 recruited during an endoscopy appointment from December 2013 to September 2016. RNAseq was assessed in whole blood using the Illumina HiSeq 2500 platform. Information about IBD status, active IBD status (Harvey-Bradshaw index (HBI) ≥ 5), disease severity (Simple Endoscopic Score for Crohn’s Disease; SESCD), age, and sex/gender is available.

GSE185263 is a sample of 348 sepsis patients and 44 healthy controls aged 18 to 96 from several countries, including Australia, Colombia, the Netherlands, and Canada (sites in Toronto and Vancouver). RNAseq was assessed in whole blood using the Illumina HiSeq 2500 platform. Information about sepsis status, severity (Sequential Organ Failure Assessment (SOFA) scores), mortality, age, and sex/gender is available.

GSE203024 includes blood samples from 1,013 human cancer patients with 11 different types of cancer or colorectal polyps and 1,832 control samples without a cancer diagnosis, with RNA profiled on Affymetrix U133 Plus 2.0 GeneChips. Expression values were log2 transformed and missing values were set to the median. Affymetrix IDs were matched to Ensembl IDs using biomaRt [37]. If an Ensembl ID matched more than one probe, the mean of the values was taken. Of the 35 TraMA genes, 34 were available.
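The array preprocessing described above (log2 transform, median imputation, and averaging probes that map to the same Ensembl ID) can be sketched as follows. This is a minimal sketch under stated assumptions: the probe-to-gene mapping is passed in as a plain dictionary rather than queried from biomaRt, the function names and identifiers are hypothetical, and the median is taken per probe across samples, which the text does not specify.

```python
import math
from statistics import mean, median


def log2_and_impute(values):
    """Log2-transform one probe's values and fill missing entries (None).

    Assumption: missing values are replaced by the median of that probe's
    observed log2 values across samples. Values must be positive.
    """
    logged = [math.log2(v) if v is not None else None for v in values]
    fill = median(v for v in logged if v is not None)
    return [fill if v is None else v for v in logged]


def collapse_probes(probe_values, probe_to_gene):
    """Average all probes that map to the same gene ID, sample by sample.

    probe_values:  {probe ID: [expression value or None for each sample]}
    probe_to_gene: {probe ID: gene ID}; unmapped probes are dropped.
    """
    grouped = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        grouped.setdefault(gene, []).append(log2_and_impute(values))
    return {gene: [mean(column) for column in zip(*rows)]
            for gene, rows in grouped.items()}


# Toy data: two hypothetical probes map to GENE_X, one to GENE_Y.
probes = {"p1": [2.0, 8.0, None], "p2": [8.0, 8.0, 8.0], "p3": [4.0, 4.0, 4.0]}
mapping = {"p1": "GENE_X", "p2": "GENE_X", "p3": "GENE_Y"}
by_gene = collapse_probes(probes, mapping)
```

Imputing before averaging means a probe with a missing value still contributes to its gene's mean, which keeps every sample's value on the same footing.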

GSE65391 is a sample of pediatric lupus patients. Expression values were log2 transformed and missing values were set to the median. Illumina IDs were matched to Ensembl IDs using biomaRt [37]. If an Ensembl ID matched more than one probe, the mean of the values was taken. Of the 35 TraMA genes, 32 were available.

GSE124612 is a sample of male C57BL/6 mice exposed to 1.5, 3, 6, or 10 Gy of gamma rays, or sham-irradiated as controls, and sacrificed 1, 2, 3, 5, or 7 days after exposure. Each experimental group contained 10 mice, except for 10 Gy on day 7, which had only 8 mice. RNA was profiled using the Agilent-026655 Whole Mouse Genome Microarray. Expression values were log2 transformed and missing values were set to the median. Mus musculus genes were matched to homologous Homo sapiens genes using biomaRt [37]. If an Ensembl ID matched more than one probe, the mean of the values was taken. Of the 35 TraMA genes, 15 were available. Because the ages of all of the mice were equal, age was arbitrarily set to 8 to calculate TraMA. Two of the genes used to indicate cell type were not available, so we used only CD3D, CD19, CD4, CD8A, and CD14 for these analyses.