如何根据Logistic回归列线图计算患者风险得分?

1. 背景知识

在上一章中我们介绍了如何基于Cox回归模型列线图计算患者的生存概率及风险得分 。本章我们将介绍如何根据Logistic回归模型构建的列线图计算患者的风险得分。

对于从Nomogram中获得单个患者的风险得分与相应结局发生概率并不困难,但这里有两个技术问题一直困扰着很多研究者:(1) 如何根据Logistic回归列线图精确计算某个患者的风险得分与生存概率?(2) 如何根据Logistic回归列线图精确计算数据集中所有患者得分与生存概率?

如何解决这两个问题呢?下面我们就采用笔者发表在Ann Transl Med杂志的文献《In-depth mining of clinical data: the construction of clinical prediction model with R》[1] 中第82页的案例1(MASS包自带数据集biopsy)演示如何解决这两个问题。

2. 案例解读

[案例1] MASS包中biopsy数据集是一个来自威斯康星州乳腺癌患者数据集。目的是想明确肿瘤活检结果是良性还是恶性。研究者使用细针穿刺(FNA)技术收集样本并进行活检,确定诊断结果(恶性或良性)。数据集包含699个患者的组织样本,保存在有11个变量的数据框中。医疗团队对9个特征进行了评分和编码,评分的范围是1~10。我们的任务是开发尽可能准确的预测模型,以确定肿瘤性质。

3. R语言代码

我们首先先加载MASS包并准备好乳腺癌数据,剔除缺失值:

library(MASS)
biopsy$ID = NULL
names(biopsy) = c("thick", "u.size", "u.shape", "adhsn","s.size", "nucl", "chrom", "n.nuc", "mit", "class")
biopsy<- na.omit(biopsy)

并按照7:3比例把全集随机划分为训练集与验证集:

set.seed(123) #random number generator
ind <- sample(2, nrow(biopsy), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy[ind==1, ] #the training data set
test <- biopsy[ind==2, ] #the test data set
str(train)
## 'data.frame':    474 obs. of  10 variables:
##  $ thick  : int  5 3 8 1 2 4 2 5 1 8 ...
##  $ u.size : int  1 1 10 1 1 2 1 3 1 7 ...
##  $ u.shape: int  1 1 10 1 1 1 1 3 1 5 ...
##  $ adhsn  : int  1 1 8 1 1 1 1 3 1 10 ...
##  $ s.size : int  2 2 7 2 2 2 2 2 2 7 ...
##  $ nucl   : int  1 2 10 10 1 1 1 3 3 9 ...
##  $ chrom  : int  3 3 9 3 1 2 2 4 3 5 ...
##  $ n.nuc  : int  1 1 7 1 1 1 1 4 1 5 ...
##  $ mit    : int  1 1 1 1 5 1 1 1 1 4 ...
##  $ class  : Factor w/ 2 levels "benign","malignant": 1 1 2 1 1 1 1 2 1 2 ...
##  - attr(*, "na.action")= 'omit' Named int  24 41 140 146 159 165 236 250 276 293 ...
##   ..- attr(*, "names")= chr  "24" "41" "140" "146" ...

打包数据,并基于数据集中6个特征构建预测模型:

library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
##   method         from
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
##     format.pval, units
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
##     backsolve
dd <- datadist(train)
option <- options(datadist = "dd")
fit1 <- lrm(class ~ thick+u.size+u.shape+nucl+chrom+n.nuc, data=train, x=T, y=T)
nom <- nomogram(fit1, fun = plogis, fun.at = c(.01, seq(.1,.9, by = .2), .999), lp = F, funlabel = "Probability of malignancy")
plot(nom,xfrac=0.4)

图1. Logistic回归列线图

根据预测模型计算训练集中所有患者恶性肿瘤的预测概率及线性预测值:

train$pred1 <- predict(fit1, train, type = "fitted")
train$linear.predictor <- predict(fit1, train, type = "lp")

根据预测模型计算验证集中所有患者恶性肿瘤的预测概率及线性预测值:

test$pred2 <- predict(fit1, test, type = "fitted")
test$linear.predictor <- predict(fit1, test, type = "lp")

使用InformationValue包中misClassError()函数在在训练集中计算预测模型的区分度并使用plotROC()函数绘制ROC曲线:

library(InformationValue)
train$actuals <- ifelse(train$class == "malignant", 1, 0)
misClassError(train$actuals, train$pred1)
## [1] 0.0295
plotROC(train$actuals, train$pred1)

图2. 训练集的ROC曲线及C-statistics

使用InformationValue包中misClassError()函数在在验证集中计算预测模型的区分度并使用plotROC()函数绘制ROC曲线:

test$actuals <- ifelse(test$class == "malignant", 1, 0)
misClassError(test$actuals, test$pred2)
## [1] 0.0239
plotROC(test$actuals, test$pred2)

图3. 验证集的ROC曲线及C-statistics

我们可以使用PredictABEL包中plotCalibration()函数在训练集绘制校准曲线,代码如下:

library(PredictABEL)
## Loading required package: ROCR
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
##     lowess
## Loading required package: epitools
##
## Attaching package: 'epitools'
## The following object is masked from 'package:survival':
##
##     ratetable
## Loading required package: PBSmodelling
##
## -----------------------------------------------------------
## PBS Modelling 2.68.8 -- Copyright (C) 2005-2020 Fisheries and Oceans Canada
##
## A complete user guide 'PBSmodelling-UG.pdf' is located at
## C:/Users/zzr37/Documents/R/win-library/3.6/PBSmodelling/doc/PBSmodelling-UG.pdf
##
## Packaged on 2019-03-12
## Pacific Biological Station, Nanaimo
##
## All available PBS packages can be found at
## https://github.com/pbs-software
## -----------------------------------------------------------
##
## Attaching package: 'PredictABEL'
## The following object is masked from 'package:InformationValue':
##
##     plotROC
plotCalibration(data=train, cOutcome=13, predRisk=train$pred1, groups=30, rangeaxis=c(0,1))

## $Table_HLtest
##                   total meanpred meanobs predicted observed
## 0.00177              23    0.002   0.000      0.04        0
## 0.00258              20    0.003   0.000      0.05        0
## [0.00265,0.00313)    12    0.003   0.000      0.04        0
## [0.00313,0.00387)    25    0.004   0.000      0.09        0
## [0.00387,0.00457)    16    0.004   0.000      0.07        0
## [0.00457,0.00569)    15    0.005   0.000      0.07        0
## [0.00569,0.00739)    37    0.007   0.000      0.25        0
## [0.00739,0.00829)    13    0.008   0.000      0.11        0
## [0.00829,0.01095)    15    0.010   0.000      0.15        0
## [0.01095,0.01270)    15    0.012   0.000      0.18        0
## [0.01270,0.01499)    15    0.014   0.000      0.21        0
## [0.01499,0.01976)    16    0.017   0.000      0.28        0
## [0.01976,0.02450)    15    0.021   0.000      0.32        0
## [0.02450,0.02895)    17    0.028   0.000      0.47        0
## [0.02895,0.03972)    15    0.034   0.000      0.52        0
## [0.03972,0.07250)    16    0.054   0.125      0.87        2
## [0.07250,0.39750)    16    0.145   0.312      2.32        5
## [0.39750,0.84508)    15    0.689   0.733     10.33       11
## [0.84508,0.95901)    16    0.913   0.875     14.62       14
## [0.95901,0.97898)    16    0.969   1.000     15.51       16
## [0.97898,0.99177)    16    0.985   0.938     15.77       15
## [0.99177,0.99433)    16    0.993   0.938     15.89       15
## [0.99433,0.99711)    15    0.996   1.000     14.93       15
## [0.99711,0.99864)    16    0.998   1.000     15.97       16
## [0.99864,0.99937)    16    0.999   1.000     15.98       16
## [0.99937,0.99973)    16    1.000   1.000     15.99       16
## [0.99973,0.99991)    16    1.000   1.000     16.00       16
## [0.99991,0.99999]    15    1.000   1.000     15.00       15
##
## $Chi_square
## [1] NaN
##
## $df
## [1] 28
##
## $p_value
## [1] NaN

我们可以使用PredictABEL包中plotCalibration()函数在验证集绘制校准曲线,代码如下:

plotCalibration(data=test, cOutcome=13, predRisk=test$pred2, groups=30, rangeaxis=c(0,1))

## $Table_HLtest
##                   total meanpred meanobs predicted observed
## 0.00177              14    0.002   0.000      0.02        0
## 0.00258               8    0.003   0.000      0.02        0
## [0.00275,0.00343)     6    0.003   0.000      0.02        0
## [0.00343,0.00390)    13    0.004   0.000      0.05        0
## 0.00390               1    0.004   0.000      0.00        0
## [0.00405,0.00493)     7    0.004   0.000      0.03        0
## 0.00493               7    0.005   0.000      0.03        0
## [0.00629,0.00764)    10    0.007   0.000      0.07        0
## [0.00764,0.00836)     4    0.008   0.000      0.03        0
## [0.00836,0.01200)     9    0.010   0.000      0.09        0
## [0.01200,0.01372)     5    0.012   0.000      0.06        0
## 0.01372               7    0.014   0.000      0.10        0
## [0.01592,0.01820)     8    0.017   0.000      0.14        0
## [0.01820,0.02257)    11    0.020   0.000      0.22        0
## [0.02257,0.02410)     3    0.023   0.000      0.07        0
## [0.02410,0.03485)     6    0.027   0.000      0.16        0
## [0.03485,0.05015)     7    0.038   0.000      0.27        0
## [0.05015,0.10878)     7    0.072   0.000      0.50        0
## [0.10878,0.39171)     7    0.225   0.286      1.58        2
## [0.39171,0.85081)     7    0.592   0.714      4.14        5
## [0.85081,0.95564)     7    0.908   0.857      6.35        6
## [0.95564,0.97828)     7    0.967   0.857      6.77        6
## [0.97828,0.98969)     7    0.984   1.000      6.89        7
## [0.98969,0.99467)     7    0.993   1.000      6.95        7
## [0.99467,0.99728)     7    0.996   1.000      6.97        7
## [0.99728,0.99845)     7    0.998   1.000      6.98        7
## [0.99845,0.99940)     7    0.999   1.000      6.99        7
## [0.99940,0.99988)     7    1.000   1.000      7.00        7
## [0.99988,0.99998]     6    1.000   1.000      6.00        6
##
## $Chi_square
## [1] NaN
##
## $df
## [1] 28
##
## $p_value
## [1] NaN

下面我们使用nomogramFormula包中formula_rd()函数及points_cal()函数根据训练集、验证集及全集中自变量取值计算列线图得分,并将计算的得分做为新的列添加至对应的数据集中。

library(nomogramFormula)
options(option)
results <- formula_rd(nomogram = nom)
train$points <- points_cal(formula = results$formula,rd=train)
test$points <- points_cal(formula = results$formula,rd=test)
biopsy$points <- points_cal(formula = results$formula,rd=biopsy)
head(biopsy$points)
## [1]  60.81471 177.30266  47.35216 203.91259  49.70359 349.65775

4. 总结与讨论

以上介绍了如何基于R语言根据Logistic回归模型计算数据集中单个患者的列线图的得分、线性预测值及结局发生概率。Logistic回归模型的数学公式是我们计算的依据。

5. 参考文献

[1]  Zhou ZR, Wang WW, Li Y, Jin KR, Wang XY, Wang ZW, Chen YS, Wang SJ, Hu J, Zhang HN, Huang P, Zhao GZ, Chen XX, Li B, Zhang TS. In-depth mining of clinical data: the construction of clinical prediction model with R. Ann Transl Med 2019. doi: 10.21037/atm.2019.08.63

统计与绘图

罕见(零)事件系列方法学研究

2020-9-30 23:08:22

统计与绘图

SPSS进行数据转化(变量的转化)的操作步骤演示

2020-10-4 22:04:32

加入Q群
0 条回复 A文章作者 M管理员
    暂无讨论,说说你的看法吧
个人中心
今日签到
有新私信 私信列表
搜索