End-to-End Submissions in R with the Pharmaverse

Pharmaverse Examples

Resource with example datasets

Introduction

Slides

Instructors

Daniel Sjoberg (MSK -> Roche/Genentech)
Becca Krouse (GSK)
Ben Straub (GSK)
Ram Ganapathy (Syneos -> Roche/Genentech)

What is this workshop?

Get data into CDISC standards (ADaM and SDTM domains)
Pharmaverse
Align pharma industry on a standard process
Data collection to submission
R packages to support Clinical Reporting in R
Create SDTM from raw data (CDASH and non-CDASH formats)
Create ADaM datasets from SDTM
Multiple ways to create tables

End-to-End

pharmaverse raw
pharmaverse SDTM
pharmaverse ADaM

Exercise 1

Code

# Let's warm up!

library(dplyr)
library(pharmaverseadam)

# Using dplyr:
#  - From the ADSL dataset:
#   - Subset to the safety population (SAFFL == "Y")
#   - calculate the number of unique subjects in each treatment group (TRT01A)  

# View(pharmaverseadam::adsl)

knitr::kable(
  pharmaverseadam::adsl |> 
    dplyr::filter(SAFFL == "Y") |> 
    dplyr::group_by(TRT01A) |>
    dplyr::summarise(n = dplyr::n_distinct(USUBJID))
)

TRT01A	n
Placebo	86
Xanomeline High Dose	72
Xanomeline Low Dose	96

SDTM Mapping

Slides

SDTM

Study Data Tabulation Model

Mapping raw data to standards
raw (EDC) to SDTM is difficult
SDTM across companies is standard
SDTM -> ADaM is easy

sdtm.oak package

Accommodates varying raw data structures from different EDC systems and vendors

Algorithms

variables with similar mapping algorithms are grouped together
16,000 vars can be grouped into 22 groups
algorithms are backbone of oak

assign_no_ct() -> no controlled terminology

assign_ct() -> 1:1 mapping with controlled terminology

assign_datetime() -> ISO8601 format

hardcode_ct() -> text on EDC (eg units)
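
The ISO 8601 target format mentioned for assign_datetime() can be illustrated without sdtm.oak. This is a minimal base R sketch, assuming hypothetical raw values collected as day/month/year; it shows only the target format and is not how the package is implemented.

```r
# Hypothetical raw EDC values (day/month/year); not taken from pharmaverseraw
raw_date <- "18/07/2019"
raw_datetime <- "18/07/2019 15:25"

# Parse with an explicit input format, then print in ISO 8601
iso_date <- format(as.Date(raw_date, format = "%d/%m/%Y"), "%Y-%m-%d")
iso_datetime <- format(
  as.POSIXct(raw_datetime, format = "%d/%m/%Y %H:%M", tz = "UTC"),
  "%Y-%m-%dT%H:%M"
)

iso_date      # "2019-07-18"
iso_datetime  # "2019-07-18T15:25"
```

Seconds were not collected in the second value, so they are left out of the output rather than filled in.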

Compared to dplyr

do not have to write case_when statements
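
To make the comparison concrete, here is a rough sketch of the two styles using dplyr only. The toy data and the two-row lookup table are assumptions for illustration (the column names collected_value and term_value mirror the study_ct example in these notes); the join is a stand-in for the idea of data-driven terminology, not sdtm.oak's actual implementation.

```r
library(dplyr)

# Hypothetical raw responses; IT.AESER mirrors the raw variable name used in this workshop
raw <- tibble(IT.AESER = c("Yes", "No", "Yes", NA))

# Hand-written mapping: one case_when branch per collected value
by_hand <- raw |>
  mutate(AESER = case_when(
    IT.AESER == "Yes" ~ "Y",
    IT.AESER == "No" ~ "N",
    TRUE ~ NA_character_
  ))

# Lookup-table mapping: the terminology lives in data, not in code
ct <- tibble(collected_value = c("No", "Yes"), term_value = c("N", "Y"))
by_lookup <- raw |>
  left_join(ct, by = c("IT.AESER" = "collected_value")) |>
  rename(AESER = term_value)

# Both approaches give the same submission values
identical(by_hand$AESER, by_lookup$AESER)
```

With the lookup approach, adding a new collected value means adding a row to the terminology table instead of editing every mapping statement.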

Topic Variables

Identifier (identifies the study, subject, and record)
Qualifier (describes the observation, e.g. result, unit, severity)
Timing (when the observation occurred or was collected)

EDC Domains

EDC Domain	Code
Demographics	DM
Medical History	MH
Adverse Events	AE
Concomitant Medications	CM
Laboratory Results	LB
Vital Signs	VS
Physical Examination	PE
Study Drug Administration	DA
Subject Disposition	DS
Efficacy Assessments	EF
Safety Assessments	SA
Questionnaires	QS
Imaging Assessments	IMG
Randomization	RAND
Protocol Deviations	PD

Example Raw -> SDTM mapping

Code

VS domain example
DM domain example

Code

library(sdtm.oak)
library(pharmaverseraw)
library(dplyr)

# AE aCRF - https://github.com/pharmaverse/pharmaverseraw/blob/main/vignettes/articles/aCRFs/AdverseEvent_aCRF.pdf

# Read in Raw dataset ----
ae_raw <- pharmaverseraw::ae_raw

# Generate oak_id_vars ----
ae_raw <- ae_raw %>%
  generate_oak_id_vars(
    pat_var = "PATNUM",
    raw_src = "ae_raw"
  )

# Read in Controlled Terminology
study_ct <-  data.frame(
  codelist_code = c("C66742", "C66742"),
  term_code = c("C49487", "C49488"),
  term_value = c("N", "Y"),
  collected_value = c("No", "Yes"),
  term_preferred_term = c("No", "Yes"),
  term_synonyms = c("No", "Yes"),
  stringsAsFactors = FALSE
)

# Exercise 1 ------------------------------------------------
# Map AETERM from raw_var=IT.AETERM, tgt_var=AETERM
ae <-
  # Derive topic variable
  # Map AETERM using assign_no_ct
  assign_no_ct(
    raw_dat = ae_raw,
    raw_var = "IT.AETERM",
    tgt_var = "AETERM",
    id_vars = oak_id_vars()
  )

# Exercise 2 ------------------------------------------------
# Map AESER from raw_var=IT.AESER, tgt_var=AESER. Codelist code for AESER is C66742
ae <- ae %>%
  # Map AESER using assign_ct
  assign_ct(
    raw_dat = ae_raw,
    raw_var = "IT.AESER",
    tgt_var = "AESER",
    ct_spec = study_ct,
    ct_clst = "C66742",
    id_vars = oak_id_vars()
  )

# Exercise 3 ------------------------------------------------
# Map AESDTH from raw_var=IT.AESDTH, tgt_var=AESDTH. Annotation text is 
# If "Yes" then AESDTH = "Y" else Not Submitted. Codelist code for AESDTH is C66742

ae <- ae %>%
  # Map AESDTH using condition_add & assign_ct, raw_var=IT.AESDTH, tgt_var=AESDTH
  assign_ct(
    raw_dat = condition_add(ae_raw, IT.AESDTH == "Yes"),
    raw_var = "IT.AESDTH",
    tgt_var = "AESDTH",
    ct_spec = study_ct,
    ct_clst = "C66742",
    id_vars = oak_id_vars()
  )

ADaM

ADSL - Subject Level Data

Each subject has 1 row/record
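
The one-record-per-subject rule is easy to check. A minimal sketch with dplyr, assuming a hypothetical toy dataset in which one subject is deliberately duplicated:

```r
library(dplyr)

# Hypothetical subject-level data; the duplicate row is deliberate
adsl_toy <- tibble(
  USUBJID = c("01-701-1015", "01-701-1023", "01-701-1023"),
  TRT01A  = c("Placebo", "Placebo", "Placebo")
)

# Subjects that break the one-record-per-subject rule
dup_subjects <- adsl_toy |>
  count(USUBJID, name = "n_records") |>
  filter(n_records > 1)

dup_subjects
```

An empty result means the dataset has exactly one row per subject.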

ADVS - Vital Signs dataset

ADVS is a basic data structure
focus is on records, not variables
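
"Records, not variables" means each measurement is its own row, identified by a parameter code, instead of its own column. The sketch below reshapes a hypothetical wide dataset into that layout with tidyr; the values are borrowed from the ADVS exercise in these notes, and the wide input itself is an assumption for illustration.

```r
library(tidyr)

# Hypothetical wide vital signs data: one column per measurement
vs_wide <- tibble::tibble(
  USUBJID = c("01-701-1015", "01-701-1028"),
  SYSBP = c(121, 130),
  DIABP = c(51, 79)
)

# BDS-style layout: one record per subject and parameter
vs_long <- pivot_longer(
  vs_wide,
  cols = c(SYSBP, DIABP),
  names_to = "PARAMCD",
  values_to = "AVAL"
)

vs_long
```

A new parameter then becomes additional rows, so the column structure of the dataset does not change.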

Apply info from Specs
Derive vars and records
Prepare dataset for submissions

Exercise 3

Code

# Exercise 1
# Update date and time imputation arguments so that any dates or times
# that are imputed are the last month/day of the year and 23:59:59

library(tibble)
library(lubridate)
library(admiral)

posit_mh <- tribble(
  ~USUBJID, ~MHSTDTC,
  1,        "2019-07-18T15:25:40",
  1,        "2019-07-18T15:25",
  1,        "2019-07-18",
  2,        "2024-02",
  2,        "2019",
  2,        "2019---07",
  3,        ""
)

paste0("Problem 1")

[1] "Problem 1"

Code

knitr::kable(
  derive_vars_dtm(
    dataset = posit_mh,
    new_vars_prefix = "AST",
    dtc = MHSTDTC,
    highest_imputation = "M",
    date_imputation = "last",
    time_imputation = "last"
  )
)

USUBJID	MHSTDTC	ASTDTM	ASTDTF	ASTTMF
1	2019-07-18T15:25:40	2019-07-18 15:25:40	NA	NA
1	2019-07-18T15:25	2019-07-18 15:25:59	NA	S
1	2019-07-18	2019-07-18 23:59:59	NA	H
2	2024-02	2024-02-29 23:59:59	D	H
2	2019	2019-12-31 23:59:59	M	H
2	2019---07	2019-12-31 23:59:59	M	H
3		NA	NA	NA
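
The "last" date imputation explains why 2024-02 becomes 2024-02-29 in the table above. The following base R sketch reproduces only that one step for a partial "YYYY-MM" date; last_day_of_month() is a hypothetical helper written for this illustration, not an admiral function.

```r
# Last day of the month for a partial "YYYY-MM" date (base R only)
last_day_of_month <- function(year_month) {
  first_day <- as.Date(paste0(year_month, "-01"))
  # First day of the following month, minus one day
  seq(first_day, by = "month", length.out = 2)[2] - 1
}

last_day_of_month("2024-02")  # 2024 is a leap year
last_day_of_month("2019-12")
```

Going through the first day of the next month avoids hard-coding month lengths or leap-year rules.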

Code

# Exercise 2
# Update set_values_to argument for the formula
# MAP Formula: MAP = (SYSBP + 2*DIABP) / 3

ADVS <- tribble(
  ~USUBJID,      ~PARAMCD, ~PARAM,                            ~AVALU,  ~AVAL, ~VISIT,
  "01-701-1015", "DIABP",  "Diastolic Blood Pressure (mmHg)", "mmHg",    51, "BASELINE",
  "01-701-1015", "SYSBP",  "Systolic Blood Pressure (mmHg)",  "mmHg",   121, "BASELINE",
  "01-701-1028", "DIABP",  "Diastolic Blood Pressure (mmHg)", "mmHg",    79, "BASELINE",
  "01-701-1028", "SYSBP",  "Systolic Blood Pressure (mmHg)",  "mmHg",   130, "BASELINE",
) 

paste0("Problem 2")

[1] "Problem 2"

Code

knitr::kable(
  derive_param_computed(
    ADVS,
    by_vars = exprs(USUBJID, VISIT),
    parameters = c("SYSBP", "DIABP"),
    set_values_to = exprs(
      AVAL = (AVAL.SYSBP + 2 * AVAL.DIABP) / 3,
      PARAMCD = "MAP",
      PARAM = "Mean Arterial Pressure (mmHg)",
      AVALU = "mmHg",
    )
  )
)

USUBJID	PARAMCD	PARAM	AVALU	AVAL	VISIT
01-701-1015	DIABP	Diastolic Blood Pressure (mmHg)	mmHg	51.00000	BASELINE
01-701-1015	SYSBP	Systolic Blood Pressure (mmHg)	mmHg	121.00000	BASELINE
01-701-1028	DIABP	Diastolic Blood Pressure (mmHg)	mmHg	79.00000	BASELINE
01-701-1028	SYSBP	Systolic Blood Pressure (mmHg)	mmHg	130.00000	BASELINE
01-701-1015	MAP	Mean Arterial Pressure (mmHg)	mmHg	74.33333	BASELINE
01-701-1028	MAP	Mean Arterial Pressure (mmHg)	mmHg	96.00000	BASELINE
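
The MAP values in the table can be cross-checked by hand. This is a small sketch using dplyr and tidyr on the same four input records; it is an independent check of the arithmetic, not a replacement for derive_param_computed().

```r
library(dplyr)
library(tidyr)

# Same input values as the ADVS example, reduced to the columns needed
advs_toy <- tribble(
  ~USUBJID,      ~PARAMCD, ~AVAL,
  "01-701-1015", "DIABP",     51,
  "01-701-1015", "SYSBP",    121,
  "01-701-1028", "DIABP",     79,
  "01-701-1028", "SYSBP",    130
)

# Put both pressures on one row per subject, then apply the formula
map_check <- advs_toy |>
  pivot_wider(names_from = PARAMCD, values_from = AVAL) |>
  mutate(MAP = (SYSBP + 2 * DIABP) / 3)

map_check
```

For subject 01-701-1015 this is (121 + 2 * 51) / 3 = 223 / 3, which matches the 74.33333 shown above.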

ARDs - Analysis Results Datasets

Tabulate and summarise categorical and continuous variables
cards computes summary statistics
cardx computes statistical analyses
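
The defining feature of an ARD is its long layout: one row per group and statistic. The sketch below builds that shape by hand with dplyr and tidyr on hypothetical data, to show what the structure looks like. It does not use the cards API, and the "Active" arm and AGE values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical analysis data
adsl_toy <- tibble(
  ARM = c("Placebo", "Placebo", "Active", "Active"),
  AGE = c(60, 70, 55, 65)
)

# One row per group and statistic: the long layout an ARD uses
ard_like <- adsl_toy |>
  group_by(ARM) |>
  summarise(N = n(), mean = mean(AGE), .groups = "drop") |>
  pivot_longer(c(N, mean), names_to = "stat_name", values_to = "stat")

ard_like
```

Because every statistic is a row, the same results can feed different table layouts without being recomputed.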

Exercise 4

Code

# ARD Exercise: Adverse Events summaries using {cards}


# Setup: run this first! --------------------------------------------------

# Load necessary packages
library(cards)

# Import & subset data
adsl <- pharmaverseadam::adsl |> 
  dplyr::filter(SAFFL=="Y")

adae <- pharmaverseadam::adae |> 
  dplyr::filter(SAFFL=="Y") |> 
  dplyr::filter(AESOC %in% unique(AESOC)[1:3]) |> 
  dplyr::group_by(AESOC) |> 
  dplyr::filter(AEDECOD %in% unique(AEDECOD)[1:3]) |> 
  dplyr::ungroup()

# Exercise ----------------------------------------------------------------

# A. Calculate the number and percentage of *unique* subjects with at least one AE:
#  - By each SOC (AESOC)
#  - By each Preferred term (AEDECOD) within SOC (AESOC)
#  - Within each treatment group (ARM)

ard_stack_hierarchical(
  data = adae,
  variables = c(AESOC,AEDECOD),
  by = ARM, 
  id = USUBJID,
  denominator = adsl
)

   group1 group1_level group2 group2_level variable variable_level stat_name stat_label stat
1                                          ARM      Placebo        n         n          86
2                                          ARM      Placebo        N         N          254
3                                          ARM      Placebo        p         %          0.339
4                                          ARM      Xanomeli…      n         n          84
5                                          ARM      Xanomeli…      N         N          254
6                                          ARM      Xanomeli…      p         %          0.331
7                                          ARM      Xanomeli…      n         n          84
8                                          ARM      Xanomeli…      N         N          254
9                                          ARM      Xanomeli…      p         %          0.331
10 ARM    Placebo                          AESOC    GASTROIN…      n         n          12
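
The key point of this exercise is that a subject counts once per SOC, no matter how many events they had. A hand-rolled sketch with dplyr on hypothetical data shows the counting logic; it assumes per-arm denominators taken from ADSL and is not how ard_stack_hierarchical() is implemented.

```r
library(dplyr)

# Hypothetical data: subject S1 has two events in the same SOC
adsl_toy <- tibble(
  USUBJID = c("S1", "S2", "S3", "S4"),
  ARM = c("Placebo", "Placebo", "Active", "Active")
)
adae_toy <- tibble(
  USUBJID = c("S1", "S1", "S3"),
  ARM = c("Placebo", "Placebo", "Active"),
  AESOC = c("CARDIAC DISORDERS", "CARDIAC DISORDERS", "CARDIAC DISORDERS")
)

# Denominator: subjects per arm, taken from ADSL
big_n <- count(adsl_toy, ARM, name = "N")

# Numerator: each subject counts once per SOC, however many events they had
soc_rates <- adae_toy |>
  distinct(USUBJID, ARM, AESOC) |>
  count(ARM, AESOC, name = "n") |>
  left_join(big_n, by = "ARM") |>
  mutate(p = n / N)

soc_rates
```

Taking the denominator from ADSL rather than ADAE keeps subjects without any adverse event in the percentage.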

Code

# B. [*BONUS*] Modify the code from part A to include overall number/percentage of
# subjects with at least one AE, regardless of SOC and PT

ard_stack_hierarchical(
  data = adae,
  variables = c(AESOC, AEDECOD),
  by = ARM, 
  id = USUBJID,
  denominator = adsl,
  over_variables = TRUE
)

   group1 group1_level group2 group2_level variable                     variable_level stat_name stat_label stat
1                                          ARM                          Placebo        n         n          86
2                                          ARM                          Placebo        N         N          254
3                                          ARM                          Placebo        p         %          0.339
4                                          ARM                          Xanomeli…      n         n          84
5                                          ARM                          Xanomeli…      N         N          254
6                                          ARM                          Xanomeli…      p         %          0.331
7                                          ARM                          Xanomeli…      n         n          84
8                                          ARM                          Xanomeli…      N         N          254
9                                          ARM                          Xanomeli…      p         %          0.331
10 ARM    Placebo                          ..ard_hierarchical_overall.. TRUE           n         n          31

tfrmt - Nicely formatting ARDs

Code

# Table Exercise: AE summary table using {tfrmt}

# For this exercise, we will use the AE ARD from the last section to
# create a {tfrmt} table


# Setup: run this first! --------------------------------------------------

## Load necessary packages
library(cards)
library(dplyr)
library(tidyr)
library(tfrmt)

## Import & subset data
adsl <- pharmaverseadam::adsl |> 
  dplyr::filter(SAFFL=="Y")

adae <- pharmaverseadam::adae |> 
  dplyr::filter(SAFFL=="Y") |> 
  dplyr::filter(AESOC %in% unique(AESOC)[1:3]) |> 
  dplyr::group_by(AESOC) |> 
  dplyr::filter(AEDECOD %in% unique(AEDECOD)[1:3]) |> 
  dplyr::ungroup()

## Create AE Summary using cards
ard_ae <- ard_stack_hierarchical(
  data = adae,
  variables = c(AESOC, AEDECOD),
  by = ARM, 
  id = USUBJID,
  denominator = adsl,
  over_variables = TRUE,
  statistic = ~ c("n", "p")
) 


# Exercise ----------------------------------------------------------------

# A. Convert `cards` object into a tidy data frame ready for {tfrmt}. 
#    Nothing to do besides run each step & explore the output!

ard_ae_tidy <- ard_ae |> 
  shuffle_card(fill_hierarchical_overall = "ANY EVENT") |> 
  prep_big_n(vars = "ARM") |> 
  prep_hierarchical_fill(vars = c("AESOC","AEDECOD"),
                         fill_from_left = TRUE) |> 
  dplyr::select(-c(context, stat_label, stat_variable)) 


# B. Create a basic tfrmt, filling in the missing variable names

ae_tfrmt <- tfrmt(
  group = AESOC,
  label = AEDECOD,
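  param = stat_name, # statistic name column in the ARD
  value = stat,      # statistic value column in the ARD
  column = ARM,      # treatment arm defines the table columns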
  body_plan = body_plan(
    frmt_structure(group_val = ".default", label_val = ".default", 
                   frmt_combine(
                     "{n} ({p}%)",
                     n = frmt("xx"),
                     p = frmt("xx", transform = ~ . *100)
                   )
    )
  ),
  big_n = big_n_structure(param_val = "bigN") 
) 

print_to_gt(ae_tfrmt,
            ard_ae_tidy)


# C. Switch the order of the columns so Placebo is last

ae_tfrmt <- ae_tfrmt |> 
  tfrmt(
    col_plan = col_plan(
      starts_with("Xanomeline"),
      "Placebo"
    )
  )  

print_to_gt(ae_tfrmt, ard_ae_tidy)


# D. Add a title and source note for the table

ae_tfrmt <- ae_tfrmt |> 
  tfrmt(
    title = "", # fill
    footnote_plan = footnote_plan(
      footnote_structure("") # fill with footnote text
    ) 
  )

print_to_gt(ae_tfrmt, ard_ae_tidy)
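
The "{n} ({p}%)" template in the body plan combines a count with a proportion that is first multiplied by 100. A rough base R sketch of that arithmetic for a single cell, using hypothetical values; tfrmt also handles padding and alignment, which this sketch ignores.

```r
# Hypothetical count and proportion for one table cell
n <- 12
p <- 0.1395

# Scale the proportion to a percentage, round to whole numbers, then combine
cell <- sprintf("%d (%d%%)", as.integer(n), as.integer(round(p * 100)))
cell
```

The proportion stays on the 0 to 1 scale in the ARD, and the scaling happens only at display time.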

gtsummary - more tables

How to adopt gtsummary at your company
Large user base, catch edge cases

teal - helps build Shiny apps

https://insightsengineering.github.io/teal/latest-tag/

How to contribute

use the package
write a blog or create a template
submit issues on GitHub
join as a contributor