| Title: | High-Performance Phenotypic Data Pipelines for Breeding |
|---|---|
| Description: | A streamlined toolkit specifically designed for genomic selection and quantitative genetics in animal breeding. It provides high-performance data manipulation backed by 'data.table', focusing on multi-breed and multi-trait nested grouping operations. Features include zero-copy data importing, automated cross-validation splitting, and robust tools to generate and batch-export formatted phenotypic files required by various breeding software (e.g., 'ASReml-R', 'DMU', 'BLUPf90'), heavily optimizing iterative variance component analysis and large-scale evaluation pipelines. |
| Authors: | Guo Meng [aut, cre], Guo Meng [cph] |
| Maintainer: | Guo Meng <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.3 |
| Built: | 2026-05-25 17:08:07 UTC |
| Source: | https://github.com/tony2015116/mintyr |
Generates combinations of specified columns and creates a nested data
structure based on these pairs. Each nested subset renames the combined
columns to value1, value2, ... (up to pairs_n) to
support uniform iterative analyses such as genetic correlation estimation.
c2p_nest(data, cols2bind, by = NULL, pairs_n = 2L, sep = "-", nest_type = "dt")
data |
A data.frame or data.table to be transformed. |
cols2bind |
A character vector of column names or a numeric vector of
column indices to be combined into pairs. Must not overlap with |
by |
A character vector of column names or a numeric vector of column
indices to group by. Default is |
pairs_n |
A positive integer >= 2 indicating the size of each column
combination (e.g., 2 for pairwise). Default is |
sep |
A single character string used as a separator when constructing
the |
nest_type |
A character string specifying the class of each nested
object: |
The columns specified in cols2bind are renamed to value1,
value2, ... within each nested subset. The original column names are
preserved in the pairs column (e.g., "Sepal.Length-Sepal.Width"),
ensuring full traceability for downstream iterative analyses such as genetic
correlation estimation.
Columns that belong to neither cols2bind nor by (referred to
internally as "extra columns") are retained inside the nested subsets so
that covariates or ID fields remain accessible. Grouping columns (by)
are not duplicated inside the nested data because they are already
present as outer key columns in the returned table.
When the number of requested combinations exceeds 500 a message is emitted; above 5000 a warning is raised, as memory usage grows linearly with the combination count.
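The pairing and the size thresholds described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `pair_labels` is a hypothetical helper, and the only library call assumed is `utils::combn()`.

```r
# Hypothetical helper: build the labels that would populate a `pairs` column.
pair_labels <- function(cols, pairs_n = 2L, sep = "-") {
  # combn() returns a matrix with one size-`pairs_n` combination per column
  combos <- utils::combn(cols, pairs_n)
  # Collapse each combination into a single label, keeping input order
  apply(combos, 2, paste, collapse = sep)
}

cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
labels <- pair_labels(cols)
print(labels)

# The number of combinations is choose(length(cols), pairs_n)
n_combos <- choose(length(cols), 2L)
print(n_combos)

# With 40 columns there are 780 pairs, which is above the 500 message threshold
print(choose(40L, 2L))
```

Checking `choose()` before calling the function is a cheap way to anticipate whether the message or warning threshold will be crossed.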
A data.table with columns:
pairs: Character. The column-combination identifier, e.g.
"Sepal.Length-Sepal.Width".
Any by grouping columns, one per variable.
data: List-column. Each cell holds a data.table (or data.frame when
nest_type = "df") containing value1, value2, ...,
plus any extra columns that were neither in cols2bind nor
by.
combn for the underlying combination generator.
# Example data preparation: Define column names for combination
col_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length")

# Example 1: Basic column-to-pairs nesting with custom separator
c2p_nest(
  iris,                   # Input iris dataset
  cols2bind = col_names,  # Columns to be combined as pairs
  pairs_n = 2,            # Create pairs of 2 columns
  sep = "&"               # Custom separator for pair names
)
# Returns a nested data.table where:
# - pairs: combined column names (e.g., "Sepal.Length&Sepal.Width")
# - data: list column containing data.tables with value1, value2 columns

# Example 2: Column-to-pairs nesting with numeric indices and grouping
c2p_nest(
  iris,                   # Input iris dataset
  cols2bind = 1:3,        # First 3 columns to be combined
  pairs_n = 2,            # Create pairs of 2 columns
  by = 5                  # Group by 5th column (Species)
)
# Returns a nested data.table where:
# - pairs: combined column names
# - Species: grouping variable
# - data: list column containing data.tables grouped by Species
Exports every element of a named (or unnamed) list of data.frame /
data.table objects to txt or csv files. Element names may
contain forward-slashes (/) to encode arbitrary subdirectory depth, e.g.
"group_a/subject_01/results" writes
<export_path>/group_a/subject_01/results.txt.
Unnamed elements are automatically labelled split_<i>.
export_list(split_dt, export_path = tempdir(), file_type = "txt")
split_dt |
A non-empty |
export_path |
Single character string - the root export directory.
Created recursively if absent. Defaults to |
file_type |
|
Performance design:
All element names are resolved and path components split in a single vectorised pass before the write loop, so no string work occurs inside the hot path.
Unique subdirectories are collected and created in one batch
(k dir.create() calls, where k <= n).
The field separator is resolved once at function entry.
as.data.table() on an existing data.table is a
reference-pass (no copy).
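The name-to-path resolution described above can be sketched in base R. This is an illustration only, not the package's code: the element names, the `export_sketch` directory, and the variable names are all assumptions made for the example.

```r
# Hypothetical element names; the last element is unnamed
nms <- c(
  "group_a/subject_01/results",
  "group_a/subject_01/summary",
  "group_a/subject_02/results",
  ""
)
root <- file.path(tempdir(), "export_sketch")

# Label unnamed elements split_<i>, as the description states
unnamed <- !nzchar(nms)
nms[unnamed] <- paste0("split_", which(unnamed))

# Resolve every target path in one vectorised pass, before any writing
targets <- file.path(root, paste0(nms, ".txt"))

# Create only the k unique parent directories (here k = 3 for n = 4 elements)
dirs <- unique(dirname(targets))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

print(basename(targets))
print(length(dirs))
```

Because the directories are derived from the full set of targets up front, the write loop itself only has to open files.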
Error handling:
Individual element failures emit a warning and are skipped; the
remaining elements continue to be processed.
An invisible named character vector of the absolute file paths
written, with length equal to the number of successfully exported elements.
The total count is accessible via length() on the return value.
Requires the data.table package.
# Example: Export split data to files
# Step 1: Create split data structure
dt_split <- w2l_split(
  data = iris,     # Input iris dataset
  cols2l = 1:2,    # Columns to be split
  by = "Species"   # Grouping variable
)

# Step 2: Export split data to files
export_list(
  split_dt = dt_split  # Input list of data.tables
)
# Invisibly returns the paths of the files written
# Files are saved in tempdir() with .txt extension

# Check exported files
list.files(
  path = tempdir(),  # Default export directory
  pattern = "txt",   # File type pattern to search
  recursive = TRUE   # Search in subdirectories
)

# Clean up exported files
files <- list.files(
  path = tempdir(),   # Default export directory
  pattern = "txt",    # File type pattern to search
  recursive = TRUE,   # Search in subdirectories
  full.names = TRUE   # Return full file paths
)
file.remove(files)    # Remove all exported files
Exports list-columns containing data.frame or data.table objects from a
data.frame/data.table to txt or csv files, automatically
constructing a hierarchical directory structure from non-nested columns.
Exportable nested columns (those holding data.frame/data.table elements)
are distinguished from non-exportable custom-object columns (e.g. rsplit from
the rsample package); only the former are written to disk by default.
export_nest(nest_dt, group_cols = NULL, nest_cols = NULL, export_path = tempdir(), file_type = "txt")
nest_dt |
A |
group_cols |
Optional character vector of column names used to build the
hierarchical output directory structure. When |
nest_cols |
Optional character vector of nested column names to export. When
|
export_path |
Single character string specifying the root export directory.
Defaults to |
file_type |
Either |
Nested column classification (mutually exclusive):
Exportable — every element inherits from data.frame or data.table.
Non-exportable — empty lists or elements of any other class
(e.g. rsplit, vfold_split). Reported to the console; never written.
Directory layout:
export_path / <group1_value> / <group2_value> / <nest_col_name>.<file_type>
Performance notes:
Row data is accessed via .subset2() (zero-copy column access) rather than
nest_dt[i], eliminating per-row data.table allocation in the hot loop.
All n output directory paths are pre-computed in a single vectorised
do.call(file.path, ...) call before the loop; only the k unique paths
are then passed to dir.create(), replacing n syscalls with k
(k <= n; often k << n when many rows share the same group).
The field separator and output filenames are computed once before the loop.
seq_len() is used instead of 1:n to avoid the 1:0 edge-case bug.
All list-column introspection uses vapply with explicit FUN.VALUE to
guarantee return types and prevent silent coercion.
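Three of the points above can be sketched in base R. The grouping values and the `nest_sketch` directory are hypothetical, and the snippet illustrates the general techniques rather than the package's internals.

```r
# Hypothetical grouping columns for four nested rows
groups <- list(
  name    = c("Sepal.Length", "Sepal.Length", "Sepal.Width", "Sepal.Width"),
  Species = c("setosa", "setosa", "setosa", "virginica")
)
root <- file.path(tempdir(), "nest_sketch")

# One vectorised file.path() call yields all n directory paths
dir_paths <- do.call(file.path, c(list(root), unname(groups)))

# Only the k unique paths would need to be passed to dir.create()
unique_dirs <- unique(dir_paths)
print(length(dir_paths))
print(length(unique_dirs))

# seq_len(0) is empty, whereas 1:0 counts down and would run a loop twice
print(length(seq_len(0L)))
print(1:0)

# vapply() with FUN.VALUE returns a logical vector even for an empty list
is_df <- vapply(list(iris, 1:3), is.data.frame, FUN.VALUE = logical(1))
empty <- vapply(list(), is.data.frame, FUN.VALUE = logical(1))
print(is_df)
print(empty)
```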
An invisible integer giving the total number of files successfully written.
Returns 0L when no exportable columns are found or all nested data are empty/NULL.
Requires the data.table package for data manipulation and file I/O (fwrite).
# Example 1: Basic nested data export workflow
# Step 1: Create nested data structure
dt_nest <- w2l_nest(
  data = iris,     # Input iris dataset
  cols2l = 1:2,    # Columns to be nested
  by = "Species"   # Grouping variable
)

# Step 2: Export nested data to files
export_nest(
  nest_dt = dt_nest,                 # Input nested data.table
  nest_cols = "data",                # Column containing nested data
  group_cols = c("name", "Species")  # Columns to create directory structure
)
# Returns the number of files created
# Creates directory structure: tempdir()/name/Species/data.txt

# Check exported files
list.files(
  path = tempdir(),  # Default export directory
  pattern = "txt",   # File type pattern to search
  recursive = TRUE   # Search in subdirectories
)
# Returns list of created files and their paths

# Clean up exported files
files <- list.files(
  path = tempdir(),   # Default export directory
  pattern = "txt",    # File type pattern to search
  recursive = TRUE,   # Search in subdirectories
  full.names = TRUE   # Return full file paths
)
file.remove(files)    # Remove all exported files
Format Numeric Columns to Fixed-Decimal Character Strings
format_digits(data, cols = NULL, digits = 2L, percentage = FALSE, nan_as_na = FALSE)
| Argument | Description |
|---|---|
| data | A data.frame or data.table. The input dataset. |
| cols | A character or integer vector specifying columns to format. If NULL (default), all numeric columns are formatted. |
| digits | A non-negative integer specifying decimal places. Defaults to 2. |
| percentage | Logical. If TRUE, values are multiplied by 100 and a "%" sign is appended. Defaults to FALSE. |
| nan_as_na | Logical. If TRUE, NaN is treated identically to NA and coerced to NA_character_. If FALSE (default), NaN is preserved as the string "NaN". |
The function processes columns in the following order:
Validates all input parameters with informative error messages.
Copies the input only once: data.table inputs are deep-copied via
copy(); data.frame inputs are copied implicitly by
as.data.table(), avoiding a redundant second copy.
Resolves cols to a character vector of valid numeric column
names, warning and skipping any non-numeric columns specified.
Applies a vectorised formatting function via
lapply(.SD, fn) and :=, so all target columns are
dispatched in a single data.table call rather than a column-by-column
loop.
NA and NaN handling:
NA_real_ is always returned as NA_character_.
NaN is returned as "NaN" by default. Set
nan_as_na = TRUE to coerce it to NA_character_ instead.
Rounding uses explicit round() before sprintf() to guarantee
consistent results across platforms (Windows, Linux, macOS), where the
underlying C library's rounding behaviour may otherwise differ.
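The NA/NaN rules and the round-then-format order above can be sketched for a single numeric vector. `fmt_fixed` is a hypothetical helper written for this illustration; it mirrors the documented behaviour but is not the package's code.

```r
# Hypothetical helper mirroring the documented NA/NaN and rounding rules
fmt_fixed <- function(x, digits = 2L, percentage = FALSE, nan_as_na = FALSE) {
  scale <- if (percentage) 100 else 1
  # Round explicitly first, so sprintf() only pads to `digits` decimal places
  out <- sprintf(paste0("%.", digits, "f"), round(x * scale, digits))
  if (percentage) out <- paste0(out, "%")
  # NaN is kept as the string "NaN" unless nan_as_na is TRUE
  out[is.nan(x)] <- if (nan_as_na) NA_character_ else "NaN"
  # True NA values always become NA_character_
  out[is.na(x) & !is.nan(x)] <- NA_character_
  out
}

x <- c(0.1234, 0.5678, NA, NaN)
plain <- fmt_fixed(x)
pct <- fmt_fixed(x, percentage = TRUE, nan_as_na = TRUE)
print(plain)
print(pct)
```

Note that `is.na()` is TRUE for NaN as well, which is why the NA branch excludes NaN explicitly.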
A data.table with the specified numeric columns formatted as character strings. The original object is never modified.
data must be a data.frame or data.table.
Integer column indices in cols are converted to column names
internally; duplicates are silently removed.
digits accepts numeric values such as 2.0 and coerces
them to integer; non-integer-valued numbers raise an error.
The function depends only on data.table and base R.
# Example: Number formatting demonstrations
# Setup test data
dt <- data.table::data.table(
  a = c(0.1234, 0.5678),   # Numeric column 1
  b = c(0.2345, 0.6789),   # Numeric column 2
  c = c("text1", "text2")  # Text column
)

# Example 1: Format all numeric columns
format_digits(
  dt,         # Input data table
  digits = 2  # Round to 2 decimal places
)

# Example 2: Format specific column as percentage
format_digits(
  dt,                # Input data table
  cols = c("a"),     # Only format column 'a'
  digits = 2,        # Round to 2 decimal places
  percentage = TRUE  # Convert to percentage
)
get_path_info is a merged, upgraded replacement for get_path_segment and
get_filename. It operates in two modes:
Mode A (when n is specified): Extract a specific path segment by
position, supporting forward indexing, reverse indexing, and range extraction.
Mode B (when n = NULL): Extract the filename, with optional removal
of the file extension and/or the directory prefix.
get_path_info(paths, n = NULL, rm_extension = TRUE, rm_path = TRUE)
paths |
A |
n |
A
|
rm_extension |
A
|
rm_path |
A |
Path normalisation (internal, fully vectorised):
All backslashes and consecutive slashes are collapsed to a single /.
Windows drive letter prefixes (C:, D:, etc.) are stripped.
Leading and trailing / characters are removed.
Paths that are empty after the above steps (e.g. original inputs "C:/",
"/", "") are coerced to NA_character_.
Extension-stripping behaviour (internal .strip_ext helper):
| Input | Output | Notes |
|---|---|---|
| "report.txt" | "report" | Standard file: last extension removed |
| "data.tar.gz" | "data.tar" | Compound extension: only last level removed |
| ".bashrc" | ".bashrc" | Pure dot-file (no second dot): unchanged |
| ".report.xlsx" | ".report" | Dot-file with extension: extension removed |
| "no_ext" | "no_ext" | No extension: returned as-is |
| "file." | "file." | Trailing isolated dot: returned as-is |
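One regular expression that reproduces every row of the table above is sketched below. `strip_ext_sketch` is a hypothetical stand-in; the internal `.strip_ext` helper may be implemented differently.

```r
# Hypothetical stand-in for the internal helper: drop only the last extension.
# The pattern needs at least one character before the final dot and at least
# one non-dot character after it, so ".bashrc" and "file." are left untouched.
strip_ext_sketch <- function(x) sub("^(.+)\\.[^.]+$", "\\1", x)

inputs <- c("report.txt", "data.tar.gz", ".bashrc", ".report.xlsx", "no_ext", "file.")
stripped <- strip_ext_sketch(inputs)
print(stripped)
```

Because `.+` is greedy, only the final `.ext` component is removed from compound extensions such as `data.tar.gz`.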
NA safety:
strsplit(NA_character_, ...) returns list(NA) with length 1, not
character(0). Consequently, every vapply callback guards against NA paths
with an explicit anyNA(x) check rather than length(x) == 0.
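The behaviour of `strsplit()` on missing input is easy to confirm in base R. In the sketch below, `nth_segment` is a hypothetical callback used only to show why the guard has to test for NA rather than for zero length.

```r
# Splitting an NA path gives a length-1 element holding NA, not character(0)
parts <- strsplit(c("a/b/c", NA_character_), "/", fixed = TRUE)
print(lengths(parts))

# Hypothetical callback: return the n-th segment, guarding NA explicitly
nth_segment <- function(x, n) {
  if (anyNA(x) || n > length(x)) NA_character_ else x[[n]]
}

second <- vapply(parts, nth_segment, FUN.VALUE = character(1), n = 2L)
print(second)
```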
A character vector of the same length as paths:
Returns the extracted segment string when the segment exists.
Returns NA_character_ when the segment index exceeds the path depth,
the input element is NA, or the path reduces to empty after normalisation
(e.g. "C:/", "/").
base::basename(), tools::file_path_sans_ext()
paths <- c(
  "C:/Users/foo/Documents/report.xlsx",
  "/home/user/.bashrc",
  "relative/path/to/data.csv",
  ".hidden.tar.gz",
  NA_character_
)

# Mode B: filename only, extension stripped (default)
get_path_info(paths)

# Mode B: filename only, extension preserved
get_path_info(paths, rm_extension = FALSE)

# Mode B: full normalised path, extension stripped
get_path_info(paths, rm_path = FALSE)

# Mode A: extract the 2nd path segment
get_path_info(paths, n = 2)

# Mode A: extract the last segment with extension stripped (n = -1 linkage)
get_path_info(paths, n = -1, rm_extension = TRUE)

# Mode A: range extraction
get_path_info(paths, n = c(2, 3))
Reads one or more CSV/TXT files using fread as the backend.
Supports flexible combination strategies and source-file tracking. All return values
are data.table objects.
import_csv(file, rbind = TRUE, rbind_label = "_file", full_path = FALSE, keep_ext = FALSE, ...)
file |
A non-empty |
rbind |
A
|
rbind_label |
A |
full_path |
A
|
keep_ext |
A
|
... |
Additional arguments passed directly to |
Label generation is controlled by the combination of full_path and keep_ext:
| Settings | Label |
|---|---|
| full_path = FALSE, keep_ext = FALSE | Filename without extension: "data" |
| full_path = FALSE, keep_ext = TRUE | Filename with extension: "data.csv" |
| full_path = TRUE, keep_ext = FALSE | Full path without extension: "/path/to/data" |
| full_path = TRUE, keep_ext = TRUE | Full path with extension: "/path/to/data.csv" |
When rbind = TRUE and rbind_label is not NULL,
rbindlist is called with idcol = rbind_label,
which generates the source column directly during the merge step without any
intermediate copies.
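The four label variants listed above can be reproduced with base R and the `tools` package. `make_label` is a hypothetical helper written for illustration; the package may derive its labels by other means.

```r
# Hypothetical helper reproducing the four full_path / keep_ext combinations
make_label <- function(path, full_path = FALSE, keep_ext = FALSE) {
  out <- if (full_path) path else basename(path)
  if (!keep_ext) out <- tools::file_path_sans_ext(out)
  out
}

p <- "/path/to/data.csv"
labels <- c(
  make_label(p),
  make_label(p, keep_ext = TRUE),
  make_label(p, full_path = TRUE),
  make_label(p, full_path = TRUE, keep_ext = TRUE)
)
print(labels)
```

Labels built this way are the kind of values that would appear in the column named by `rbind_label`, or as list names when `rbind = FALSE`.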
rbind = TRUE: A single data.table containing all imported rows.
If rbind_label is not NULL, the first column contains the source
file label for each row.
rbind = FALSE: A named list of data.table objects.
List names are derived from file paths according to full_path and
keep_ext settings.
All specified files must exist and be readable at call time.
rbind = TRUE assumes compatible column structures across files;
mismatched columns are automatically aligned via fill = TRUE.
Logical parameters (rbind, full_path, keep_ext) reject
NA values explicitly.
# Example: CSV file import demonstrations
# Setup test files
csv_files <- mintyr_example(
  mintyr_examples("csv_test")  # Get example CSV files
)

# Example 1: Import and combine CSV files using data.table
import_csv(
  csv_files,              # Input CSV file paths
  rbind = TRUE,           # Combine all files into one data.table
  rbind_label = "_file",  # Column name for file source
  keep_ext = TRUE,        # Include .csv extension in _file column
  full_path = TRUE        # Show complete file paths in _file column
)
A high-performance function for importing data from one or multiple Excel
files into data.table format, with fine-grained control over source
tracking columns, sheet selection, and row skipping.
Performance characteristics (v3 baseline, unchanged here):
excel_sheets() called exactly once per file (cached).
setDT() converts tibbles in-place — zero vector copies.
Tracking columns injected via := on small per-sheet tables
before the single final rbindlist.
data.table OpenMP thread pool optionally widened and always
restored on exit.
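The "widen, then always restore" pattern in the last point is commonly written with `on.exit()`. The sketch below uses the base R `digits` option as a stand-in for the thread count, so it runs without data.table; `with_temporary_setting` is a hypothetical name, and the real function presumably applies the same idea through `data.table::setDTthreads()`.

```r
# Stand-in setting: the `digits` option plays the role of the thread count
with_temporary_setting <- function(value, code) {
  old <- options(digits = value)      # options() returns the previous values
  on.exit(options(old), add = TRUE)   # restored even if `code` raises an error
  force(code)                         # evaluate the lazy argument under the new setting
}

before <- getOption("digits")
inside <- with_temporary_setting(3L, getOption("digits"))
after <- getOption("digits")
print(c(before = before, inside = inside, after = after))

# The setting is also restored when the wrapped code fails
try(with_temporary_setting(3L, stop("boom")), silent = TRUE)
after_error <- getOption("digits")
print(after_error)
```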
import_xlsx(file, rbind = TRUE, sheet = NULL, skip = 0L, show_excel_name = TRUE, show_sheet_name = TRUE, verbose = FALSE, dt_threads = data.table::getDTthreads(), ...)
file |
Non-empty |
rbind |
|
sheet |
Positive |
skip |
Non-negative |
show_excel_name |
|
show_sheet_name |
|
verbose |
|
dt_threads |
|
... |
Additional arguments forwarded to
|
rbind = TRUE: A data.table. Tracking columns
excel_name and/or sheet_name are prepended when their
respective show_* flags are TRUE.
rbind = FALSE: A named list of data.tables,
each element named "<filename>_<sheetname>". The list carries a
"source_files" attribute with the original file paths.
# Example: Excel file import demonstrations
# Setup test files
xlsx_files <- mintyr_example(
  mintyr_examples("xlsx_test")  # Get example Excel files
)

# Example 1: Import and combine all sheets from all files
import_xlsx(
  xlsx_files,    # Input Excel file paths
  rbind = TRUE   # Combine all sheets into one data.table
)

# Example 2: Import specific sheets separately
import_xlsx(
  xlsx_files,     # Input Excel file paths
  rbind = FALSE,  # Keep sheets as separate data.tables
  sheet = 2       # Only import the second sheet
)
mintyr comes bundled with a number of sample files in
its inst/extdata directory. Use mintyr_example() to retrieve the full file path to a
specific example file.
mintyr_example(path = NULL)
path |
Name of the example file to locate. If NULL or missing, returns the directory path containing the examples. |
Character string containing the full path to the requested example file.
mintyr_examples() to list all available example files
# Get path to an example file
mintyr_example("csv_test1.csv")
mintyr comes bundled with a number of sample files in its inst/extdata
directory. This function lists all available example files, optionally filtered
by a pattern.
mintyr_examples(pattern = NULL)
pattern |
A regular expression to filter filenames. If |
A character vector containing the names of example files. If no files match the pattern or if the example directory is empty, returns a zero-length character vector.
mintyr_example() to get the full path of a specific example file
# List all example files
mintyr_examples()
nest_cv applies rsample::vfold_cv to each nested data frame within a
data.table, returning an expanded result table containing the corresponding
training and validation splits for each row.
nest_cv(nest_dt, v = 10L, repeats = 1L, strata = NULL, breaks = 4L, pool = 0.1, ...)
nest_dt |
A |
v |
Number of folds. Must be an integer >= 2. Default is |
repeats |
Number of repeats. Must be an integer >= 1. Default is |
strata |
A single character string specifying the stratification column
name. Set to |
breaks |
Number of bins for stratifying a numeric variable. Only used
when |
pool |
Proportion threshold for pooling small strata. Only used when
|
... |
Additional arguments passed to |
The function performs the following steps:
Validates that nest_dt is a non-empty data.frame or data.table
with at least one nested column whose elements all inherit from
data.frame.
Selects the target nested column: prefers a column named "data";
otherwise falls back to the first detected nested column.
When strata is specified, verifies that the column exists in every
nested data frame before calling rsample::vfold_cv.
Iterates over each row, applies vfold_cv via do.call, expands the
resulting splits into a data.table, and broadcasts the row's
non-nested metadata columns across all CV rows.
Combines all per-row results with rbindlist in a single pass.
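The per-row loop above can be sketched without rsample. In the snippet below a simple random fold assignment stands in for `rsample::vfold_cv()`, and `assign_folds` and the two-group input are assumptions made for the example; the point is only that folds are drawn separately for each nested row and the row's metadata is broadcast across its CV rows.

```r
# Stand-in for rsample::vfold_cv(): assign each row to one of v folds at random
assign_folds <- function(n, v) sample(rep(seq_len(v), length.out = n))

set.seed(1)
# Hypothetical nested input: one data frame per group
nested <- list(
  setosa    = iris[iris$Species == "setosa", 1:2],
  virginica = iris[iris$Species == "virginica", 1:2]
)
v <- 2L

per_group <- lapply(names(nested), function(g) {
  d <- nested[[g]]
  fold <- assign_folds(nrow(d), v)   # drawn independently for every group
  data.frame(
    Species = g,                     # metadata broadcast across the CV rows
    id      = paste0("Fold", seq_len(v)),
    n_train = vapply(seq_len(v), function(k) sum(fold != k), integer(1)),
    n_valid = vapply(seq_len(v), function(k) sum(fold == k), integer(1))
  )
})

# Combine all per-group results in a single step
result <- do.call(rbind, per_group)
print(result)
```

The real function stores the split objects and the training and validation data frames in list columns rather than the row counts shown here.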
A data.table with the following columns:
All non-nested columns from nest_dt (broadcast across CV rows).
splits — cross-validation split objects from rsample::vfold_cv.
id (and id2 for repeated CV) — fold identifiers.
train — list column of training data frames for each split.
validate — list column of validation data frames for each split.
nest_dt must contain at least one nested column of data.frames
or data.tables.
as.data.table() is used instead of data.table::copy(): if the
input is already a data.table, no copy is made.
strata must be a column name present in all nested data frames.
breaks and pool are forwarded to rsample::vfold_cv only when
strata is non-NULL, avoiding invalid argument errors.
The per-row loop with rbindlist corrects a silent bug in naive
chained [ approaches where all rows incorrectly shared the first
row's CV splits.
rsample::vfold_cv() — underlying cross-validation function
rsample::training() — extract training set from a split
rsample::testing() — extract test/validation set from a split
# Example: Cross-validation for nested data.table demonstrations
# Setup test data
dt_nest <- w2l_nest(
  data = iris,   # Input dataset
  cols2l = 1:2   # Nest first 2 columns
)

# Example 1: Basic 2-fold cross-validation
nest_cv(
  nest_dt = dt_nest,  # Input nested data.table
  v = 2               # Number of folds (2-fold CV)
)

# Example 2: Repeated 2-fold cross-validation
nest_cv(
  nest_dt = dt_nest,  # Input nested data.table
  v = 2,              # Number of folds (2-fold CV)
  repeats = 2         # Number of repetitions
)
Converts row pairs into nested data structures. It iterates over the
target variables, preserving non-target contextual variables in each
nested subset, and uses data.table's native dcast for the reshaping
step.
r2p_nest(data, rows2bind, by, nest_type = "dt")
data |
Input |
rows2bind |
A character column name or numeric index to be used as row values. |
by |
A character vector or numeric vector of column indices to transform. |
nest_type |
Output nesting format ( |
A nested data.table containing name and data columns, with
all contextual features preserved inside the nested structures.
# Example: Row-to-pairs nesting with column names
r2p_nest(
  mtcars,
  rows2bind = "cyl",
  by = c("hp", "drat", "wt")
)

# Example 1: Row-to-pairs nesting with column names
r2p_nest(
  mtcars,                     # Input mtcars dataset
  rows2bind = "cyl",          # Column to be used as row values
  by = c("hp", "drat", "wt")  # Columns to be transformed into pairs
)
# Returns a nested data.table where:
# - name: variable names (hp, drat, wt)
# - data: list column containing data.tables with rows grouped by cyl values

# Example 2: Row-to-pairs nesting with numeric indices
r2p_nest(
  mtcars,         # Input mtcars dataset
  rows2bind = 2,  # Use 2nd column (cyl) as row values
  by = 4:6        # Use columns 4-6 (hp, drat, wt) for pairs
)
# Returns a nested data.table where:
# - name: variable names from columns 4-6
# - data: list column containing data.tables with rows grouped by cyl values
split_cv applies rsample::vfold_cv to each dataset in a named or
unnamed list, returning a list of data.table objects that each contain
the CV split objects alongside the corresponding training and validation
sets.
    split_cv(
      split_dt,
      v = 10L,
      repeats = 1L,
      strata = NULL,
      breaks = 4L,
      pool = 0.1,
      ...
    )
| Argument | Description |
|---|---|
| split_dt | A named or unnamed list of datasets. |
| v | Number of folds. Must be a single integer >= 2. Default is 10L. |
| repeats | Number of repeats. Must be a single integer >= 1. Default is 1L. |
| strata | A single character string naming the stratification column. The column must exist in every dataset. Set to NULL (the default) for unstratified CV. |
| breaks | Number of bins when stratifying a numeric variable. Used only when strata is non-NULL. Default is 4L. |
| pool | Proportion threshold for pooling small strata. Used only when strata is non-NULL. Default is 0.1. |
| ... | Additional arguments forwarded to rsample::vfold_cv(). |
For each dataset in split_dt the function:
1. Validates inputs once before entering the processing loop.
2. Builds a vfold_cv argument list, appending stratification parameters only when strata is non-NULL to avoid passing unsupported arguments to rsample.
3. Converts the rsample tibble to a data.table in a single as.data.table() call, preserving all fold-identifier columns (id, id2) without hard-coding on the value of repeats.
4. Appends train and validate list-columns by reference via :=.
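The steps above could be sketched for a single dataset roughly as follows. This is a minimal illustration under stated assumptions, not the package's source: cv_one is a hypothetical helper name, and the rsample and data.table packages are assumed to be installed.

```r
library(data.table)
library(rsample)

# Hypothetical helper (not part of the package): cross-validate one dataset.
cv_one <- function(dt, v = 10L, repeats = 1L, strata = NULL,
                   breaks = 4L, pool = 0.1) {
  # Validate inputs up front.
  stopifnot(is.data.frame(dt), v >= 2L, repeats >= 1L)
  if (!is.null(strata)) stopifnot(strata %in% names(dt))

  # Build the vfold_cv() argument list; stratification arguments are
  # appended only when strata is supplied.
  args <- list(data = dt, v = v, repeats = repeats)
  if (!is.null(strata)) {
    args <- c(args, list(strata = strata, breaks = breaks, pool = pool))
  }

  # Convert the rsample tibble to a data.table, keeping id (and id2 if present).
  folds <- as.data.table(do.call(rsample::vfold_cv, args))

  # Append train / validate list-columns by reference.
  folds[, train := lapply(splits, rsample::training)]
  folds[, validate := lapply(splits, rsample::testing)]
  folds[]
}

set.seed(1)
folds <- cv_one(iris, v = 3L, strata = "Species")
names(folds)
```

Building the argument list first and calling do.call() once keeps the stratified and unstratified paths in a single code path, which is one plausible reading of the second step.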
A list of data.table objects (one per input dataset), each containing:
- splits: rsample split objects.
- id: fold identifier when repeats = 1, repeat identifier when repeats > 1 (always present).
- id2: fold identifier (present only when repeats > 1).
- train: list-column of training data frames.
- validate: list-column of validation data frames.
The output list preserves the names of split_dt.
When strata is specified, it must exist in all datasets;
a missing column raises an error rather than silently falling back
to unstratified CV.
breaks and pool are forwarded to rsample::vfold_cv only
when strata is non-NULL, preventing invalid-argument errors.
as.data.table() on an already-data.table input is a no-op
(no copy is made).
rsample::vfold_cv() — underlying cross-validation function
rsample::training() — extract training set from a split
rsample::testing() — extract validation set from a split
nest_cv() — nested data.table variant of this utility
    # Prepare example data: convert the first 3 columns of iris to long format and split
    dt_split <- w2l_split(data = iris, cols2l = 1:3)
    # dt_split is now a list containing 3 data tables for
    # Sepal.Length, Sepal.Width, and Petal.Length

    # Example 1: Single cross-validation (no repeats)
    split_cv(
      split_dt = dt_split,   # Input list of split data
      v = 3,                 # Set 3-fold cross-validation
      repeats = 1            # Perform cross-validation once (no repeats)
    )
    # Returns a list where each element contains:
    # - splits: rsample split objects
    # - id: fold numbers (Fold1, Fold2, Fold3)
    # - train: training set data
    # - validate: validation set data

    # Example 2: Repeated cross-validation
    split_cv(
      split_dt = dt_split,   # Input list of split data
      v = 3,                 # Set 3-fold cross-validation
      repeats = 2            # Perform cross-validation twice
    )
    # Returns a list where each element contains:
    # - splits: rsample split objects
    # - id: repeat numbers (Repeat1, Repeat2)
    # - id2: fold numbers (Fold1, Fold2, Fold3)
    # - train: training set data
    # - validate: validation set data
Selects the top (largest) or bottom (smallest) percentage of data based on specified traits. Positive percentages extract the largest values; negative percentages extract the smallest values.
    top_perc(data, perc, trait, by = NULL, keep_data = FALSE)
| Argument | Description |
|---|---|
| data | A data.frame or data.table. |
| perc | A numeric vector strictly between -1 and 1 (excluding 0). Positive values (e.g., 0.05) select the top X% of largest values. Negative values (e.g., -0.1) select the bottom X% of smallest values. |
| trait | A character vector of column names to analyse. |
| by | A character vector of column names to group by. Default is NULL. |
| keep_data | Logical. If TRUE, returns a named list where each element contains both the summary statistics ($stat) and the selected data ($data). Default is FALSE. |
keep_data = FALSE: a data.frame with one row per
by / trait / perc combination, containing columns
n, min, max, mean, median, sd, se, cv, selection.
keep_data = TRUE: a named list (one element per perc
value) where each element is a list with $stat and $data.
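To make the sign convention and the summary columns concrete, here is a small base-R sketch. It is an illustration under stated assumptions rather than the package's implementation: select_perc and summarise_sel are hypothetical names, the cut-off is taken from quantile(), se is computed as sd / sqrt(n), and cv as sd / mean. The package may define any of these differently.

```r
# Hypothetical helper: keep the top (perc > 0) or bottom (perc < 0) share of x.
# Assumption: the threshold is a quantile of the non-missing values.
select_perc <- function(x, perc) {
  stopifnot(length(perc) == 1L, perc > -1, perc < 1, perc != 0)
  x <- x[!is.na(x)]
  if (perc > 0) {
    x[x >= stats::quantile(x, 1 - perc, names = FALSE)]
  } else {
    x[x <= stats::quantile(x, abs(perc), names = FALSE)]
  }
}

# Hypothetical helper: summary columns named as in the returned table.
# Assumptions: se = sd / sqrt(n) and cv = sd / mean.
summarise_sel <- function(x) {
  n <- length(x)
  data.frame(
    n = n, min = min(x), max = max(x), mean = mean(x),
    median = stats::median(x), sd = stats::sd(x),
    se = stats::sd(x) / sqrt(n), cv = stats::sd(x) / mean(x)
  )
}

top10 <- select_perc(iris$Petal.Width, 0.1)      # largest values
bottom10 <- select_perc(iris$Petal.Width, -0.1)  # smallest values
summarise_sel(top10)
```

With tied values at the threshold, a quantile-based cut-off can return slightly more than the nominal share of rows; a rank-based cut-off would behave differently.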
    # Example 1: Basic usage with single trait
    # This example selects the top 10% of observations based on Petal.Width
    # keep_data = TRUE returns both summary statistics and the filtered data
    top_perc(iris,
             perc = 0.1,                 # Select top 10%
             trait = c("Petal.Width"),   # Column to analyze
             keep_data = TRUE)           # Return both stats and filtered data

    # Example 2: Using grouping with 'by' parameter
    # This example performs the same analysis but separately for each Species
    # With the default keep_data = FALSE, returns a data.frame of summary
    # statistics with one row per group
    top_perc(iris,
             perc = 0.1,                 # Select top 10%
             trait = c("Petal.Width"),   # Column to analyze
             by = "Species")             # Group by Species
w2l_nest reshapes a wide-format data.frame or data.table into long
format, then nests the result by name (the pivoted column identifier) and
any optional grouping variables supplied via by. Each row of the returned
table contains a nested data.table or data.frame in the data list-column.
    w2l_nest(data, cols2l = NULL, by = NULL, nest_type = "dt")
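| Argument | Description |
|---|---|
| data | A wide-format data.frame or data.table. |
| cols2l | Columns to pivot into long format, given as integer column positions or character column names. Must not overlap with by. Default is NULL. |
| by | Optional grouping columns, given as integer column positions or character column names. Default is NULL. |
| nest_type | Class of each nested object: "dt" (data.table) or "df" (data.frame). Default is "dt". |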
Column resolution: both cols2l and by accept either integer column
positions or character column names. Out-of-bounds indices and unknown names
are caught early with informative error messages.
Overlap guard: if any column appears in both cols2l and by, the
function stops with an error before attempting to melt, preventing silent
structural corruption.
Factor-free melting: melt() is called with variable.factor = FALSE
so the name column is always character, avoiding unexpected factor-level
ordering in downstream grouping operations.
Memory efficiency:
- setDT() converts data.frame inputs by reference, with no full copy.
- .SDcols restricts .SD to non-key columns, eliminating redundant storage of grouping keys inside each nested object.
- For nest_type = "df", setattr(copy(.SD), "class", "data.frame") copies each subset once with copy() and then sets the class attribute by reference, so no separate as.data.frame() conversion pass is needed.
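The melt-then-nest pattern described above can be sketched with plain data.table calls. This is an illustrative approximation, not the function's actual body; the variable names (by_cols, keys, nested_dt, nested_df) are made up for the example, and copy(.SD) is used here as a defensive choice.

```r
library(data.table)

dt <- as.data.table(iris)            # work on a data.table copy of iris
cols2l <- c("Sepal.Length", "Sepal.Width")
by_cols <- "Species"

# Overlap guard: a column cannot be both pivoted and used for grouping.
stopifnot(length(intersect(cols2l, by_cols)) == 0L)

# Factor-free melt: the name column comes back as character.
long <- melt(dt, id.vars = by_cols, measure.vars = cols2l,
             variable.name = "name", value.name = "value",
             variable.factor = FALSE)

# Nest: .SD excludes the grouping keys, so each nested table holds only 'value'.
keys <- c("name", by_cols)
nested_dt <- long[, .(data = list(copy(.SD))), by = keys]

# nest_type = "df" analogue: set the class attribute on a copy of .SD.
nested_df <- long[, .(data = list(setattr(copy(.SD), "class", "data.frame"))),
                  by = keys]

nested_dt
```

Each row of nested_dt pairs one name / Species combination with its own table of values, which is the shape later steps iterate over.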
A data.table with one row per combination of name (and by
levels, if provided). The data list-column holds the corresponding
nested data.table or data.frame for each group. Grouping key columns
are never duplicated inside the nested objects.
Passing an empty table (0 rows) triggers a warning() and returns
an empty nested data.table immediately.
cols2l and by must not overlap; overlapping columns will raise
an error.
nest_type values other than "dt" or "df" raise an error.
See also tidytable::nest_by() for a tidyverse-style equivalent.
    # Example: Wide to long format nesting demonstrations

    # Example 1: Basic nesting by group
    w2l_nest(
      data = iris,                 # Input dataset
      by = "Species"               # Group by Species column
    )

    # Example 2: Nest specific columns with numeric indices
    w2l_nest(
      data = iris,                 # Input dataset
      cols2l = 1:4,                # Select first 4 columns to nest
      by = "Species"               # Group by Species column
    )

    # Example 3: Nest specific columns with column names
    w2l_nest(
      data = iris,                 # Input dataset
      cols2l = c("Sepal.Length",   # Select columns by name
                 "Sepal.Width",
                 "Petal.Length"),
      by = 5                       # Group by column index 5 (Species)
    )
    # Returns similar structure to Example 2
w2l_split reshapes a wide-format data.frame or data.table into long
format, then splits the result into a named list keyed by the pivoted column
identifier (variable) and any optional grouping variables supplied via
by. List element names are derived directly from the grouping key
combinations produced by split(), guaranteeing name-to-content alignment.
    w2l_split(data, cols2l = NULL, by = NULL, split_type = "dt", sep = "_")
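| Argument | Description |
|---|---|
| data | A wide-format data.frame or data.table. |
| cols2l | Columns to pivot into long format, given as integer column positions or character column names. Must not overlap with by. Default is NULL. |
| by | Optional grouping columns, given as integer column positions or character column names. Default is NULL. |
| split_type | Class of each list element: "dt" (data.table) or "df" (data.frame). Default is "dt". |
| sep | String used to join the key values when building list element names. Default is "_". |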
Name safety: list names are produced by data.table's split() method (split.data.table) through its by argument, not reconstructed from raw row order. This eliminates the name-to-content misalignment that arises when unique() on the original data and split()'s internal sort order diverge.
Column resolution: both cols2l and by accept integer column
positions or character column names. Out-of-bounds indices and unknown names
are caught early with informative error messages.
Overlap guard: columns appearing in both cols2l and by raise an
error before melting to prevent id.vars / measure.vars conflicts.
Factor-free melting: melt() is called with variable.factor = FALSE
so the variable column is always character, keeping split() sort order
consistent with lexicographic expectations.
Memory efficiency:
- setDT() converts data.frame inputs by reference, with no full copy.
- For split_type = "df", setattr(copy(x), "class", "data.frame") copies each element once with copy() and then sets the class attribute by reference, so no separate as.data.frame() conversion pass is needed.
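The name-safety point can be illustrated with a short data.table sketch. It is an approximation of the approach described above, not the function's source; the object names (long, parts, aligned) are made up for the example.

```r
library(data.table)

dt <- as.data.table(iris)

# Factor-free melt: the variable column is character.
long <- melt(dt, id.vars = "Species",
             measure.vars = c("Sepal.Length", "Sepal.Width"),
             variable.name = "variable", value.name = "value",
             variable.factor = FALSE)

# Let split() build the element names from the grouping keys, joined by sep.
parts <- split(long, by = c("variable", "Species"), sep = "_")

names(parts)

# Check that every name matches the keys stored inside its element.
aligned <- vapply(names(parts), function(nm) {
  p <- parts[[nm]]
  identical(nm, paste(p$variable[1], p$Species[1], sep = "_"))
}, logical(1))
all(aligned)
```

Because the names come from the same grouping pass that builds the elements, there is no second ordering to keep in sync.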
A named list of data.table or data.frame objects (controlled by
split_type). Names reflect the key combination of variable (and by
levels if provided), joined by sep.
If by is NULL, the list is keyed by the pivoted column names only.
If by is specified, the list is keyed by variable and all by
level combinations.
An empty input table (0 rows) triggers a warning() and returns an
empty list immediately.
cols2l and by must not overlap; shared columns raise an error.
split_type values other than "dt" or "df" raise an error.
See also tidytable::group_split() for a tidyverse-style equivalent.
    # Example: Wide to long format splitting demonstrations

    # Example 1: Basic splitting by Species
    w2l_split(
      data = iris,                 # Input dataset
      by = "Species"               # Split by Species column
    ) |>
      lapply(head)                 # Show first 6 rows of each split

    # Example 2: Split specific columns using numeric indices
    w2l_split(
      data = iris,                 # Input dataset
      cols2l = 1:3,                # Select first 3 columns to split
      by = 5                       # Split by column index 5 (Species)
    ) |>
      lapply(head)                 # Show first 6 rows of each split

    # Example 3: Split specific columns using column names
    list_res <- w2l_split(
      data = iris,                 # Input dataset
      cols2l = c("Sepal.Length",   # Select columns by name
                 "Sepal.Width"),
      by = "Species"               # Split by Species column
    )
    lapply(list_res, head)         # Show first 6 rows of each split
    # Returns similar structure to Example 2