The dataSDA package (v0.2.4) collects symbolic
datasets drawn from a range of research themes and provides a comprehensive
set of functions for reading, writing, converting, and analyzing
symbolic data. The package is available on CRAN at https://CRAN.R-project.org/package=dataSDA and on GitHub
at https://github.com/hanmingwu1103/dataSDA.
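A minimal setup sketch: the CRAN call is the standard installation route, and the GitHub route assumes the commonly used remotes helper package.

```r
# Released version from CRAN
install.packages("dataSDA")

# Development version from GitHub (assumes the remotes package is installed)
# remotes::install_github("hanmingwu1103/dataSDA")

library(dataSDA)
```

All examples below assume the package has been attached with library(dataSDA).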
The package includes 114 datasets spanning seven types of symbolic data. Most dataset names carry a suffix that indicates the type:

| Type | Suffix | Datasets | Description |
|---|---|---|---|
| Interval | .int, .iGAP, .int.mm | 57 | Interval-valued data in RSDA (54), iGAP (2), and min-max (1) formats |
| Histogram | .hist | 25 | Histogram-valued distributional data |
| Mixed | .mix | 11 | Datasets combining interval and categorical variables |
| Interval Time Series | .its | 9 | Interval-valued time series data |
| Modal | .modal | 7 | Modal multi-valued symbolic data |
| Distributional | .distr | 3 | Distributional symbolic data |
| Other | (none) | 2 | Auxiliary datasets (bank_rates, hierarchy) |
| Total | | 114 | |
Broken down by individual suffix, the counts are:

| Type | Datasets |
|---|---|
| Interval (.int) | 54 |
| Histogram (.hist) | 25 |
| Mixed (.mix) | 11 |
| Interval Time Series (.its) | 9 |
| Modal (.modal) | 7 |
| Distributional (.distr) | 3 |
| Interval (.iGAP) | 2 |
| Other | 2 |
| Interval (.int.mm) | 1 |
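The bundled datasets can be enumerated and loaded with R's standard data() utility:

```r
# List every dataset shipped with dataSDA
data(package = "dataSDA")

# Load a specific dataset into the workspace by name
data(mushroom.int, package = "dataSDA")
```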
The package provides functions organized into the following categories:
| Category | Functions | Count |
|---|---|---|
| Format detection & conversion | int_detect_format, int_list_conversions, int_convert_format, RSDA_to_MM, iGAP_to_MM, SODAS_to_MM, MM_to_iGAP, RSDA_to_iGAP, SODAS_to_iGAP, MM_to_RSDA, iGAP_to_RSDA | 11 |
| Core statistics | int_mean, int_var, int_cov, int_cor | 4 |
| Geometric properties | int_width, int_radius, int_center, int_midrange, int_overlap, int_containment | 6 |
| Position & scale | int_median, int_quantile, int_range, int_iqr, int_mad, int_mode | 6 |
| Robust statistics | int_trimmed_mean, int_winsorized_mean, int_trimmed_var, int_winsorized_var | 4 |
| Distribution shape | int_skewness, int_kurtosis, int_symmetry, int_tailedness | 4 |
| Similarity measures | int_jaccard, int_dice, int_cosine, int_overlap_coefficient, int_tanimoto, int_similarity_matrix | 6 |
| Uncertainty & variability | int_entropy, int_cv, int_dispersion, int_imprecision, int_granularity, int_uniformity, int_information_content | 7 |
| Distance measures | int_dist, int_dist_matrix, int_pairwise_dist, int_dist_all | 4 |
| Histogram statistics | hist_mean, hist_var, hist_cov, hist_cor | 4 |
| Utilities | clean_colnames, RSDA_format, set_variable_format, read_symbolic_csv, write_symbolic_csv | 5 |
The dataSDA package works with three primary formats for
interval-valued data:

- RSDA: symbolic_tbl objects where intervals are encoded as complex numbers (min + max*i). Used by the RSDA package.
- MM (min-max): plain data frames with paired _min / _max columns for each variable.
- iGAP: data frames in which each interval is stored as a comma-separated string (e.g., "2.5,4.0").

An RSDA-format example:

data(mushroom.int)
head(mushroom.int, 3)
#> # A tibble: 3 × 5
#> Species Pileus.Cap.Width Stipe.Length Stipe.Thickness Edibility
#> <chr> <symblc_n> <symblc_n> <symblc_n> <chr>
#> 1 arorae [3.00 : 8.00] [4.00 : 9.00] [0.50 : 2.50] U
#> 2 arvenis [6.00 : 21.00] [4.00 : 14.00] [1.00 : 3.50] Y
#> 3 benesi [4.00 : 8.00] [5.00 : 11.00] [1.00 : 2.00] Y
class(mushroom.int)
#> [1] "symbolic_tbl" "tbl_df" "tbl" "data.frame"

In the MM (min-max) format, the same kind of data is stored as plain numeric _min / _max columns:

data(abalone.int)
head(abalone.int, 3)
#> Length_min Length_max Diameter_min Diameter_max Height_min Height_max
#> F-10-12 0.1275 0.9975 0.075 0.815 -0.0175 0.3125
#> F-13-15 0.1775 1.0275 0.125 0.825 0.025 0.325
#> F-16-18 0.22 0.92 0.1725 0.7425 0.0375 0.3075
#> Whole_min Whole_max Shucked_min Shucked_max Viscera_min Viscera_max
#> F-10-12 -1.021 3.883 -0.6322 2.1948 -0.2077 0.7712
#> F-13-15 -0.8567 3.6303 -0.4548 1.7942 -0.1905 0.7555
#> F-16-18 -0.5725 3.1235 -0.244 1.206 -0.1037 0.6752
#> Shell_min Shell_max
#> F-10-12 -0.258 1.054
#> F-13-15 -0.269 1.153
#> F-16-18 -0.3233 1.4477
class(abalone.int)
#> [1] "data.frame"

In the iGAP format, each interval is a single comma-separated string:

data(abalone.iGAP)
head(abalone.iGAP, 3)
#> Length Diameter Height Whole
#> F-10-12 0.1275,0.9975 0.075, 0.815 -0.0175, 0.3125 -1.021, 3.883
#> F-13-15 0.1775,1.0275 0.125,0.825 0.025, 0.325 -0.8567, 3.6303
#> F-16-18 0.22,0.92 0.1725, 0.7425 0.0375, 0.3075 -0.5725, 3.1235
#> Shucked Viscera Shell
#> F-10-12 -0.6322, 2.1948 -0.2077, 0.7712 -0.258, 1.054
#> F-13-15 -0.4548, 1.7942 -0.1905, 0.7555 -0.269, 1.153
#> F-16-18 -0.244, 1.206 -0.1037, 0.6752 -0.3233, 1.4477
class(abalone.iGAP)
#> [1] "data.frame"

The int_detect_format() function automatically
identifies the format of a dataset:
int_detect_format(mushroom.int)
#> [1] "RSDA"
int_detect_format(abalone.int)
#> [1] "MM"
int_detect_format(abalone.iGAP)
#> [1] "iGAP"

Use int_list_conversions() to see all available format
conversion paths:
int_list_conversions()
#> from to function_name
#> 1 RSDA MM RSDA_to_MM
#> 2 RSDA iGAP RSDA_to_iGAP
#> 3 RSDA ARRAY RSDA_to_ARRAY
#> 4 MM iGAP MM_to_iGAP
#> 5 MM RSDA MM_to_RSDA
#> 6 MM ARRAY MM_to_ARRAY
#> 7 iGAP MM iGAP_to_MM
#> 8 iGAP RSDA iGAP_to_RSDA
#> 9 iGAP ARRAY iGAP_to_ARRAY
#> 10 ARRAY RSDA ARRAY_to_RSDA
#> 11 ARRAY MM ARRAY_to_MM
#> 12 ARRAY iGAP ARRAY_to_iGAP
#> 13 SODAS MM SODAS_to_MM
#> 14 SODAS iGAP SODAS_to_iGAP
#> 15 SODAS ARRAY SODAS_to_ARRAY

The int_convert_format() function provides a unified
interface for converting between formats. It auto-detects the source
format and applies the appropriate conversion:
# RSDA to MM
mushroom.MM <- int_convert_format(mushroom.int, to = "MM")
head(mushroom.MM, 3)
#> Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1 arorae 3 8 4
#> 2 arvenis 6 21 4
#> 3 benesi 4 8 5
#> Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1 9 0.5 2.5 U
#> 2 14 1.0 3.5 Y
#> 3 11 1.0 2.0 Y

# iGAP to MM
abalone.MM <- int_convert_format(abalone.iGAP, to = "MM")
head(abalone.MM, 3)
#> Length_min Length_max Diameter_min Diameter_max Height_min Height_max
#> F-10-12 0.1275 0.9975 0.075 0.815 -0.0175 0.3125
#> F-13-15 0.1775 1.0275 0.125 0.825 0.025 0.325
#> F-16-18 0.22 0.92 0.1725 0.7425 0.0375 0.3075
#> Whole_min Whole_max Shucked_min Shucked_max Viscera_min Viscera_max
#> F-10-12 -1.021 3.883 -0.6322 2.1948 -0.2077 0.7712
#> F-13-15 -0.8567 3.6303 -0.4548 1.7942 -0.1905 0.7555
#> F-16-18 -0.5725 3.1235 -0.244 1.206 -0.1037 0.6752
#> Shell_min Shell_max
#> F-10-12 -0.258 1.054
#> F-13-15 -0.269 1.153
#> F-16-18 -0.3233 1.4477

# iGAP to RSDA
data(face.iGAP)
face.RSDA <- int_convert_format(face.iGAP, to = "RSDA")
head(face.RSDA, 3)
#> AD BC AH DH EH
#> 1 155.00+157.00i 58+61.01i 100.45+103.28i 105.00+107.30i 61.40+65.73i
#> 2 154.00+160.01i 57+64.00i 101.98+105.55i 104.35+107.30i 60.88+63.03i
#> 3 154.01+161.00i 57+63.00i 99.36+105.65i 101.04+109.04i 60.95+65.60i
#> GH
#> 1 64.20+67.80i
#> 2 62.94+66.47i
#> 3 60.42+66.40i

For explicit control, direct conversion functions are available:
# RSDA to MM
mushroom.MM <- RSDA_to_MM(mushroom.int, RSDA = TRUE)
head(mushroom.MM, 3)
#> Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1 arorae 3 8 4
#> 2 arvenis 6 21 4
#> 3 benesi 4 8 5
#> Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1 9 0.5 2.5 U
#> 2 14 1.0 3.5 Y
#> 3 11 1.0 2.0 Y

# MM to iGAP
mushroom.iGAP <- MM_to_iGAP(mushroom.MM)
head(mushroom.iGAP, 3)
#> Species Pileus.Cap.Width Stipe.Length Stipe.Thickness Edibility
#> 1 arorae 3,8 4,9 0.5,2.5 U
#> 2 arvenis 6,21 4,14 1,3.5 Y
#> 3 benesi 4,8 5,11 1,2 Y

# iGAP to MM
data(face.iGAP)
face.MM <- iGAP_to_MM(face.iGAP, location = 1:6)
head(face.MM, 3)
#> AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max
#> FRA1 155.00 157.00 58.00 61.01 100.45 103.28 105.00 107.30 61.40 65.73
#> FRA2 154.00 160.01 57.00 64.00 101.98 105.55 104.35 107.30 60.88 63.03
#> FRA3 154.01 161.00 57.00 63.00 99.36 105.65 101.04 109.04 60.95 65.60
#> GH_min GH_max
#> FRA1 64.20 67.80
#> FRA2 62.94 66.47
#> FRA3 60.42 66.40

# MM to RSDA
face.RSDA <- MM_to_RSDA(face.MM)
head(face.RSDA, 3)
#> AD BC AH DH EH
#> 1 155.00+157.00i 58+61.01i 100.45+103.28i 105.00+107.30i 61.40+65.73i
#> 2 154.00+160.01i 57+64.00i 101.98+105.55i 104.35+107.30i 60.88+63.03i
#> 3 154.01+161.00i 57+63.00i 99.36+105.65i 101.04+109.04i 60.95+65.60i
#> GH
#> 1 64.20+67.80i
#> 2 62.94+66.47i
#> 3 60.42+66.40i
class(face.RSDA)
#> [1] "symbolic_tbl" "data.frame"

# iGAP to RSDA (direct, one-step)
abalone.RSDA <- iGAP_to_RSDA(abalone.iGAP, location = 1:7)
head(abalone.RSDA, 3)
#> Length Diameter Height Whole Shucked
#> 1 0.1275+0.9975i 0.0750+0.8150i -0.0175+0.3125i -1.0210+3.8830i -0.6322+2.1948i
#> 2 0.1775+1.0275i 0.1250+0.8250i 0.0250+0.3250i -0.8567+3.6303i -0.4548+1.7942i
#> 3 0.2200+0.9200i 0.1725+0.7425i 0.0375+0.3075i -0.5725+3.1235i -0.2440+1.2060i
#> Viscera Shell
#> 1 -0.2077+0.7712i -0.2580+1.0540i
#> 2 -0.1905+0.7555i -0.2690+1.1530i
#> 3 -0.1037+0.6752i -0.3233+1.4477i
class(abalone.RSDA)
#> [1] "symbolic_tbl" "data.frame"

# RSDA to iGAP
mushroom.iGAP2 <- RSDA_to_iGAP(mushroom.int)
head(mushroom.iGAP2, 3)
#> Species Pileus.Cap.Width Stipe.Length Stipe.Thickness Edibility
#> 1 arorae 3,8 4,9 0.5,2.5 U
#> 2 arvenis 6,21 4,14 1,3.5 Y
#> 3 benesi 4,8 5,11 1,2 Y

The SODAS_to_MM() and SODAS_to_iGAP()
functions convert SODAS XML files but require an XML file path and are
not demonstrated here.
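For reference, a hypothetical invocation is sketched below (not run); the file path is a placeholder and any further arguments are assumptions:

```r
# Not run: convert a SODAS XML export to MM or iGAP format
# sodas.MM   <- SODAS_to_MM("path/to/export.xml")
# sodas.iGAP <- SODAS_to_iGAP("path/to/export.xml")
```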
The traditional workflow for converting a raw data frame into the
symbolic_tbl class used by RSDA involves
several steps. We illustrate with the mushroom dataset,
which contains 23 species described by 3 interval-valued variables and 2
categorical variables.
data(mushroom.int.mm)
head(mushroom.int.mm, 3)
#> Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1 arorae 3 8 4
#> 2 arvenis 6 21 4
#> 3 benesi 4 8 5
#> Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1 9 0.5 2.5 U
#> 2 14 1.0 3.5 Y
#> 3 11 1.0 2.0 Y

First, use set_variable_format() to create
pseudo-variables for each category using one-hot encoding:
mushroom_set <- set_variable_format(data = mushroom.int.mm, location = 8,
var = "Species")
head(mushroom_set, 3)
#> Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 23 1 0 0 0 0 0 0
#> 2 23 0 1 0 0 0 0 0
#> 3 23 0 0 1 0 0 0 0
#> campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> perobscurus semotus silvicola subrutilescens xanthodermus
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min Stipe.Length_max
#> 1 3 8 4 9
#> 2 6 21 4 14
#> 3 4 8 5 11
#> Stipe.Thickness_min Stipe.Thickness_max Edibility U Y T
#> 1 0.5 2.5 3 1 0 0
#> 2 1.0 3.5 3 0 1 0
#> 3 1.0 2.0

Next, apply RSDA_format() to prefix each variable with
$I (interval) or $S (set) tags:
mushroom_tmp <- RSDA_format(data = mushroom_set,
sym_type1 = c("I", "I", "I", "S"),
location = c(25, 27, 29, 31),
sym_type2 = c("S"),
var = c("Species"))
head(mushroom_tmp, 3)
#> $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S 23 1 0 0 0 0 0 0
#> 2 $S 23 0 1 0 0 0 0 0
#> 3 $S 23 0 0 1 0 0 0 0
#> campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> perobscurus semotus silvicola subrutilescens xanthodermus $I
#> 1 0 0 0 0 0 $I
#> 2 0 0 0 0 0 $I
#> 3 0 0 0 0 0 $I
#> Pileus.Cap.Width_min Pileus.Cap.Width_max $I Stipe.Length_min
#> 1 3 8 $I 4
#> 2 6 21 $I 4
#> 3 4 8 $I 5
#> Stipe.Length_max $I Stipe.Thickness_min Stipe.Thickness_max $S Edibility U Y
#> 1 9 $I 0.5 2.5 $S 3 1 0
#> 2 14 $I 1.0 3.5 $S 3 0 1
#> 3 11 $I 1.0 2.0 $S 3 0 1
#> T
#> 1 0
#> 2 0
#> 3 0

Clean up variable names with clean_colnames() and write
to CSV with write_symbolic_csv():
mushroom_clean <- clean_colnames(data = mushroom_tmp)
head(mushroom_clean, 3)
#> $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S 23 1 0 0 0 0 0 0
#> 2 $S 23 0 1 0 0 0 0 0
#> 3 $S 23 0 0 1 0 0 0 0
#> campestris comtulus cupreo-brunneus dutives fuseo-fibrillosus fuscovelatus
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> hondensis lilaceps micromegathus praeclaresquamosus pattersonae perobscurus
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> semotus silvicola subrutilescens xanthodermus $I Pileus.Cap.Width
#> 1 0 0 0 0 $I 3
#> 2 0 0 0 0 $I 6
#> 3 0 0 0 0 $I 4
#> Pileus.Cap.Width $I Stipe.Length Stipe.Length $I Stipe.Thickness
#> 1 8 $I 4 9 $I 0.5
#> 2 21 $I 4 14 $I 1.0
#> 3 8 $I 5 11 $I 1.0
#> Stipe.Thickness $S Edibility U Y T
#> 1 2.5 $S 3 1 0 0
#> 2 3.5 $S 3 0 1 0
#> 3 2.0 $S 3 0 1 0

write_symbolic_csv(mushroom_clean, file = "mushroom_interval.csv")
mushroom_int <- read_symbolic_csv(file = "mushroom_interval.csv")
head(mushroom_int, 3)
#> $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S 23 1 0 0 0 0 0 0
#> 2 $S 23 0 1 0 0 0 0 0
#> 3 $S 23 0 0 1 0 0 0 0
#> campestris comtulus cupreo-brunneus dutives fuseo-fibrillosus fuscovelatus
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> hondensis lilaceps micromegathus praeclaresquamosus pattersonae perobscurus
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 0 0 0 0 0 0
#> semotus silvicola subrutilescens xanthodermus $I Pileus.Cap.Width
#> 1 0 0 0 0 $I 3
#> 2 0 0 0 0 $I 6
#> 3 0 0 0 0 $I 4
#> Pileus.Cap.Width.1 $I.1 Stipe.Length Stipe.Length.1 $I.2 Stipe.Thickness
#> 1 8 $I 4 9 $I 0.5
#> 2 21 $I 4 14 $I 1.0
#> 3 8 $I 5 11 $I 1.0
#> Stipe.Thickness.1 $S.1 Edibility U Y T
#> 1 2.5 $S 3 1 0 0
#> 2 3.5 $S 3 0 1 0
#> 3 2.0 $S 3 0 1 0
class(mushroom_int)
#> [1] "data.frame"

Note: The MM_to_RSDA() function provides a simpler
one-step alternative to this workflow.
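Applied to the MM-format mushroom data created earlier with RSDA_to_MM(), the one-step route is a single call (a sketch; handling of the categorical Species and Edibility columns may require additional arguments):

```r
# One-step alternative to the set_variable_format()/RSDA_format() pipeline
mushroom_rsda <- MM_to_RSDA(mushroom.MM)
```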
Histogram-valued data uses the MatH class from the
HistDAWass package. The built-in BLOOD dataset
is a MatH object with 14 patient groups and 3
distributional variables:
BLOOD[1:3, 1:2]
#> a matrix of distributions
#> 2 variables 3 rows
#> each distibution in the cell is represented by the mean and the standard deviation
#> Cholesterol Hemoglobin
#> u1: F-20 [m= 150.1 ,s= 26.336 ] [m= 13.695 ,s= 0.55031 ]
#> u2: F-30 [m= 150.71 ,s= 25.284 ] [m= 12.158 ,s= 0.52834 ]
#> u3: F-40 [m= 164.96 ,s= 25.334 ] [m= 12.134 ,s= 0.50739 ]

Below we illustrate constructing a MatH object from raw
histogram data:
A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
A2 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B2 <- c(0.00, 0.05, 0.12, 0.42, 0.68, 0.88, 0.94, 1.00)
A3 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B3 <- c(0.00, 0.03, 0.24, 0.36, 0.75, 0.85, 0.98, 1.00)
ListOfWeight <- list(
distributionH(A1, B1),
distributionH(A2, B2),
distributionH(A3, B3)
)
Weight <- methods::new("MatH",
nrows = 3, ncols = 1, ListOfDist = ListOfWeight,
names.rows = c("20s", "30s", "40s"),
names.cols = c("weight"), by.row = FALSE)
Weight
#> a matrix of distributions
#> 1 variables 3 rows
#> each distibution in the cell is represented by the mean and the standard deviation
#> weight
#> 20s [m= 86.8 ,s= 13.824 ]
#> 30s [m= 84.1 ,s= 14.44 ]
#> 40s [m= 82.9 ,s= 14.385 ]

Many dataSDA functions accept a method
parameter that determines how interval boundaries are used in
computations. The eight available methods (Wu, Kao and Chen, 2020)
are:
| Method | Name | Description |
|---|---|---|
| CM | Center Method | Uses the midpoint (center) of each interval |
| VM | Vertices Method | Uses both endpoints of the intervals |
| QM | Quantile Method | Uses a quantile-based representation |
| SE | Stacked Endpoints Method | Stacks the lower and upper values of an interval |
| FV | Fitted Values Method | Fits a linear regression model |
| EJD | Empirical Joint Density Method | Joint distribution of lower and upper bounds |
| GQ | Symbolic Covariance Method | Alternative expression of the symbolic sample variance |
| SPT | Total Sum of Products | Based on a decomposition of the total sum of products (SPT) |
The core statistical functions int_mean,
int_var, int_cov, and int_cor
compute descriptive statistics for interval-valued data across any
combination of the eight methods.
As a quick demonstration, we compute the mean and variance of Pileus.Cap.Width and
Stipe.Length in the mushroom.int dataset using
all eight interval methods.
data(mushroom.int)
var_name <- c("Pileus.Cap.Width", "Stipe.Length")
method <- c("CM", "VM", "QM", "SE", "FV", "EJD", "GQ", "SPT")
mean_mat <- int_mean(mushroom.int, var_name, method)
mean_mat
#> Pileus.Cap.Width Stipe.Length
#> CM 7.978261 7.391304
#> VM 7.978261 7.391304
#> QM 7.978261 7.391304
#> SE 7.978261 7.391304
#> FV 11.239130 10.304348
#> EJD 7.978261 7.391304
#> GQ 7.978261 7.391304
#> SPT 7.978261 7.391304
var_mat <- int_var(mushroom.int, var_name, method)
var_mat
#> Pileus.Cap.Width Stipe.Length
#> CM 11.46542 9.544466
#> VM 25.74677 19.911132
#> QM 18.37672 14.538520
#> SE 26.03285 20.132367
#> FV 15.65881 13.858573
#> EJD 15.80025 12.651229
#> GQ 15.80025 12.651229
#> SPT 15.80025 12.651229

The means are identical across most methods because methods other than FV operate on the same midpoint or boundary values; only FV (which regresses upper bounds on lower bounds) produces a different mean. In contrast, the variances differ substantially across methods, reflecting how each method weights interval width and position.
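The agreement of the CM and VM means (and the divergence of their variances) can be checked with base R on the first three Pileus.Cap.Width intervals shown earlier. This is a sketch using the textbook CM and VM definitions, so dataSDA's internal denominators may differ slightly:

```r
# First three Pileus.Cap.Width intervals: [3,8], [6,21], [4,8]
a <- c(3, 6, 4)    # lower bounds
b <- c(8, 21, 8)   # upper bounds

# CM: work with interval centers
centers <- (a + b) / 2
mean(centers)      # 8.333...

# VM: stack both endpoints of every interval
endpoints <- c(a, b)
mean(endpoints)    # 8.333... -- identical, since mean(c(a, b)) == mean((a + b) / 2)

var(centers)       # ~20.1
var(endpoints)     # ~42.7 -- larger, because endpoint spread adds interval width
```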
cols <- c("#4E79A7", "#F28E2B")
par(mfrow = c(2, 1), mar = c(5, 4, 3, 6), las = 2, xpd = TRUE)
# --- Mean across eight methods ---
bp <- barplot(t(mean_mat), beside = TRUE, col = cols,
main = "Interval Mean by Method (mushroom.int)",
ylab = "Mean",
ylim = c(0, max(mean_mat) * 1.25))
legend("topright", inset = c(-0.18, 0),
legend = colnames(mean_mat), fill = cols, bty = "n", cex = 0.85)
# --- Variance across eight methods ---
bp <- barplot(t(var_mat), beside = TRUE, col = cols,
main = "Interval Variance by Method (mushroom.int)",
ylab = "Variance",
ylim = c(0, max(var_mat) * 1.25))
legend("topright", inset = c(-0.18, 0),
              legend = colnames(var_mat), fill = cols, bty = "n", cex = 0.85)

We compute the covariance and correlation between
Pileus.Cap.Width and Stipe.Length across all
eight methods. Note that EJD, GQ, and SPT methods require character
variable names (not numeric indices).
cov_list <- int_cov(mushroom.int, "Pileus.Cap.Width", "Stipe.Length", method)
cor_list <- int_cor(mushroom.int, "Pileus.Cap.Width", "Stipe.Length", method)
# Collect scalar values into named vectors for display and plotting
cov_vec <- sapply(cov_list, function(x) x[1, 1])
cor_vec <- sapply(cor_list, function(x) x[1, 1])
data.frame(Method = names(cov_vec), Covariance = round(cov_vec, 4),
Correlation = round(cor_vec, 4), row.names = NULL)
#> Method Covariance Correlation
#> 1 CM 8.4180 0.8047
#> 2 VM 8.1405 0.3556
#> 3 QM 14.0546 0.8599
#> 4 SE 20.2531 0.8847
#> 5 FV 11.3781 0.7724
#> 6 EJD 8.0520 0.5695
#> 7 GQ 11.4609 0.8106
#> 8 SPT 11.9723 0.8468

The SE method yields the largest covariance because it doubles the effective sample size by stacking both endpoints, amplifying joint variation. VM produces the lowest correlation (0.36) because the vertex expansion introduces \(2^p\) combinations per observation, many of which are non-informative.
par(mfrow = c(2, 1), mar = c(5, 4, 3, 1), las = 2)
# --- Covariance across eight methods ---
bar_cols <- c("#4E79A7", "#59A14F", "#F28E2B", "#E15759",
"#76B7B2", "#EDC948", "#B07AA1", "#FF9DA7")
bp <- barplot(cov_vec, col = bar_cols, border = NA,
main = "Cov(Pileus.Cap.Width, Stipe.Length) by Method",
ylab = "Covariance",
ylim = c(0, max(cov_vec) * 1.25))
text(bp, cov_vec, labels = round(cov_vec, 2), pos = 3, cex = 0.8)
# --- Correlation across eight methods ---
bp <- barplot(cor_vec, col = bar_cols, border = NA,
main = "Cor(Pileus.Cap.Width, Stipe.Length) by Method",
ylab = "Correlation",
ylim = c(0, 1.15))
text(bp, cor_vec, labels = round(cor_vec, 2), pos = 3, cex = 0.8)
abline(h = 1, lty = 2, col = "grey50")

Geometric functions characterize the shape and spatial properties of individual intervals and relationships between interval variables.
data(mushroom.int)
# Width = upper - lower
head(int_width(mushroom.int, "Stipe.Length"))
#> Stipe.Length
#> 1 5
#> 2 10
#> 3 6
#> 4 3
#> 5 3
#> 6 6
# Radius = width / 2
head(int_radius(mushroom.int, "Stipe.Length"))
#> Stipe.Length
#> 1 2.5
#> 2 5.0
#> 3 3.0
#> 4 1.5
#> 5 1.5
#> 6 3.0
# Center = (lower + upper) / 2
head(int_center(mushroom.int, "Stipe.Length"))
#> Stipe.Length
#> 1 6.5
#> 2 9.0
#> 3 8.0
#> 4 5.5
#> 5 3.5
#> 6 7.0
# Midrange
head(int_midrange(mushroom.int, "Stipe.Length"))
#> Stipe.Length
#> 1 2.5
#> 2 5.0
#> 3 3.0
#> 4 1.5
#> 5 1.5
#> 6 3.0

These functions measure the degree to which intervals from two variables overlap or contain each other, observation by observation:
# Overlap between two interval variables
head(int_overlap(mushroom.int, "Stipe.Length", "Stipe.Thickness"))
#> Stipe.Length_Stipe.Thickness
#> 1 0.0000000
#> 2 0.0000000
#> 3 0.0000000
#> 4 0.1250000
#> 5 0.1428571
#> 6 0.0000000
# Containment: whether each Stipe.Length interval lies within Stipe.Thickness (logical)
head(int_containment(mushroom.int, "Stipe.Length", "Stipe.Thickness"))
#> Stipe.Length_in_Stipe.Thickness
#> 1 FALSE
#> 2 FALSE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE

Position and scale statistics summarize where intervals lie and how spread out they are:

data(mushroom.int)
# Median (default method = "CM")
int_median(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> CM 7
# Quantiles
int_quantile(mushroom.int, "Stipe.Length", probs = c(0.25, 0.5, 0.75))
#> $CM
#> Stipe.Length
#> 25% 4.75
#> 50% 7.00
#> 75% 9.25
# Compare median across methods
int_median(mushroom.int, "Stipe.Length", method = c("CM", "FV"))
#> Stipe.Length
#> CM 7.000000
#> FV 9.405092

# Range (max - min)
int_range(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> CM 11.5
# Interquartile range (Q3 - Q1)
int_iqr(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> IQR 4.5
# Median absolute deviation
int_mad(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> CM 2.5
# Mode (histogram-based estimation)
int_mode(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> CM 8.75

Robust statistics reduce the influence of outliers by trimming or winsorizing extreme values.
data(mushroom.int)
# Compare standard mean vs trimmed mean (10% trim)
int_mean(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM 7.391304
int_trimmed_mean(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")
#> Stipe.Length
#> CM 7.289474
# Winsorized mean: extreme values are replaced (not removed)
int_winsorized_mean(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")
#> Stipe.Length
#> CM 7.282609

Shape functions characterize the distribution of interval-valued data.
data(mushroom.int)
# Skewness: asymmetry of the distribution
int_skewness(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM 0.2228348
# Kurtosis: tail heaviness
int_kurtosis(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM -1.065302
# Symmetry coefficient
int_symmetry(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM 0.800247
# Tailedness (related to kurtosis)
int_tailedness(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM -1.065302

Similarity functions quantify how alike two interval variables are across all observations. Available measures include Jaccard, Dice, cosine, and the overlap coefficient.
data(mushroom.int)
int_jaccard(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#> Stipe.Length_Stipe.Thickness
#> 1 0.0000000
#> 2 0.0000000
#> 3 0.0000000
#> 4 0.1250000
#> 5 0.1428571
#> 6 0.0000000
#> 7 0.0000000
#> 8 0.0000000
#> 9 0.0000000
#> 10 0.0000000
#> 11 0.0000000
#> 12 0.0000000
#> 13 0.0000000
#> 14 0.0000000
#> 15 0.0000000
#> 16 0.0000000
#> 17 0.0000000
#> 18 0.0000000
#> 19 0.0000000
#> 20 0.0000000
#> 21 0.0000000
#> 22 0.0000000
#> 23 0.0000000
int_dice(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#> Stipe.Length_Stipe.Thickness
#> 1 0.0000000
#> 2 0.0000000
#> 3 0.0000000
#> 4 0.2222222
#> 5 0.2500000
#> 6 0.0000000
#> 7 0.0000000
#> 8 0.0000000
#> 9 0.0000000
#> 10 0.0000000
#> 11 0.0000000
#> 12 0.0000000
#> 13 0.0000000
#> 14 0.0000000
#> 15 0.0000000
#> 16 0.0000000
#> 17 0.0000000
#> 18 0.0000000
#> 19 0.0000000
#> 20 0.0000000
#> 21 0.0000000
#> 22 0.0000000
#> 23 0.0000000
int_cosine(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#> Stipe.Length_Stipe.Thickness
#> Cosine 0.9257023
int_overlap_coefficient(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#> Stipe.Length_Stipe.Thickness
#> 1 0.0000000
#> 2 0.0000000
#> 3 0.0000000
#> 4 0.3333333
#> 5 0.5000000
#> 6 0.0000000
#> 7 0.0000000
#> 8 0.0000000
#> 9 0.0000000
#> 10 0.0000000
#> 11 0.0000000
#> 12 0.0000000
#> 13 0.0000000
#> 14 0.0000000
#> 15 0.0000000
#> 16 0.0000000
#> 17 0.0000000
#> 18 0.0000000
#> 19 0.0000000
#> 20 0.0000000
#> 21 0.0000000
#> 22 0.0000000
#> 23 0.0000000

Note: int_tanimoto() is equivalent to
int_jaccard() for interval-valued data:
int_tanimoto(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#> Stipe.Length_Stipe.Thickness
#> 1 0.0000000
#> 2 0.0000000
#> 3 0.0000000
#> 4 0.1250000
#> 5 0.1428571
#> 6 0.0000000
#> 7 0.0000000
#> 8 0.0000000
#> 9 0.0000000
#> 10 0.0000000
#> 11 0.0000000
#> 12 0.0000000
#> 13 0.0000000
#> 14 0.0000000
#> 15 0.0000000
#> 16 0.0000000
#> 17 0.0000000
#> 18 0.0000000
#> 19 0.0000000
#> 20 0.0000000
#> 21 0.0000000
#> 22 0.0000000
#> 23 0.0000000

The int_similarity_matrix() function computes a pairwise
similarity matrix across all interval variables:
int_similarity_matrix(mushroom.int, method = "jaccard")
#> 1 2 3 4 5 6 7
#> 1 1.0000000 0.37037037 0.62380952 0.26666667 0.32539683 0.40873016 0.41269841
#> 2 0.3703704 1.00000000 0.37254902 0.16984127 0.28611111 0.55416667 0.18894831
#> 3 0.6238095 0.37254902 1.00000000 0.17857143 0.23611111 0.32900433 0.27380952
#> 4 0.2666667 0.16984127 0.17857143 1.00000000 0.11428571 0.33333333 0.29761905
#> 5 0.3253968 0.28611111 0.23611111 0.11428571 1.00000000 0.34166667 0.38333333
#> 6 0.4087302 0.55416667 0.32900433 0.33333333 0.34166667 1.00000000 0.32467532
#> 7 0.4126984 0.18894831 0.27380952 0.29761905 0.38333333 0.32467532 1.00000000
#> 8 0.4206349 0.27727273 0.54166667 0.23333333 0.51587302 0.26190476 0.48809524
#> 9 0.1479076 0.03030303 0.00000000 0.08333333 0.22222222 0.04761905 0.33333333
#> 10 0.2651515 0.06666667 0.28787879 0.00000000 0.17794486 0.02666667 0.10873440
#> 11 0.1111111 0.06060606 0.04166667 0.16666667 0.16666667 0.09523810 0.25000000
#> 12 0.4292929 0.61283422 0.41414141 0.12121212 0.57109557 0.55151515 0.29545455
#> 13 0.7444444 0.37142857 0.86772487 0.24074074 0.27042484 0.42028986 0.32063492
#> 14 0.2303030 0.48888889 0.25555556 0.00000000 0.51851852 0.36666667 0.13333333
#> 15 0.0000000 0.41944444 0.04761905 0.25000000 0.08888889 0.28888889 0.06250000
#> 16 0.1179931 0.01449275 0.00000000 0.03703704 0.22222222 0.02222222 0.27777778
#> 17 0.1066919 0.64848485 0.12222222 0.06666667 0.20238095 0.50108225 0.08888889
#> 18 0.1742424 0.56325758 0.25757576 0.14696970 0.23333333 0.62121212 0.20959596
#> 19 0.2083333 0.35555556 0.40476190 0.04166667 0.35714286 0.30000000 0.16203704
#> 20 0.3809524 0.09090909 0.19444444 0.25000000 0.16666667 0.16849817 0.62962963
#> 21 0.2824074 0.40000000 0.48809524 0.09722222 0.45238095 0.36666667 0.24537037
#> 22 0.3240741 0.48888889 0.56818182 0.08333333 0.39682540 0.31111111 0.23397436
#> 23 0.4047619 0.89583333 0.41025641 0.17539683 0.35555556 0.64444444 0.24475524
#> 8 9 10 11 12 13 14
#> 1 0.42063492 0.14790765 0.26515152 0.11111111 0.42929293 0.74444444 0.2303030
#> 2 0.27727273 0.03030303 0.06666667 0.06060606 0.61283422 0.37142857 0.4888889
#> 3 0.54166667 0.00000000 0.28787879 0.04166667 0.41414141 0.86772487 0.2555556
#> 4 0.23333333 0.08333333 0.00000000 0.16666667 0.12121212 0.24074074 0.0000000
#> 5 0.51587302 0.22222222 0.17794486 0.16666667 0.57109557 0.27042484 0.5185185
#> 6 0.26190476 0.04761905 0.02666667 0.09523810 0.55151515 0.42028986 0.3666667
#> 7 0.48809524 0.33333333 0.10873440 0.25000000 0.29545455 0.32063492 0.1333333
#> 8 1.00000000 0.22222222 0.24814815 0.33333333 0.31818182 0.58241758 0.2222222
#> 9 0.22222222 1.00000000 0.19047619 0.22222222 0.02777778 0.07792208 0.0000000
#> 10 0.24814815 0.19047619 1.00000000 0.03703704 0.05333333 0.31818182 0.0000000
#> 11 0.33333333 0.22222222 0.03703704 1.00000000 0.05555556 0.09523810 0.0000000
#> 12 0.31818182 0.02777778 0.05333333 0.05555556 1.00000000 0.40887132 0.7272727
#> 13 0.58241758 0.07792208 0.31818182 0.09523810 0.40887132 1.00000000 0.2095238
#> 14 0.22222222 0.00000000 0.00000000 0.00000000 0.72727273 0.20952381 1.0000000
#> 15 0.04444444 0.00000000 0.00000000 0.00000000 0.27916667 0.02222222 0.3053613
#> 16 0.14285714 0.86666667 0.25396825 0.14285714 0.01333333 0.05252525 0.0000000
#> 17 0.07142857 0.00000000 0.00000000 0.00000000 0.47323232 0.08211144 0.5634921
#> 18 0.16666667 0.00000000 0.02666667 0.00000000 0.57575758 0.20816864 0.4555556
#> 19 0.26190476 0.00000000 0.00000000 0.00000000 0.46969697 0.33333333 0.5238095
#> 20 0.29166667 0.54166667 0.32196970 0.28703704 0.13461538 0.28174603 0.0000000
#> 21 0.35714286 0.00000000 0.00000000 0.00000000 0.53030303 0.41176471 0.5416667
#> 22 0.52380952 0.00000000 0.16666667 0.00000000 0.54292929 0.52287582 0.5194444
#> 23 0.33282828 0.03030303 0.08965517 0.06060606 0.69277389 0.40740741 0.5277778
#> 15 16 17 18 19 20 21
#> 1 0.00000000 0.11799312 0.10669192 0.17424242 0.20833333 0.38095238 0.28240741
#> 2 0.41944444 0.01449275 0.64848485 0.56325758 0.35555556 0.09090909 0.40000000
#> 3 0.04761905 0.00000000 0.12222222 0.25757576 0.40476190 0.19444444 0.48809524
#> 4 0.25000000 0.03703704 0.06666667 0.14696970 0.04166667 0.25000000 0.09722222
#> 5 0.08888889 0.22222222 0.20238095 0.23333333 0.35714286 0.16666667 0.45238095
#> 6 0.28888889 0.02222222 0.50108225 0.62121212 0.30000000 0.16849817 0.36666667
#> 7 0.06250000 0.27777778 0.08888889 0.20959596 0.16203704 0.62962963 0.24537037
#> 8 0.04444444 0.14285714 0.07142857 0.16666667 0.26190476 0.29166667 0.35714286
#> 9 0.00000000 0.86666667 0.00000000 0.00000000 0.00000000 0.54166667 0.00000000
#> 10 0.00000000 0.25396825 0.00000000 0.02666667 0.00000000 0.32196970 0.00000000
#> 11 0.00000000 0.14285714 0.00000000 0.00000000 0.00000000 0.28703704 0.00000000
#> 12 0.27916667 0.01333333 0.47323232 0.57575758 0.46969697 0.13461538 0.53030303
#> 13 0.02222222 0.05252525 0.08211144 0.20816864 0.33333333 0.28174603 0.41176471
#> 14 0.30536131 0.00000000 0.56349206 0.45555556 0.52380952 0.00000000 0.54166667
#> 15 1.00000000 0.00000000 0.51942502 0.37606838 0.18803419 0.00000000 0.17216117
#> 16 0.00000000 1.00000000 0.00000000 0.00000000 0.00000000 0.48611111 0.00000000
#> 17 0.51942502 0.00000000 1.00000000 0.67195767 0.25925926 0.00000000 0.27635328
#> 18 0.37606838 0.00000000 0.67195767 1.00000000 0.35555556 0.05341880 0.42222222
#> 19 0.18803419 0.00000000 0.25925926 0.35555556 1.00000000 0.03703704 0.88888889
#> 20 0.00000000 0.48611111 0.00000000 0.05341880 0.03703704 1.00000000 0.03703704
#> 21 0.17216117 0.00000000 0.27635328 0.42222222 0.88888889 0.03703704 1.00000000
#> 22 0.27472527 0.00000000 0.36153846 0.50000000 0.58888889 0.02564103 0.70000000
#> 23 0.35277778 0.01449275 0.61991342 0.65353535 0.37777778 0.11313131 0.43333333
#> 22 23
#> 1 0.32407407 0.40476190
#> 2 0.48888889 0.89583333
#> 3 0.56818182 0.41025641
#> 4 0.08333333 0.17539683
#> 5 0.39682540 0.35555556
#> 6 0.31111111 0.64444444
#> 7 0.23397436 0.24475524
#> 8 0.52380952 0.33282828
#> 9 0.00000000 0.03030303
#> 10 0.16666667 0.08965517
#> 11 0.00000000 0.06060606
#> 12 0.54292929 0.69277389
#> 13 0.52287582 0.40740741
#> 14 0.51944444 0.52777778
#> 15 0.27472527 0.35277778
#> 16 0.00000000 0.01449275
#> 17 0.36153846 0.61991342
#> 18 0.50000000 0.65353535
#> 19 0.58888889 0.37777778
#> 20 0.02564103 0.11313131
#> 21 0.70000000 0.43333333
#> 22 1.00000000 0.52222222
#> 23 0.52222222 1.00000000

These functions measure the uncertainty, variability, and information content of interval-valued data.
data(mushroom.int)
# Shannon entropy (higher = more uncertainty)
int_entropy(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM 3.740953
# Coefficient of variation (SD / mean)
int_cv(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM 0.4179793
# Dispersion index
int_dispersion(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM 2.608696
# Imprecision: based on interval widths
int_imprecision(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> Imprecision 0.7882353
# Granularity: variability in interval sizes
int_granularity(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> Granularity 0.506144
# Uniformity: inverse of granularity (higher = more uniform)
int_uniformity(mushroom.int, "Stipe.Length")
#> Stipe.Length
#> Uniformity 0.6639471
# Normalized information content (between 0 and 1)
int_information_content(mushroom.int, "Stipe.Length", method = "CM")
#> Stipe.Length
#> CM 0.8269928
Distance functions compute dissimilarity between observations in interval-valued datasets. Available methods include euclidean, hausdorff, ichino, de_carvalho, and others.
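As a concrete illustration of one of these methods, the Hausdorff distance between two intervals \([a_l, a_u]\) and \([b_l, b_u]\) is \(\max(|a_l - b_l|, |a_u - b_u|)\). A minimal base-R sketch (illustrative only, not the `int_dist()` implementation):

```r
# Minimal base-R sketch of the interval Hausdorff distance
# (illustrative only; not the int_dist() implementation).
hausdorff_int <- function(a, b) {
  # a, b: numeric length-2 vectors c(lower, upper)
  max(abs(a[1] - b[1]), abs(a[2] - b[2]))
}

# Price intervals of the first two cars shown below: [260.5, 460] and [68.2, 140.3]
hausdorff_int(c(260.5, 460.0), c(68.2, 140.3))
#> [1] 319.7
```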
We use the interval columns of car.int for distance
examples (excluding the character Car column):
data(car.int)
car_num <- car.int[, 2:5]
head(car_num, 3)
#> # A tibble: 3 × 4
#> Price Max_Velocity Accn_Time Cylinder_Capacity
#> <symblc_n> <symblc_n> <symblc_n> <symblc_n>
#> 1 [260.50 : 460.00] [298.00 : 306.00] [4.70 : 5.00] [5,935.00 : 5,935.00]
#> 2 [68.20 : 140.30] [216.00 : 250.00] [6.70 : 9.70] [1,781.00 : 4,172.00]
#> 3 [123.80 : 171.40] [232.00 : 250.00] [5.40 : 10.10] [2,771.00 : 4,172.00]
# Euclidean distance between observations
int_dist(car_num, method = "euclidean")
#> 1 2 3 4 5 6 7
#> 2 2970.35864
#> 3 2473.41498 496.95918
#> 4 1849.03463 1121.84802 625.03745
#> 5 1405.70741 1569.15384 1073.25182 456.94909
#> 6 2861.17713 150.18676 399.17381 1017.65776 1456.19073
#> 7 3348.56490 378.47417 875.27157 1500.20546 1946.33817 496.67582
#> 8 2446.96685 528.63346 74.77202 604.37782 1043.31078 416.61899 904.08797
The hist_mean, hist_var,
hist_cov, and hist_cor functions compute
descriptive statistics for histogram-valued data (MatH
objects). All four functions support the same four methods:
BG (Bertrand and Goupil, 2000), BD
(Billard and Diday, 2006), B (Billard, 2008), and
L2W (L2 Wasserstein).
We compute the mean and variance of Cholesterol and
Hemoglobin in the BLOOD dataset using all four
methods.
all_methods <- c("BG", "BD", "B", "L2W")
var_names <- c("Cholesterol", "Hemoglobin")
# Compute mean for each variable and method
mean_mat <- sapply(all_methods, function(m) {
sapply(var_names, function(v) hist_mean(BLOOD, v, method = m))
})
rownames(mean_mat) <- var_names
mean_mat
#> BG BD B L2W
#> Cholesterol 180.67696 180.67696 180.67696 180.67696
#> Hemoglobin 12.36253 12.36253 12.36253 12.36252
# Compute variance for each variable and method
var_mat <- sapply(all_methods, function(m) {
sapply(var_names, function(v) hist_var(BLOOD, v, method = m))
})
rownames(var_mat) <- var_names
var_mat
#> BG BD B L2W
#> Cholesterol 1002.3393384 381.9587931 400.0263122 388.1376335
#> Hemoglobin 0.5465906 0.2731987 0.2784401 0.2802215The BG, BD, and B means are identical because they share the same first-order moment definition; only L2W (quantile-based) differs slightly. The variances, however, show large differences: BG is the largest because it includes within-histogram spread, while BD, B, and L2W progressively decrease.
bar_cols <- c("#4E79A7", "#59A14F", "#F28E2B", "#E15759")
par(mfrow = c(2, 2), mar = c(4, 5, 3, 1), las = 1)
# --- Mean: Cholesterol ---
bp <- barplot(mean_mat["Cholesterol", ], col = bar_cols, border = NA,
main = "Mean of Cholesterol", ylab = "Mean",
ylim = c(0, max(mean_mat["Cholesterol", ]) * 1.15))
text(bp, mean_mat["Cholesterol", ],
labels = round(mean_mat["Cholesterol", ], 2), pos = 3, cex = 0.8)
# --- Mean: Hemoglobin ---
bp <- barplot(mean_mat["Hemoglobin", ], col = bar_cols, border = NA,
main = "Mean of Hemoglobin", ylab = "Mean",
ylim = c(0, max(mean_mat["Hemoglobin", ]) * 1.15))
text(bp, mean_mat["Hemoglobin", ],
labels = round(mean_mat["Hemoglobin", ], 2), pos = 3, cex = 0.8)
# --- Variance: Cholesterol ---
bp <- barplot(var_mat["Cholesterol", ], col = bar_cols, border = NA,
main = "Variance of Cholesterol", ylab = "Variance",
ylim = c(0, max(var_mat["Cholesterol", ]) * 1.25))
text(bp, var_mat["Cholesterol", ],
labels = round(var_mat["Cholesterol", ], 1), pos = 3, cex = 0.8)
# --- Variance: Hemoglobin ---
bp <- barplot(var_mat["Hemoglobin", ], col = bar_cols, border = NA,
main = "Variance of Hemoglobin", ylab = "Variance",
ylim = c(0, max(var_mat["Hemoglobin", ]) * 1.25))
text(bp, var_mat["Hemoglobin", ],
      labels = round(var_mat["Hemoglobin", ], 4), pos = 3, cex = 0.8)
We compute the covariance and correlation between
Cholesterol and Hemoglobin using all four
methods.
cov_vec <- sapply(all_methods, function(m)
hist_cov(BLOOD, "Cholesterol", "Hemoglobin", method = m))
cor_vec <- sapply(all_methods, function(m)
hist_cor(BLOOD, "Cholesterol", "Hemoglobin", method = m))
data.frame(Method = all_methods,
Covariance = round(cov_vec, 4),
Correlation = round(cor_vec, 4),
row.names = NULL)
#> Method Covariance Correlation
#> 1 BG -5.1790 -0.2213
#> 2 BD -5.2660 -0.2250
#> 3 B -4.6927 -0.2005
#> 4    L2W    -5.0005     -0.4795
All four methods yield a negative association between Cholesterol and Hemoglobin. Following Irpino and Verde (2015, Eqs. 30–32), the BG, BD, and B correlations all use the Bertrand-Goupil standard deviation in the denominator, so their values are similar (around −0.20 to −0.22). Only L2W uses its own Wasserstein-based variance, which produces a different correlation.
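This can be checked by arithmetic on the values printed above (the constants below are copied from the output, not computed via a package call): dividing the BD covariance by the Bertrand-Goupil standard deviations recovers the BD correlation.

```r
# Sanity check using the printed values: BD correlation =
# BD covariance / (BG standard deviations of the two variables).
cov_bd  <- -5.2660
sd_chol <- sqrt(1002.3393384)  # sqrt of the BG variance of Cholesterol
sd_hemo <- sqrt(0.5465906)     # sqrt of the BG variance of Hemoglobin
round(cov_bd / (sd_chol * sd_hemo), 4)
#> [1] -0.225
```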
par(mfrow = c(1, 2), mar = c(4, 5, 3, 1), las = 1)
# --- Covariance ---
bp <- barplot(cov_vec, col = bar_cols, border = NA,
main = "Cov(Cholesterol, Hemoglobin)",
ylab = "Covariance",
ylim = c(min(cov_vec) * 1.35, 0))
text(bp, cov_vec, labels = round(cov_vec, 2), pos = 1, cex = 0.8)
# --- Correlation ---
bp <- barplot(cor_vec, col = bar_cols, border = NA,
main = "Cor(Cholesterol, Hemoglobin)",
ylab = "Correlation",
ylim = c(min(cor_vec) * 1.4, 0))
text(bp, cor_vec, labels = round(cor_vec, 2), pos = 1, cex = 0.8)
abline(h = -1, lty = 2, col = "grey50")
This section demonstrates how dataSDA datasets can be
used for benchmarking symbolic data analysis methods across four
analytical tasks: clustering (interval and histogram), classification,
and regression. Five representative datasets are selected for each task,
with no overlap among the interval-data tasks.
The aggregate_to_symbolic() function converts a
classical data frame into interval-valued or histogram-valued symbolic
data via grouping (clustering, resampling, or a categorical
variable).
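The core idea can be sketched in base R: group the classical rows and record the [min, max] of each variable per group (a simplified illustration, not the `aggregate_to_symbolic()` implementation).

```r
# Simplified base-R illustration of interval aggregation (not the
# aggregate_to_symbolic() implementation): [min, max] of a variable per Species.
rng <- aggregate(Sepal.Length ~ Species, data = iris,
                 FUN = function(v) c(lower = min(v), upper = max(v)))
rng$Sepal.Length  # matrix of interval bounds, one row per species
```

`aggregate_to_symbolic()` generalizes this to all numeric columns and to k-means or resampling groupings, and returns proper `symbolic_interval` columns.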
set.seed(42)
iris_int <- aggregate_to_symbolic(
iris,
type = "int",
group_by = "kmeans",
stratify_var = "Species",
K = 10
)
iris_int
#> # A tibble: 30 × 5
#> cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
#> * <chr> <symblc_n> <symblc_n> <symblc_n> <symblc_n>
#> 1 setosa.cluster_1 [4.90 : 5.20] [3.20 : 3.60] [1.20 : 1.50] [0.10 : 0.30]
#> 2 setosa.cluster_4 [4.70 : 5.00] [3.00 : 3.20] [1.40 : 1.60] [0.10 : 0.30]
#> 3 setosa.cluster_7 [4.30 : 4.70] [2.90 : 3.20] [1.10 : 1.50] [0.10 : 0.20]
#> 4 setosa.cluster_5 [5.40 : 5.70] [3.80 : 3.90] [1.30 : 1.70] [0.30 : 0.40]
#> 5 setosa.cluster_3 [4.60 : 4.60] [3.40 : 3.60] [1.00 : 1.40] [0.20 : 0.30]
#> 6 setosa.cluster_8 [5.20 : 5.50] [3.40 : 3.70] [1.30 : 1.70] [0.20 : 0.40]
#> 7 setosa.cluster_9 [4.80 : 5.10] [3.30 : 3.50] [1.60 : 1.90] [0.20 : 0.60]
#> 8 setosa.cluster_2 [5.50 : 5.80] [4.00 : 4.40] [1.20 : 1.50] [0.20 : 0.40]
#> 9 setosa.cluster_10 [5.10 : 5.20] [3.70 : 4.10] [1.50 : 1.90] [0.10 : 0.40]
#> 10 setosa.cluster_6 [4.50 : 4.50] [2.30 : 2.30] [1.30 : 1.30] [0.30 : 0.30]
#> # ℹ 20 more rows
The ggInterval package provides specialized plots for
symbolic data including index image plots, PCA biplots, and radar plots.
The following examples require ggInterval to be installed.
Note: with 30 observations the index image and PCA plots may take
several minutes to render.
library(ggInterval)
library(ggplot2)
# Keep only interval columns (drop the 'sample' label column).
# Fix zero-width intervals from singleton clusters.
iris_int_num <- iris_int[, sapply(iris_int, inherits, "symbolic_interval")]
for (v in colnames(iris_int_num)) {
cv <- unclass(iris_int_num[[v]])
w <- Im(cv) - Re(cv)
fix <- which(w == 0)
if (length(fix) > 0) {
cv[fix] <- complex(real = Re(cv[fix]) - 1e-6, imaginary = Im(cv[fix]) + 1e-6)
class(cv) <- c("symbolic_interval", "vctrs_vctr")
iris_int_num[[v]] <- cv
}
}
Index image plot – a heatmap of all interval variables:
PCA biplot – principal component analysis for interval data:
#> Call:
#> princomp(x = m[, 1:p])
#>
#> Standard deviations:
#> Comp.1 Comp.2 Comp.3 Comp.4
#> 1.6672324 1.0000839 0.4167622 0.1953058
#>
#> 4 variables and 480 observations.
Radar plot – multivariate comparison of interval
columns from environment.mix (observations 4 and 6):
data(environment.mix)
env_int <- environment.mix[, 5:17]
ggInterval_radarplot(env_int, plotPartial = c(4, 6),
                    showLegend = FALSE, addText = FALSE)
We plot the first 12 months of the irish_wind.its
dataset, showing each station’s wind speed interval as a bar with
midpoint lines.
data(irish_wind.its)
wind_sub <- irish_wind.its[1:12, ]
# Reshape to long format
stations <- c("BIR", "DUB", "KIL", "SHA", "VAL")
wind_long <- do.call(rbind, lapply(stations, function(st) {
data.frame(
month_num = seq_len(12),
Station = st,
lower = wind_sub[[paste0(st, "_l")]],
upper = wind_sub[[paste0(st, "_u")]],
mid = (wind_sub[[paste0(st, "_l")]] + wind_sub[[paste0(st, "_u")]]) / 2
)
}))
wind_long$Station <- factor(wind_long$Station, levels = stations)
# Dodge bars for each station within each month
n_st <- length(stations)
bar_w <- 0.6 / n_st
wind_long$st_idx <- as.numeric(wind_long$Station)
wind_long$x <- wind_long$month_num +
(wind_long$st_idx - (n_st + 1) / 2) * bar_w
ggplot(wind_long) +
geom_rect(aes(xmin = x - bar_w / 2, xmax = x + bar_w / 2,
ymin = lower, ymax = upper, fill = Station),
alpha = 0.4, color = NA) +
geom_line(aes(x = x, y = mid, color = Station, group = Station),
linewidth = 0.5) +
geom_point(aes(x = x, y = mid, color = Station), size = 1) +
scale_x_continuous(breaks = 1:12, labels = month.abb) +
labs(title = "Irish Wind Speed Intervals (1961)",
x = "Month", y = "Wind Speed (knots)") +
  theme_grey(base_size = 12)
We benchmark three clustering algorithms on five interval-valued datasets using the quality index \(1 - \text{WSS}/\text{TSS}\):
- RSDA::sym.kmeans() – K-means for symbolic data
- symbolicDA::DClust() – Distance-based symbolic clustering
- symbolicDA::SClust() – Symbolic clustering
Each method independently determines its own optimal number of clusters \(k\) via an n-adaptive elbow method. For each method, we sweep \(k\) from 2 to \(k_{\max} = \min(n-1,\, 10,\, \max(3,\, \lfloor n/5 \rfloor))\) and compute the quality index at each \(k\). The elbow is detected using an absolute gain threshold \(\tau = \Delta_{\max} / (1 + n/100)\), where \(\Delta_{\max}\) is the largest quality gain across all \(k\). A 2-step lookahead skips temporary dips. This yields a higher threshold (fewer clusters) for small datasets and a lower threshold (more clusters allowed) for large datasets.
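The elbow rule can be sketched in base R as follows. This is an illustrative approximation of the package's internal `.find_optimal_k` helper under the stated threshold and lookahead, not its exact code:

```r
# Illustrative sketch of the n-adaptive elbow rule (not the package's
# internal .find_optimal_k). quality: named vector indexed by k = 2..k_max.
find_elbow_k <- function(quality, n) {
  ks <- as.integer(names(quality))
  gains <- diff(quality)             # quality gain when moving from k to k + 1
  if (length(gains) == 0) return(ks[1])
  tau <- max(gains) / (1 + n / 100)  # adaptive threshold: higher for small n
  best <- ks[1]
  i <- 1
  while (i <= length(gains)) {
    # 2-step lookahead: keep growing k if this gain, or one of the
    # next two, still exceeds the threshold (skips temporary dips)
    window <- gains[i:min(i + 2, length(gains))]
    if (max(window) >= tau) { best <- ks[i + 1]; i <- i + 1 } else break
  }
  best
}

# Toy quality curve: clear gains up to k = 4, then flat
q <- c(`2` = 0.50, `3` = 0.70, `4` = 0.82, `5` = 0.83, `6` = 0.835)
find_elbow_k(q, n = 100)
#> [1] 4
```

With the same curve and a small dataset (say `n = 10`), the higher threshold stops at k = 3, matching the "fewer clusters for small datasets" behavior described above.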
library(symbolicDA)
set.seed(123)
datasets_clust_int <- list(
list(name = "face.iGAP", data = "face.iGAP"),
list(name = "prostate.int", data = "prostate.int"),
list(name = "nycflights.int", data = "nycflights.int"),
list(name = "china_temp.int", data = "china_temp.int"),
list(name = "lisbon_air_quality.int", data = "lisbon_air_quality.int")
)
clust_int_results <- do.call(rbind, lapply(datasets_clust_int, function(ds) {
tryCatch({
data(list = ds$data)
x <- get(ds$data)
if (!inherits(x, "symbolic_tbl")) {
x <- tryCatch(int_convert_format(x, to = "RSDA"), error = function(e) x)
for (i in seq_along(x)) {
if (is.complex(x[[i]]) && !inherits(x[[i]], "symbolic_interval"))
class(x[[i]]) <- c("symbolic_interval", "vctrs_vctr")
}
if (!inherits(x, "symbolic_tbl"))
class(x) <- c("symbolic_tbl", class(x))
}
x_int <- .get_interval_cols(x)
n <- nrow(x_int); p <- ncol(x_int)
k_max <- min(n - 1, 10, max(3, floor(n / 5)))
d <- int_dist_matrix(x_int, method = "hausdorff")
so <- simple2SO(.to_3d_array(x_int))
km_qs <- dc_qs <- sc_qs <- setNames(rep(NA_real_, k_max - 1),
as.character(2:k_max))
for (k in 2:k_max) {
set.seed(123)
km_qs[as.character(k)] <- tryCatch({
res <- sym.kmeans(x_int, k = k)
1 - res$tot.withinss / res$totss
}, error = function(e) NA)
set.seed(123)
dc_qs[as.character(k)] <- tryCatch({
cl <- DClust(d, cl = k, iter = 100)
.clust_quality(d, cl)
}, error = function(e) NA)
set.seed(123)
sc_qs[as.character(k)] <- tryCatch({
cl <- SClust(so, cl = k, iter = 100)
.clust_quality(d, cl)
}, error = function(e) NA)
}
km_k <- .find_optimal_k(km_qs, n); km_q <- km_qs[as.character(km_k)]
dc_k <- .find_optimal_k(dc_qs, n); dc_q <- dc_qs[as.character(dc_k)]
sc_k <- .find_optimal_k(sc_qs, n); sc_q <- sc_qs[as.character(sc_k)]
data.frame(Dataset = ds$name, n = n, p = p,
sym.kmeans = sprintf("%.4f (%d)", km_q, km_k),
DClust = sprintf("%.4f (%d)", dc_q, dc_k),
SClust = sprintf("%.4f (%d)", sc_q, sc_k))
}, error = function(e) NULL)
}))
kable(clust_int_results, row.names = FALSE,
      caption = "Table 4: Interval clustering quality (1 - WSS/TSS) with optimal k in parentheses")
| Dataset | n | p | sym.kmeans | DClust | SClust |
|---|---|---|---|---|---|
| face.iGAP | 27 | 6 | 0.6622 (3) | 0.6565 (4) | 0.6730 (4) |
| prostate.int | 97 | 9 | 0.8504 (3) | 0.8066 (5) | 0.8117 (4) |
| nycflights.int | 142 | 4 | 0.9540 (4) | 0.8141 (5) | 0.9061 (4) |
| china_temp.int | 899 | 4 | 0.9058 (5) | 0.8257 (5) | 0.8994 (7) |
| lisbon_air_quality.int | 1096 | 8 | 0.9191 (6) | 0.8944 (6) | 0.8893 (6) |
We benchmark three clustering algorithms on five histogram-valued
datasets from dataSDA. Each dataset is converted from
dataSDA’s histogram string format to HistDAWass::MatH
objects for analysis:
- WH_kmeans() – K-means for histogram data
- WH_fcmeans() – Fuzzy C-means for histogram data
- WH_hclust() – Hierarchical clustering with Wasserstein distance
The same n-adaptive elbow method from Section 4.2 is used for each method to independently select its optimal \(k\).
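Both benchmark loops score a partition through a helper (`.clust_quality`) that computes \(1 - \text{WSS}/\text{TSS}\) from a distance matrix alone. A plausible base-R sketch (not the package's internal code) uses the identity that a cluster's sum of squared deviations from its centroid equals the sum of squared pairwise distances divided by twice the cluster size:

```r
# Plausible sketch (not the internal .clust_quality) of 1 - WSS/TSS computed
# from a distance matrix d and cluster labels cl, via the pairwise identity.
clust_quality <- function(d, cl) {
  d <- as.matrix(d)
  n <- nrow(d)
  tss <- sum(d^2) / (2 * n)                    # total sum of squares
  wss <- sum(sapply(unique(cl), function(g) {  # within-cluster sum of squares
    idx <- which(cl == g)
    sum(d[idx, idx]^2) / (2 * length(idx))
  }))
  1 - wss / tss
}

# Two perfectly separated 1-D clusters give a quality index of 1
clust_quality(dist(c(0, 0, 10, 10)), c(1, 1, 2, 2))
#> [1] 1
```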
set.seed(123)
datasets_clust_hist <- list(
list(name = "age_pyramids.hist"),
list(name = "ozone.hist"),
list(name = "china_climate_season.hist"),
list(name = "french_agriculture.hist"),
list(name = "flights_detail.hist")
)
clust_hist_results <- do.call(rbind, lapply(datasets_clust_hist, function(ds) {
tryCatch({
data(list = ds$name, package = "dataSDA")
raw <- get(ds$name)
x <- .dataSDA_hist_to_MatH(raw)
n <- nrow(x@M); p <- ncol(x@M)
k_max <- min(n - 1, 10, max(3, floor(n / 5)))
# Precompute Wasserstein distance matrix and hclust tree (shared across k)
dm <- WH_MAT_DIST(x)
set.seed(123)
hc <- WH_hclust(x, simplify = TRUE)
km_qs <- fc_qs <- hc_qs <- setNames(rep(NA_real_, k_max - 1),
as.character(2:k_max))
for (k in 2:k_max) {
set.seed(123)
km_qs[as.character(k)] <- tryCatch({
res <- WH_kmeans(x, k = k)
res$quality
}, error = function(e) NA)
set.seed(123)
fc_qs[as.character(k)] <- tryCatch({
res <- WH_fcmeans(x, k = k)
res$quality
}, error = function(e) NA)
set.seed(123)
hc_qs[as.character(k)] <- tryCatch({
cl <- cutree(hc, k = k)
.clust_quality(dm, cl)
}, error = function(e) NA)
}
km_k <- .find_optimal_k(km_qs, n); km_q <- km_qs[as.character(km_k)]
fc_k <- .find_optimal_k(fc_qs, n); fc_q <- fc_qs[as.character(fc_k)]
hc_k <- .find_optimal_k(hc_qs, n); hc_q <- hc_qs[as.character(hc_k)]
data.frame(Dataset = ds$name, n = n, p = p,
WH_kmeans = sprintf("%.4f (%d)", km_q, km_k),
WH_fcmeans = sprintf("%.4f (%d)", fc_q, fc_k),
WH_hclust = sprintf("%.4f (%d)", hc_q, hc_k))
}, error = function(e) NULL)
}))
kable(clust_hist_results, row.names = FALSE,
      caption = "Table 5: Histogram clustering quality (1 - WSS/TSS) with optimal k in parentheses")
| Dataset | n | p | WH_kmeans | WH_fcmeans | WH_hclust |
|---|---|---|---|---|---|
| age_pyramids.hist | 229 | 3 | 0.8703 (4) | 0.8396 (3) | 0.8623 (5) |
| ozone.hist | 78 | 4 | 0.7571 (3) | 0.7613 (3) | 0.8002 (4) |
| china_climate_season.hist | 60 | 56 | 0.4822 (3) | 0.4405 (3) | 0.5137 (4) |
| french_agriculture.hist | 22 | 4 | 0.7753 (4) | 0.6389 (3) | 0.6371 (3) |
| flights_detail.hist | 16 | 5 | 0.7071 (3) | 0.6992 (3) | 0.6874 (3) |
We benchmark three classifiers on five interval-valued datasets and report resubstitution accuracy:
- MAINT.Data::lda() – Linear discriminant analysis for interval data
- MAINT.Data::qda() – Quadratic discriminant analysis for interval data
- e1071::svm() – Support vector machine on lower/upper bound features
library(MAINT.Data)
library(e1071)
datasets_class <- list(
list(name = "cars.int", data = "cars.int",
class_col = "class",
class_desc = "class: Utilitarian(7), Berlina(8), Sportive(8), Luxury(4)"),
list(name = "china_temp.int", data = "china_temp.int",
class_col = "GeoReg",
class_desc = "GeoReg: 6 regions"),
list(name = "mushroom.int", data = "mushroom.int",
class_col = "Edibility",
class_desc = "Edibility: T(4), U(2), Y(17)"),
list(name = "ohtemp.int", data = "ohtemp.int",
class_col = "STATE",
class_desc = "STATE: 10 groups"),
list(name = "wine.int", data = "wine.int",
class_col = "class",
class_desc = "class: 1(21), 2(12)")
)
class_results <- do.call(rbind, lapply(datasets_class, function(ds) {
tryCatch({
data(list = ds$data)
x <- get(ds$data)
grp <- .get_class_labels(x, ds$class_col)
idata <- .build_IData(x)
int_cols <- sapply(x, function(col) inherits(col, "symbolic_interval"))
svm_df <- data.frame(row.names = seq_len(nrow(x)))
for (v in names(x)[int_cols]) {
cv <- unclass(x[[v]])
svm_df[[paste0(v, "_l")]] <- Re(cv)
svm_df[[paste0(v, "_u")]] <- Im(cv)
}
set.seed(123)
lda_acc <- tryCatch({
res <- MAINT.Data::lda(idata, grouping = grp)
pred <- predict(res, idata)
mean(pred$class == grp)
}, error = function(e) NA)
set.seed(123)
qda_acc <- tryCatch({
res <- MAINT.Data::qda(idata, grouping = grp)
pred <- predict(res, idata)
mean(pred$class == grp)
}, error = function(e) NA)
set.seed(123)
svm_acc <- tryCatch({
svm_df$class <- grp
res <- svm(class ~ ., data = svm_df, kernel = "radial")
pred <- predict(res, svm_df)
mean(pred == grp)
}, error = function(e) NA)
data.frame(Dataset = ds$name, Response = ds$class_desc,
LDA = lda_acc, QDA = qda_acc, SVM = svm_acc)
}, error = function(e) NULL)
}))
kable(class_results, digits = 4, row.names = FALSE,
      caption = "Table 6: Classification accuracy (resubstitution)")
| Dataset | Response | LDA | QDA | SVM |
|---|---|---|---|---|
| cars.int | class: Utilitarian(7), Berlina(8), Sportive(8), Luxury(4) | 0.9259 | 0.9259 | 0.7778 |
| china_temp.int | GeoReg: 6 regions | 0.6952 | 0.5495 | 0.8087 |
| mushroom.int | Edibility: T(4), U(2), Y(17) | 0.8261 | 0.6957 | 0.7391 |
| ohtemp.int | STATE: 10 groups | 0.4720 | 0.1801 | 0.4596 |
| wine.int | class: 1(21), 2(12) | 0.9091 | 0.6667 | 0.9697 |
We benchmark five regression methods on five interval-valued datasets and report \(R^2\):
- RSDA::sym.lm() – Symbolic linear regression (center method)
- RSDA::sym.glm() – LASSO regression via glmnet (center method)
- RSDA::sym.rf() – Symbolic random forest
- RSDA::sym.rt() – Symbolic regression tree
- RSDA::sym.nnet() – Symbolic neural network
datasets_reg <- list(
list(name = "abalone.iGAP", data = "abalone.iGAP",
response = "Length", n_x = 6),
list(name = "cardiological.int", data = "cardiological.int",
response = "pulse", n_x = 4),
list(name = "nycflights.int", data = "nycflights.int",
response = "distance", n_x = 3),
list(name = "oils.int", data = "oils.int",
response = "specific_gravity", n_x = 3),
list(name = "prostate.int", data = "prostate.int",
response = "lpsa", n_x = 8)
)
reg_results <- do.call(rbind, lapply(datasets_reg, function(ds) {
tryCatch({
data(list = ds$data)
x <- get(ds$data)
if (!inherits(x, "symbolic_tbl")) {
x2 <- tryCatch(int_convert_format(x, to = "RSDA"), error = function(e) NULL)
if (!is.null(x2)) {
x <- x2
for (i in seq_along(x)) {
if (is.complex(x[[i]]) && !inherits(x[[i]], "symbolic_interval"))
class(x[[i]]) <- c("symbolic_interval", "vctrs_vctr")
}
if (!inherits(x, "symbolic_tbl"))
class(x) <- c("symbolic_tbl", class(x))
} else {
cn <- colnames(x)
l_cols <- grep("_l$", cn, value = TRUE)
vars <- sub("_l$", "", l_cols)
out <- data.frame(row.names = seq_len(nrow(x)))
for (v in vars) {
lv <- x[[paste0(v, "_l")]]; uv <- x[[paste0(v, "_u")]]
si <- complex(real = lv, imaginary = uv)
class(si) <- c("symbolic_interval", "vctrs_vctr")
out[[v]] <- si
}
class(out) <- c("symbolic_tbl", class(out))
x <- out
}
}
x_int <- .get_interval_cols(x)
fml <- as.formula(paste(ds$response, "~ ."))
nc <- data.frame(row.names = seq_len(nrow(x_int)))
for (v in names(x_int)) {
cv <- unclass(x_int[[v]])
nc[[v]] <- (Re(cv) + Im(cv)) / 2
}
actual <- nc[[ds$response]]
resp_idx <- which(names(x_int) == ds$response)
.r2 <- function(a, p) 1 - sum((a - p)^2) / sum((a - mean(a))^2)
set.seed(123)
lm_r2 <- tryCatch({
res <- sym.lm(fml, sym.data = x_int, method = "cm")
summary(res)$r.squared
}, error = function(e) NA)
set.seed(123)
glm_r2 <- tryCatch({
res <- sym.glm(sym.data = x_int, response = resp_idx, method = "cm")
pred <- as.numeric(predict(res, newx = as.matrix(nc[, -resp_idx]),
s = "lambda.min"))
.r2(actual, pred)
}, error = function(e) NA)
set.seed(123)
rf_r2 <- tryCatch({
res <- sym.rf(fml, sym.data = x_int, method = "cm")
tail(res$rsq, 1)
}, error = function(e) NA)
set.seed(123)
rt_r2 <- tryCatch({
res <- sym.rt(fml, sym.data = x_int, method = "cm")
.r2(actual, predict(res))
}, error = function(e) NA)
set.seed(123)
nnet_r2 <- tryCatch({
res <- sym.nnet(fml, sym.data = x_int, method = "cm")
pred_sc <- as.numeric(res$net.result[[1]])
pred <- pred_sc * res$data_c_sds[resp_idx] + res$data_c_means[resp_idx]
.r2(actual, pred)
}, error = function(e) NA)
data.frame(Dataset = ds$name, Response = ds$response, p = ds$n_x,
sym.lm = lm_r2, sym.glm = glm_r2, sym.rf = rf_r2,
sym.rt = rt_r2, sym.nnet = nnet_r2)
}, error = function(e) NULL)
}))
| Dataset | Response | p | sym.lm | sym.glm | sym.rf | sym.rt | sym.nnet |
| abalone.iGAP | Length | 6 | 0.9893 | 0.9882 | 0.8976 | 0.6453 | 0.9932 |
| cardiological.int | pulse | 4 | 0.3772 | 0.3682 | 0.8014 | 0.5610 | 0.9996 |
| nycflights.int | distance | 3 | 0.9912 | 0.9910 | 0.9177 | 0.9857 | 0.9930 |
| oils.int | specific_gravity | 3 | 0.9115 | 0.6530 | 0.6577 | 0.0000 | 0.9679 |
| prostate.int | lpsa | 8 | 0.6622 | 0.6616 | 0.5299 | 0.6946 | 0.9879 |
We welcome contributions of high-quality datasets for symbolic data analysis. Submitted datasets will be made publicly available (or under specified constraints) to support research in machine learning, statistics, and related fields. You can submit the related files via email to wuhm@g.nccu.edu.tw or through the Google Form at Symbolic Dataset Submission Form. The submission requirements are as follows.
Dataset Format:
- .csv, .xlsx, or any symbolic format in plain text.
- Compressed archives (.zip or .gz) if multiple files are included.
| Field | Description | Example |
|---|---|---|
| Dataset Name | A clear, descriptive title. | “face recognition data” |
| Dataset Short Name | A clear, abbreviated title. | “face data” |
| Authors | Full names of the donor(s). | “First name, Last name” |
| Email | Contact email. | “abc123@gmail.com” |
| Institutes | Affiliated organizations. | “-” |
| Country | Origin of the dataset. | “France” |
| Dataset Descriptions | Descriptive summary of the data. | See ‘README’ |
| Sample Size | Number of instances/rows. | 27 |
| Number of Variables | Total features/columns (categorical/numeric). | 6 (interval) |
| Missing Values | Indicate if missing values exist and how they are handled. | “None” / “Yes, marked as NA” |
| Variable Descriptions | Detailed description of each column (name, type, units, range). | See ‘README’ |
| Source | Original data source (if applicable). | “Leroy et al. (1996)” |
| References | Citations for prior work using the dataset. | “Douzal-Chouakria, Billard, and Diday (2011)” |
| Applied Areas | Relevant fields (e.g., biology, finance). | “Machine Learning” |
| Usage Constraints | Licensing (CC-BY, MIT) or restrictions. | “Public domain” |
| Data Link | URL to download the dataset (Google Drive, GitHub, etc.). | “(https)” |
Quality Assurance:
Optional (Recommended):
README file with:
Po-Wei Chen, Chun-houh Chen, Han-Ming Wu (2026), dataSDA: datasets and basic statistics for symbolic data analysis in R (v0.2.4). Technical report.