Title: | Analysis of Symbolic Data |
---|---|
Description: | Symbolic data analysis methods: importing/exporting data from ASSO XML Files, distance calculation for symbolic data (Ichino-Yaguchi, de Carvalho measure), zoom star plot, 3d interval plot, multidimensional scaling for symbolic interval data, dynamic clustering based on distance matrix, HINoV method for symbolic data, Ichino's feature selection method, principal component analysis for symbolic interval data, decision trees for symbolic data based on optimal split with bagging, boosting and random forest approach (+visualization), kernel discriminant analysis for symbolic data, Kohonen's self-organizing maps for symbolic data, replication and profiling, artificial symbolic data generation. (Milligan, G.W., Cooper, M.C. (1985) <doi:10.1007/BF02294245>, Breiman, L. (1996), <doi:10.1007/BF00058655>, Hubert, L., Arabie, P. (1985), <doi:10.1007/BF01908075>, Ichino, M., & Yaguchi, H. (1994), <doi:10.1109/21.286391>, Rand, W.M. (1971) <doi:10.1080/01621459.1971.10482356>, Calinski, T., Harabasz, J. (1974) <doi:10.1080/03610927408827101>, Breckenridge, J.N. (2000) <doi:10.1207/S15327906MBR3502_5>, Groenen, P.J.F, Winsberg, S., Rodriguez, O., Diday, E. (2006) <doi:10.1016/j.csda.2006.04.003>, Walesiak, M., Dudek, A. (2008) <doi:10.1007/978-3-540-78246-9_11>, Dudek, A. (2007), <doi:10.1007/978-3-540-70981-7_4>). |
Authors: | Andrzej Dudek, Marcin Pelka <[email protected]>, Justyna Wilk <[email protected]> (to 2017-09-20), Marek Walesiak <[email protected]> (from 2018-02-01) |
Maintainer: | Andrzej Dudek <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.7-1 |
Built: | 2025-03-08 03:47:00 UTC |
Source: | https://github.com/cran/symbolicDA |
Bagging algorithm for optimal split based on decision (classification) tree for symbolic objects
bagging.SDA(sdt,formula,testSet, mfinal=20,rf=FALSE,...)
sdt |
Symbolic data table |
formula |
formula as in lm function |
testSet |
a vector of integers indicating the classes to which the objects are allocated in the learning set |
mfinal |
number of partial models generated |
rf |
random forest like drawing of variables in partial models |
... |
arguments passed to decisionTree.SDA function |
Bagging, which stands for bootstrap aggregating, was introduced by Breiman in 1996. The diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn with replacement from the entire training data set. Each training data subset is then used to train a decision tree (classifier). The individual classifiers are combined by taking a simple majority vote of their decisions. For any given instance, the class chosen by the largest number of classifiers is the ensemble decision.
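The bootstrap-and-vote scheme described above can be sketched in base R. This is only an illustration under stated assumptions: a nearest-centroid rule (the hypothetical helpers `fit_centroids` and `predict_centroids`) stands in for decisionTree.SDA, and the classical `iris` data stand in for a symbolic data table. The result names `predclass`, `confusion` and `error` simply mirror the components listed for the returned object.

```r
# Sketch only: a nearest-centroid rule stands in for decisionTree.SDA,
# and iris stands in for a symbolic data table. Not the package's code.
set.seed(1)

# Hypothetical base learner: one centroid (mean vector) per class.
fit_centroids <- function(x, y) {
  sums <- rowsum(x, y)
  sums / as.vector(table(y)[rownames(sums)])
}

# Assign each row of newx to the class of the nearest centroid.
predict_centroids <- function(centers, newx) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(newx, 2, centers[k, ])^2)
  })
  rownames(centers)[apply(d, 1, which.min)]
}

x <- as.matrix(iris[, 1:4])
y <- as.character(iris$Species)
train <- sample(nrow(x), 100)
test <- setdiff(seq_len(nrow(x)), train)

mfinal <- 20  # number of partial models, as in the mfinal argument
votes <- sapply(seq_len(mfinal), function(m) {
  boot <- sample(train, replace = TRUE)  # bootstrap replica of the training set
  predict_centroids(fit_centroids(x[boot, ], y[boot]), x[test, ])
})

# Simple majority vote across the partial models for each test object.
predclass <- apply(votes, 1, function(v) names(which.max(table(v))))
confusion <- table(predicted = predclass, observed = y[test])
error <- mean(predclass != y[test])
```

Each column of `votes` holds the predictions of one partial model, so the majority vote is taken row by row.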
An object of class bagging.SDA, which is a list with the following components:
predclass |
the class predicted by the ensemble classifier |
confusion |
the confusion matrix for ensemble classifier |
error |
the classification error |
pred |
? |
classfinal |
final class memberships |
Andrzej Dudek [email protected] Marcin Pełka [email protected]
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Breiman L. (1996), Bagging predictors, Machine Learning, vol. 24, no. 2, pp. 123-140. Available at: doi:10.1007/BF00058655.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
boosting.SDA, random.forest.SDA, decisionTree.SDA
#Example will be available in next version of package, thank You for your patience :-)
Boosting algorithm for optimal split based decision tree for symbolic objects, "symbolic" version of adabag.M1 algorithm
boosting.SDA(sdt,formula,testSet, mfinal = 20,...)
sdt |
Symbolic data table |
formula |
formula as in lm function |
testSet |
a vector of integers indicating the classes to which the objects are allocated in the learning set |
mfinal |
number of partial models generated |
... |
arguments passed to decisionTree.SDA function |
Boosting, similar to bagging, also creates an ensemble of classifiers by resampling the data. The results are then combined by majority voting. Resampling in boosting provides the most informative training data for each consecutive classifier. In each iteration of boosting three weak classifiers are created: the first classifier C1 is trained with a random subset of the training data. The training data subset for the next classifier C2 is chosen as the most informative subset, given C1: C2 is trained on training data of which only half is correctly classified by C1 and the other half is misclassified. The third classifier C3 is trained with instances on which C1 and C2 disagree. The three classifiers are then combined through a three-way majority vote.
formula |
a symbolic description of the model that was used |
trees |
trees built while making the ensemble |
weights |
weights for each object from test set |
votes |
final consensus clustering |
class |
predicted class memberships |
error |
error rate of the ensemble clustering |
Andrzej Dudek [email protected] Marcin Pełka [email protected]
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
bagging.SDA, random.forest.SDA, decisionTree.SDA
#Example will be available in next version of package, thank You for your patience :-)
symbolic data set: 30 observations on 12 symbolic variables - 9 interval-valued and 3 multinominal variables; the third dimension represents the beginning and the end of the interval for an interval-valued variable, or a set of categories for a multinominal variable
symbolic data table (see symbolic.object)
The original data on 30 selected car models and their prices, chassis and engine types were collected from the websites of authorized car dealers. Then the data were converted (aggregated) to symbolic format (second order symbolic objects). Each symbolic object - e.g. "Seat Leon", "Citroen C4" - represents all chassis, engine types and the price range of this kind of car model available on the Polish market in 2010. For example the price range [54,900; 96,190] PLN, hatchback and saloon body style, petrol and diesel engine, acceleration 0-100 kph range [10.00; 11.90] seconds are, in general, the characteristics of "Toyota Corolla".
# LONG RUNNING - UNCOMMENT TO RUN
#data("cars",package="symbolicDA")
#sdt<-cars
#r<- HINoV.SDA(sdt, u=5, distance="U_3")
#print(r$stopri)
#plot(r$stopri[,2], xlab="Variable number", ylab="topri",
#xaxt="n", type="b")
#axis(1,at=c(1:max(r$stopri[,1])),labels=r$stopri[,1])
description of clusters of symbolic objects is obtained by a generalisation operation using in most cases descriptive statistics calculated separately for each cluster and each symbolic variable.
cluster.Description.SDA(table.Symbolic, clusters, precission=3)
table.Symbolic |
Symbolic data table |
clusters |
a vector of integers indicating the cluster to which each object is allocated |
precission |
Number of digits to round the results |
A List of cluster numbers, variable number and labels.
The description of clusters of symbolic objects differs according to the symbolic variable type:
- for interval-valued variable:
"min value" - minimum value of the lower-bounds of intervals observed for objects belonging to the cluster
"max value" - maximum value of the upper-bounds of intervals observed for objects belonging to the cluster
- for multinominal variable:
"categories" - list of all categories of the variable observed for symbolic objects belonging to the cluster
- for multinominal with weights variable:
"min probabilities" - minimum weight of each category of the variable observed for objects belonging to the cluster
"max probabilities" - maximum weight of each category of the variable observed for objects belonging to the cluster
"avg probabilities" - average weight of each category of the variable calculated for objects belonging to the cluster
"sum probabilities" - sum of weights of each category of the variable calculated for objects belonging to the cluster
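The interval-valued case of the list above can be sketched in base R. This is a minimal illustration, assuming the simple 3-dimensional array form of interval data (objects x variables x lower/upper bounds); the helper `describe_intervals` and the toy data are hypothetical and do not reproduce the output format of cluster.Description.SDA.

```r
# Toy symbolic interval data: 4 objects x 2 variables x (lower, upper).
x <- array(
  c(1, 2, 8, 9,       # variable v1, lower bounds
    10, 12, 30, 31,   # variable v2, lower bounds
    3, 4, 10, 12,     # variable v1, upper bounds
    15, 14, 35, 40),  # variable v2, upper bounds
  dim = c(4, 2, 2),
  dimnames = list(paste0("obj", 1:4), c("v1", "v2"), c("lower", "upper"))
)
clusters <- c(1, 1, 2, 2)

# Hypothetical helper: per cluster and variable, the minimum of the lower
# bounds ("min value") and the maximum of the upper bounds ("max value").
describe_intervals <- function(x, clusters, digits = 3) {
  lapply(split(seq_len(dim(x)[1]), clusters), function(idx) {
    round(rbind(
      "min value" = apply(x[idx, , "lower", drop = FALSE], 2, min),
      "max value" = apply(x[idx, , "upper", drop = FALSE], 2, max)
    ), digits)
  })
}

desc <- describe_intervals(x, clusters)
print(desc)
```

For cluster 1 this gives the generalised intervals [1, 4] for v1 and [10, 15] for v2.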
Andrzej Dudek [email protected], Justyna Wilk [email protected] Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard, L., Diday, E. (eds.) (2006), Symbolic Data Analysis. Conceptual Statistics and Data Mining, Wiley, Chichester.
Verde, R., Lechevallier, Y., Chavent, M. (2003), Symbolic clustering interpretation and visualization, "The Electronic Journal of Symbolic Data Analysis", Vol. 1, No 1.
Bock, H.H., Diday, E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
SClust, DClust; hclust in stats library; pam in cluster library
# LONG RUNNING - UNCOMMENT TO RUN
#data("cars",package="symbolicDA")
#y<-cars
#cl<-SClust(y, 4, iter=150)
#print(cl)
#o<-cluster.Description.SDA(y, cl)
#print(o)
Artificially generated symbolic interval data
3-dimensional array: 125 objects, 6 variables, third dimension represents the beginning and end of each interval, 5-class structure
Artificially generated data
Dynamical clustering of objects described by symbolic and/or classic (metric, non-metric) variables based on distance matrix
DClust(dist, cl, iter=100)
dist |
distance matrix |
cl |
number of clusters or vector with initial prototypes of clusters |
iter |
maximum number of iterations |
See file ../doc/DClust_details.pdf for further details
a vector of integers indicating the cluster to which each object is allocated
Andrzej Dudek [email protected], Justyna Wilk [email protected] Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Bock, H.H., Diday, E. (eds.) (2000), Analysis of Symbolic Data. Explanatory Methods for Extracting Statistical Information from Complex Data, Springer-Verlag, Berlin.
Diday, E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester, pp. 191-204.
Diday, E. (1971), La methode des Nuees dynamiques, Revue de Statistique Appliquee, Vol. 19-2, pp. 19-34.
Celeux, G., Diday, E., Govaert, G., Lechevallier, Y., Ralambondrainy, H. (1988), Classification Automatique des Donnees, Environnement Statistique et Informatique - Dunod, Gauthier-Villars, Paris.
SClust, dist_SDA; dist in stats library; dist.GDM in clusterSim library; pam in cluster library
# LONG RUNNING - UNCOMMENT TO RUN
#data("cars",package="symbolicDA")
#sdt<-cars
#dist<-dist_SDA(sdt, type="U_3")
#clust<-DClust(dist, cl=5, iter=100)
#print(clust)
Optimal split based decision tree for symbolic objects
decisionTree.SDA(sdt,formula,testSet,treshMin=0.0001,treshW=-1e10, tNodes=NULL,minSize=2,epsilon=1e-4,useEM=FALSE, multiNominalType="ordinal",rf=FALSE,rf.size,objectSelection)
sdt |
Symbolic data table |
formula |
formula as in lm function |
testSet |
a vector of integers indicating the classes to which the objects are allocated in the learning set |
treshMin |
parameter for tree creation algorithm |
treshW |
parameter for tree creation algorithm |
tNodes |
parameter for tree creation algorithm |
minSize |
parameter for tree creation algorithm |
epsilon |
parameter for tree creation algorithm |
useEM |
use the Expectation-Maximization (EM) algorithm for estimating conditional probabilities |
multiNominalType |
"ordinal" - the function treats multi-nominal data as ordered; "nominal" - the function treats multi-nominal data as unordered (longer running times) |
rf |
if TRUE symbolic variables for tree creation are randomly chosen like in random forest algorithm |
rf.size |
the number of variables chosen for tree creation if rf is true |
objectSelection |
optional, vector with symbolic object numbers for tree creation |
For further details see ../doc/decisionTree_SDA.pdf
nodes |
nodes in tree |
nodeObjects |
contribution of each object to the nodes in the tree |
conditionalProbab |
conditional probabilities of nodes belonging to the classes |
prediction |
predicted classes for objects from testSet |
Andrzej Dudek [email protected] Marcin Pelka [email protected]
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
bagging.SDA, boosting.SDA, random.forest.SDA, draw.decisionTree.SDA
# Example 1
# LONG RUNNING - UNCOMMENT TO RUN
# File samochody.xml needed in this example
# can be found in /inst/xml library of package
#sda<-parse.SO("samochody")
#tree<-decisionTree.SDA(sda, "Typ_samochodu~.", testSet=1:33)
#summary(tree) # very general information
#tree # summary information
calculates distances between symbolic objects described by interval-valued, multinominal and multinominal with weights variables
dist_SDA(table.Symbolic,type="U_2",subType=NULL,gamma=0.5,power=2,probType="J", probAggregation="P_1",s=0.5,p=2,variableSelection=NULL,weights=NULL)
table.Symbolic |
symbolic data table |
type |
distance measure for boolean symbolic objects: H, U_2, U_3, U_4, C_1, SO_1, SO_2, SO_3, SO_4, SO_5; mixed symbolic objects: L_1, L_2 |
subType |
comparison function for C_1 and SO_1: D_1, D_2, D_3, D_4, D_5 |
gamma |
gamma parameter for U_2 and U_3, gamma [0, 0.5] |
power |
power parameter for U_2 and U_3; power [1, 2, 3, ..] |
probType |
distance measure for probabilistic symbolic objects: J, CHI, REN, CHER, LP |
probAggregation |
aggregation function for J, CHI, REN, CHER, LP: P_1, P_2 |
s |
parameter for Renyi (REN) and Chernoff (CHER) distance, s [0, 1) |
p |
parameter for Minkowski (LP) metric; p=1 - manhattan distance, p=2 - euclidean distance |
variableSelection |
numbers of variables used for calculation or NULL for all variables |
weights |
weights of variables for Minkowski (LP) metrics |
Distance measures for boolean symbolic objects:
H - Hausdorff's distance for objects described by interval-valued variables, U_2, U_3, U_4 - Ichino-Yaguchi's distance measures for objects described by interval-valued and/or multinominal variables, C_1, SO_1, SO_2, SO_3, SO_4, SO_5 - de Carvalho's distance measures for objects described by interval-valued and/or multinominal variables.
Distance measurement for probabilistic symbolic objects consists of two steps: 1. Calculation of distance between objects for each variable using componentwise distance measures: J (Kullback-Leibler divergence), CHI (Chi-2 divergence), REN (Renyi's divergence), CHER (Chernoff's distance), LP (modified Minkowski metrics). 2. Calculation of aggregative distance between objects based on componentwise distance measures using objectwise distance measure: P_1 (manhattan distance), P_2 (euclidean distance).
Distance measures for mixed symbolic objects - modified Minkowski metrics: L_1 (manhattan distance), L_2 (euclidean distance).
See file ../doc/dist_SDA.pdf for further details
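For a single pair of intervals, the H and Ichino-Yaguchi measures named above can be sketched in base R. The formulas are the commonly stated ones: the Hausdorff distance between [a1, a2] and [b1, b2] is max(|a1 - b1|, |a2 - b2|), and the Ichino-Yaguchi component is |A join B| - |A meet B| + gamma * (2|A meet B| - |A| - |B|). The helper names are hypothetical, the aggregation is unnormalized, and the results are not guaranteed to match dist_SDA numerically.

```r
# Hausdorff distance between two intervals given as c(lower, upper).
hausdorff_interval <- function(a, b) {
  max(abs(a[1] - b[1]), abs(a[2] - b[2]))
}

# Ichino-Yaguchi component for one interval-valued variable.
# join = length of the smallest interval covering both, meet = overlap length.
ichino_yaguchi_phi <- function(a, b, gamma = 0.5) {
  join <- max(a[2], b[2]) - min(a[1], b[1])
  meet <- max(0, min(a[2], b[2]) - max(a[1], b[1]))
  join - meet + gamma * (2 * meet - (a[2] - a[1]) - (b[2] - b[1]))
}

# Unnormalized Minkowski-type aggregation over variables (assumption:
# objects are matrices with one row per variable and columns lower/upper).
iy_distance <- function(A, B, gamma = 0.5, power = 2) {
  phi <- sapply(seq_len(nrow(A)), function(j) {
    ichino_yaguchi_phi(A[j, ], B[j, ], gamma)
  })
  sum(phi^power)^(1 / power)
}

a <- c(1, 3)
b <- c(2, 6)
h <- hausdorff_interval(a, b)               # max(|1 - 2|, |3 - 6|) = 3
phi <- ichino_yaguchi_phi(a, b, gamma = 0.5) # 5 - 1 + 0.5 * (2 - 2 - 4) = 2

A <- rbind(c(1, 3), c(0, 1))
B <- rbind(c(2, 6), c(0, 1))
d_ab <- iy_distance(A, B, gamma = 0.5, power = 2)  # sqrt(2^2 + 0^2) = 2
```

With gamma = 0 the component reduces to the length of the join minus the length of the meet; larger gamma values discount the part of that difference explained by the interval lengths themselves.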
NOTE: In the previous version of the package this function was called dist.SDA.
distance matrix of symbolic objects
Andrzej Dudek [email protected], Justyna Wilk [email protected] Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of Symbolic Data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
Ichino, M., & Yaguchi, H. (1994), Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Transactions on Systems, Man, and Cybernetics, 24(4), 698-708. Available at: doi:10.1109/21.286391.
Malerba D., Esposito F., Gioviale V., Tamma V. (2001), Comparing Dissimilarity Measures for Symbolic Data Analysis, "New Techniques and Technologies for Statistics" (ETK NTTS'01), pp. 473-481.
Malerba, D., Esposito, F., Monopoli, M. (2002), Comparing dissimilarity measures for probabilistic symbolic objects, In: A. Zanasi, C.A. Brebbia, N.F.F. Ebecken, P. Melli (Eds.), Data Mining III, "Series Management Information Systems", Vol. 6, WIT Press, Southampton, pp. 31-40.
DClust, index.G1d; dist.Symbolic in clusterSim library
# LONG RUNNING - UNCOMMENT TO RUN
#data("cars",package="symbolicDA")
#dist<-dist_SDA(cars, type="U_3", gamma=0.3, power=2)
#print(dist)
Draws optimal split based decision tree for symbolic objects
draw.decisionTree.SDA(decisionTree.SDA,boxWidth=1,boxHeight=3)
decisionTree.SDA |
optimal split based decision tree for symbolic objects (result of |
boxWidth |
width of single box in drawing |
boxHeight |
height of single box in drawing |
Draws optimal split based decision (classification) tree for symbolic objects.
A drawing of the optimal split based decision (classification) tree for symbolic objects.
Andrzej Dudek [email protected] Marcin Pełka [email protected]
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
# LONG RUNNING - UNCOMMENT TO RUN
# Files samochody.xml and wave.xml needed in this example
# can be found in /inst/xml library of package
# Example 1
#sda<-parse.SO("samochody")
#tree<-decisionTree.SDA(sda, "Typ_samochodu~.", testSet=26:33)
#draw.decisionTree.SDA(tree,boxWidth=1,boxHeight=3)
# Example 2
#sda<-parse.SO("wave")
#tree<-decisionTree.SDA(sda, "WaveForm~.", testSet=1:30)
#draw.decisionTree.SDA(tree,boxWidth=2,boxHeight=3)
generation of artificial symbolic data table with given cluster structure
generate.SO(numObjects,numClusters,numIntervalVariables,numMultivaluedVariables)
numObjects |
number of objects in each cluster |
numClusters |
number of clusters |
numIntervalVariables |
Number of symbolic interval variables in generated data table |
numMultivaluedVariables |
Number of symbolic multi-valued variables in generated data table |
data |
symbolic data table with given cluster structure |
clusters |
vector with cluster numbers for each object |
Andrzej Dudek [email protected]
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
User manual for SODAS 2 software, Software Report, Analysis System of Symbolic Official Data, Project no. IST-2000-25161, Paris.
see symbolic.object for symbolic data table R structure representation
# Example will be available in next version of package, thank You for your patience :-)
Carmone, Kara and Maxwell's Heuristic Identification of Noisy Variables (HINoV) method for symbolic data
HINoV.SDA(table.Symbolic, u=NULL, distance="H", Index="cRAND",method="pam",...)
table.Symbolic |
symbolic data table |
u |
number of clusters |
distance |
symbolic distance measure as parameter type in |
method |
clustering method: "single", "ward", "complete", "average", "mcquitty", "median", "centroid", "pam" (default), "SClust", "DClust" |
Index |
"cRAND" - adjusted Rand index (default); "RAND" - Rand index |
... |
additional argument passed to |
HINoV for symbolic data can be used with clustering methods based on a distance matrix, both hierarchical ("single", "ward", "complete", "average", "mcquitty", "median", "centroid") and optimization methods ("pam", "DClust"), as well as with methods based on the symbolic data table ("SClust").
See file ../doc/HINoVSDA_details.pdf for further details
parim |
m x m symmetric matrix (m - number of variables). Matrix contains pairwise adjusted Rand (or Rand) indices for partitions formed by the j-th variable with partitions formed by the l-th variable |
topri |
sum of rows of |
stopri |
ranked values of |
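The quantities parim, topri and stopri can be illustrated with a base R sketch of the HINoV idea on interval data. Everything here is an assumption made for illustration: the toy data, the per-variable Hausdorff distance, complete-linkage clustering, the hypothetical helper `adjusted_rand`, and the choice to exclude the diagonal when summing rows. It is not the HINoV.SDA implementation.

```r
# Adjusted Rand index of two partitions (Hubert and Arabie, 1985).
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  pairs_both <- sum(choose(tab, 2))
  pairs_a <- sum(choose(rowSums(tab), 2))
  pairs_b <- sum(choose(colSums(tab), 2))
  expected <- pairs_a * pairs_b / choose(n, 2)
  (pairs_both - expected) / ((pairs_a + pairs_b) / 2 - expected)
}

set.seed(42)
n <- 30
truth <- rep(1:3, each = 10)
# Variables 1 and 2 carry the cluster structure; variable 3 is noise.
lower <- cbind(truth * 10 + rnorm(n), truth * 5 + rnorm(n), runif(n, 0, 30))
x <- array(c(lower, lower + runif(n * 3, 0.5, 1.5)), dim = c(n, 3, 2))

u <- 3  # number of clusters
# Cluster the objects separately on each variable.
partitions <- sapply(seq_len(dim(x)[2]), function(j) {
  lo <- x[, j, 1]
  up <- x[, j, 2]
  d <- pmax(abs(outer(lo, lo, "-")), abs(outer(up, up, "-")))  # Hausdorff
  cutree(hclust(as.dist(d), method = "complete"), k = u)
})

# Pairwise adjusted Rand indices between the per-variable partitions.
m <- ncol(partitions)
parim <- matrix(0, m, m)
for (j in seq_len(m)) {
  for (l in seq_len(m)) {
    parim[j, l] <- adjusted_rand(partitions[, j], partitions[, l])
  }
}

# Row sums without the diagonal (assumption), then ranked in decreasing order.
topri <- rowSums(parim) - diag(parim)
stopri <- cbind(variable = order(topri, decreasing = TRUE),
                topri = sort(topri, decreasing = TRUE))
print(stopri)
```

Variables whose partitions agree with those of other variables get high topri values; a variable with a low value is a candidate noisy variable.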
Andrzej Dudek [email protected], Justyna Wilk [email protected] Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Bock, H.H., Diday, E. (eds.) (2000), Analysis of Symbolic Data. Explanatory Methods for Extracting Statistical Information from Complex Data, Springer-Verlag, Berlin.
Diday, E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
Carmone, F.J., Kara, A., Maxwell, S. (1999), HINoV: a new method to improve market segment definition by identifying noisy variables, "Journal of Marketing Research", November, vol. 36, 501-509.
Hubert, L.J., Arabie, P. (1985), Comparing partitions, "Journal of Classification", no. 1, 193-218. Available at: doi:10.1007/BF01908075.
Rand, W.M. (1971), Objective criteria for the evaluation of clustering methods, "Journal of the American Statistical Association", no. 336, 846-850. Available at: doi:10.1080/01621459.1971.10482356.
Walesiak, M., Dudek, A. (2008), Identification of noisy variables for nonmetric and symbolic data in cluster analysis, In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Eds.), Data analysis, machine learning and applications, Springer-Verlag, Berlin, Heidelberg, 85-92. Available at: doi:10.1007/978-3-540-78246-9_11.
DClust, SClust, dist_SDA; HINoV.Symbolic, dist.Symbolic in clusterSim library; hclust in stats library; pam in cluster library
# LONG RUNNING - UNCOMMENT TO RUN
#data("cars",package="symbolicDA")
#r<- HINoV.SDA(cars, u=3, distance="U_2")
#print(r$stopri)
#plot(r$stopri[,2], xlab="Variable number", ylab="topri",
#xaxt="n", type="b")
#axis(1,at=c(1:max(r$stopri[,1])),labels=r$stopri[,1])
Ichino's method for identifying non-noisy variables in symbolic data set
IchinoFS.SDA(table.Symbolic)
table.Symbolic |
symbolic data table |
See file ../doc/IchinoFSSDA_details.pdf for further details
plot |
plot of the gradient illustrating combinations of variables, in which the axis of ordinates (Y) represents the maximum number of mutual neighbor pairs and the axis of the abscissae (X) corresponds to the number of features (m) |
combination |
the best combination of variables, i.e. the combination most differentiating the set of objects |
maximum results |
step-by-step combinations of variables up to m variables |
calculation results |
.............. |
Andrzej Dudek [email protected], Justyna Wilk [email protected] Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Ichino, M. (1994), Feature selection for symbolic data classification, In: E. Diday, Y. Lechevallier, P.B. Schader, B. Burtschy (Eds.), New Approaches in Classification and data analysis, Springer-Verlag, pp. 423-429.
Bock, H.H., Diday, E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday, E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
HINoV.SDA; HINoV.Symbolic in clusterSim library
# LONG RUNNING - UNCOMMENT TO RUN
#data("cars",package="symbolicDA")
#sdt<-cars
#ichino<-IchinoFS.SDA(sdt)
#print(ichino)
Calculates Calinski-Harabasz pseudo F-statistic based on distance matrix
index.G1d (d,cl)
d |
distance matrix (see |
cl |
a vector of integers indicating the cluster to which each object is allocated |
See file ../doc/indexG1d_details.pdf for further details
value of Calinski-Harabasz pseudo F-statistic based on distance matrix
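One common way to obtain a Calinski-Harabasz pseudo F-statistic from distances alone is to express the total and within-cluster sums of squares through squared pairwise distances, which is exact for Euclidean distances. The sketch below follows that route in base R; the helper `ch_from_dist` is hypothetical and is not claimed to reproduce index.G1d, whose definition is given in the details file.

```r
# Pseudo F-statistic from a distance matrix, assuming sums of squares can be
# written as (sum of squared pairwise distances) / (number of objects).
ch_from_dist <- function(d, cl) {
  d2 <- as.matrix(d)^2
  n <- nrow(d2)
  groups <- unique(cl)
  k <- length(groups)
  # The full matrix counts every pair twice, hence the factor 2.
  total <- sum(d2) / (2 * n)
  within <- sum(sapply(groups, function(g) {
    idx <- which(cl == g)
    sum(d2[idx, idx]) / (2 * length(idx))
  }))
  between <- total - within
  (between / (k - 1)) / (within / (n - k))
}

pts <- c(0, 1, 10, 11)
G <- ch_from_dist(dist(pts), c(1, 1, 2, 2))
print(G)  # total = 101, within = 1, so (100 / 1) / (1 / 2) = 200
```

Higher values indicate compact, well separated clusters, which is why the statistic is typically maximized over the number of clusters.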
Andrzej Dudek [email protected], Justyna Wilk [email protected] Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Calinski, T., Harabasz, J. (1974), A dendrite method for cluster analysis, "Communications in Statistics", vol. 3, 1-27. Available at: doi:10.1080/03610927408827101.
Everitt, B.S., Landau, E., Leese, M. (2001), Cluster analysis, Arnold, London, p. 103. ISBN 9780340761199.
Gordon, A.D. (1999), Classification, Chapman & Hall/CRC, London, p. 62. ISBN 9781584880134.
Milligan, G.W., Cooper, M.C. (1985), An examination of procedures for determining the number of clusters in a data set, "Psychometrika", vol. 50, no. 2, 159-179. Available at: doi:10.1007/BF02294245.
Diday, E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester, pp. 236-262.
Dudek, A. (2007), Cluster Quality Indexes for Symbolic Classification. An Examination, In: H.H.-J. Lenz, R. Decker (Eds.), Advances in Data Analysis, Springer-Verlag, Berlin, pp. 31-38. Available at: doi:10.1007/978-3-540-70981-7_4.
DClust, SClust; index.G2, index.G3, index.S, index.H, index.KL, index.Gap, index.DB in clusterSim library
# LONG RUNNING - UNCOMMENT TO RUN
# Example 1
#library(stats)
#data("cars",package="symbolicDA")
#x<-cars
#d<-dist_SDA(x, type="U_2")
#wynik<-hclust(d, method="ward", members=NULL)
#clusters<-cutree(wynik, 4)
#G1d<-index.G1d(d, clusters)
#print(G1d)
# Example 2
#library(cluster) # provides pam
#data("cars",package="symbolicDA")
#md <- dist_SDA(cars, type="U_3", gamma=0.5, power=2)
# nc - number_of_clusters
#min_nc=2
#max_nc=10
#res <- array(0,c(max_nc-min_nc+1,2))
#res[,1] <- min_nc:max_nc
#clusters <- NULL
#for (nc in min_nc:max_nc)
#{
#cl2 <- pam(md, nc, diss=TRUE)
#res[nc-min_nc+1,2] <- G1d <- index.G1d(md,cl2$clustering)
#clusters <- rbind(clusters, cl2$clustering)
#}
#print(paste("max G1d for",(min_nc:max_nc)[which.max(res[,2])],"clusters=",max(res[,2])))
#print("clustering for max G1d")
#print(clusters[which.max(res[,2]),])
#write.table(res,file="G1d_res.csv",sep=";",dec=",",row.names=TRUE,col.names=FALSE)
#plot(res, type="p", pch=0, xlab="Number of clusters", ylab="G1d", xaxt="n")
#axis(1, c(min_nc:max_nc))
Multidimensional scaling for symbolic interval data - InterScal algorithm
interscal.SDA(x,d=2,calculateDist=FALSE)
x |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
d |
Dimensionality of reduced space |
calculateDist |
if TRUE, x is treated as raw data and the min-max distance matrix is calculated. See details |
Interscal is an adaptation of the well-known classical multidimensional scaling to symbolic data. The input for Interscal is an interval-valued dissimilarity matrix. Such a dissimilarity matrix can be obtained from a symbolic data matrix (containing only interval-valued variables) or from judgements obtained from experts or respondents. See Lechevallier Y. (2001) for details on calculating interval-valued distances. See file ../doc/Symbolic_MDS.pdf for further details
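As a rough illustration of what an interval-valued dissimilarity matrix can look like, the sketch below derives lower and upper Euclidean distances between hyperrectangles stored in the simple 3-dimensional form. It is an assumption about how a min-max construction may work, not the package's own code; the helper name `minmax_dist` and the toy data are hypothetical.

```r
# Hypothetical helper: lower/upper Euclidean distances between hyperrectangles.
# x is an n x p x 2 array: x[i, j, 1] = lower bound, x[i, j, 2] = upper bound.
minmax_dist <- function(x) {
  n <- dim(x)[1]
  lower <- matrix(0, n, n)
  upper <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) next  # diagonal is left at 0 in this sketch
      li <- x[i, , 1]; ui <- x[i, , 2]
      lj <- x[j, , 1]; uj <- x[j, , 2]
      # smallest per-variable gap (0 when the two intervals overlap)
      gap_min <- pmax(0, li - uj, lj - ui)
      # largest per-variable gap between any two points of the intervals
      gap_max <- pmax(ui - lj, uj - li)
      lower[i, j] <- sqrt(sum(gap_min^2))
      upper[i, j] <- sqrt(sum(gap_max^2))
    }
  }
  list(lower = lower, upper = upper)
}

# Toy data: 3 objects described by 2 interval variables
x <- array(0, dim = c(3, 2, 2))
x[, , 1] <- rbind(c(0, 0), c(3, 0), c(0, 5))  # lower bounds
x[, , 2] <- rbind(c(1, 1), c(4, 1), c(1, 6))  # upper bounds
d <- minmax_dist(x)
```

Both matrices are symmetric, and every lower distance is no larger than the matching upper distance.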
xprim |
coordinates of rectangles |
stress.sym |
final STRESSSym value |
Andrzej Dudek, Marcin Pełka
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
Lechevallier Y. (ed.), Scientific report for unsupervised classification, validation and cluster analysis, Analysis System of Symbolic Official Data - Project Number IST-2000-25161, project report.
# LONG RUNNING - UNCOMMENT TO RUN
#sda<-parse.SO("samochody")
#data<-sda$indivIC
#mds<-interscal.SDA(data, d=2, calculateDist=TRUE)
Multidimensional scaling for symbolic interval data - IScal algorithm
iscal.SDA(x,d=2,calculateDist=FALSE)
x |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
d |
Dimensionality of reduced space |
calculateDist |
if TRUE, x is treated as raw data and the min-max distance matrix is calculated. See details |
IScal, proposed by Groenen et al. (2006), is an adaptation of the well-known nonmetric multidimensional scaling to symbolic data. It is an iterative algorithm that uses the I-STRESS objective function. This function is normalized to the range [0; 1] and can be interpreted like classical STRESS values. IScal, like Interscal and SymScal, requires an interval-valued dissimilarity matrix. Such a dissimilarity matrix can be obtained from a symbolic data matrix (containing only interval-valued variables) or from judgements obtained from experts or respondents. See Lechevallier Y. (2001) for details on calculating interval-valued distances. See file ../doc/Symbolic_MDS.pdf for further details
xprim |
coordinates of rectangles |
STRESSSym |
final STRESSSym value |
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
Groenen P.J.F, Winsberg S., Rodriguez O., Diday E. (2006), I-Scal: multidimensional scaling of interval dissimilarities, Computational Statistics and Data Analysis, 51, pp. 360-378. Available at: doi:10.1016/j.csda.2006.04.003.
Lechevallier Y. (ed.), Scientific report for unsupervised classification, validation and cluster analysis, Analysis System of Symbolic Official Data - Project Number IST-2000-25161, project report.
# Example will be available in next version of package, thank You for your patience :-)
Kernel discriminant analysis for symbolic data
kernel.SDA(sdt,formula,testSet,h,...)
sdt |
symbolic data table |
formula |
a formula, as in the |
testSet |
vector with the numbers of the objects in the test set |
h |
kernel bandwidth size |
... |
arguments passed to the dist_SDA function |
Kernel discriminant analysis for symbolic data is based on an intensity estimator (itself based on a dissimilarity measure for symbolic data), because the classical density estimator cannot be applied: symbolic objects are not elements of Euclidean space, and the integral operator is not applicable to symbolic data.
For further details see ../doc/Kernel_SDA.pdf.pdf
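The idea of classifying with a dissimilarity-based intensity estimator can be sketched in base R as follows. This is only an illustration under stated assumptions: a Gaussian-shaped kernel, priors proportional to class size, and an ordinary `dist` matrix standing in for a symbolic dissimilarity. The helper `kernel_classify` is hypothetical and is not the code behind kernel.SDA.

```r
# Hypothetical helper: classify test objects from a dissimilarity matrix.
# d       - dissimilarities between all objects (learning and test)
# classes - class labels of all objects (only learning-set labels are used)
# test    - indices of the test objects
# h       - kernel bandwidth
kernel_classify <- function(d, classes, test, h) {
  d <- as.matrix(d)
  train <- setdiff(seq_len(nrow(d)), test)
  labels <- sort(unique(classes[train]))
  # prior taken proportional to class size (an assumption of this sketch)
  prior <- as.numeric(table(factor(classes[train], levels = labels))) / length(train)
  sapply(test, function(i) {
    # intensity of each class at object i: mean kernel weight over its members
    intensity <- sapply(labels, function(cl) {
      members <- train[classes[train] == cl]
      mean(exp(-(d[i, members] / h)^2 / 2))
    })
    labels[which.max(prior * intensity)]
  })
}

# Toy data: two well-separated groups; objects 4 and 8 form the test set
pts <- c(0, 0.2, 0.4, 0.3, 5, 5.2, 5.4, 5.1)
classes <- c("a", "a", "a", "a", "b", "b", "b", "b")
d <- dist(pts)
pred <- kernel_classify(d, classes, test = c(4, 8), h = 0.75)
```

A smaller h makes the estimate more local, while a larger h smooths over more distant learning objects.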
vector of class memberships of each object in the test set
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
# Example 1
# LONG RUNNING - UNCOMMENT TO RUN
#sda<-parse.SO("samochody")
#model<-kernel.SDA(sda, "Typ_samochodu~.", testSet=6:16, h=0.75)
#print(model)
Kohonen's self-organizing maps for a set of symbolic objects described by interval-valued variables
kohonen.SDA(data, rlen=100, alpha=c(0.05,0.01))
data |
symbolic data table in simple form (see |
rlen |
number of iterations (the number of times the complete data set will be presented to the network) |
alpha |
learning rate, determining the size of the adjustments during training. Default is to decline linearly from 0.05 to 0.01 over rlen updates |
See file ../doc/kohonenSDA_details.pdf for further details
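The default learning-rate schedule described for alpha (a linear decline from 0.05 to 0.01 over rlen updates) can be written out explicitly. This is a sketch of the schedule only, assuming one rate per presentation of the data set; the variable name `rate` is hypothetical and no map is trained here.

```r
rlen <- 100
alpha <- c(0.05, 0.01)

# learning rate used at each of the rlen presentations of the data set
rate <- seq(alpha[1], alpha[2], length.out = rlen)

head(rate, 3)  # largest adjustments at the start of training
tail(rate, 3)  # smallest adjustments at the end
```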
clas |
vector of mini-class memberships of the objects in a test set |
prot |
prototypes |
Andrzej Dudek, Justyna Wilk
Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Kohonen, T. (1995), Self-Organizing Maps, Springer, Berlin-Heidelberg.
Bock, H.H. (2001), Clustering Algorithms and Kohonen Maps for Symbolic Data, International Conference on New Trends in Computational Statistics with Biomedical Applications, ICNCB Proceedings, Osaka, pp. 203-215.
Bock, H.H., Diday, E. (eds.) (2000), Analysis of Symbolic Data. Explanatory Methods for Extracting Statistical Information from Complex Data, Springer-Verlag, Berlin.
Diday, E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester, pp. 373-392.
SO2Simple; som in kohonen library
# Example will be available in next version of package, thank You for your patience :-)
Reading a symbolic data table from an XML file (ASSO format)
parse.SO(file)
file |
file name without xml extension |
See symbolic.object for the R structure representation of a symbolic data table
Symbolic data table parsed from XML file
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/clusterSim/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
#cars<-parse.SO("cars")
Principal component analysis for symbolic objects described by symbolic interval variables. Centers algorithm
PCA.centers.SDA(t,pc.number=2)
t |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
pc.number |
number of principal components |
See file ../doc/PCA_SDA.pdf for further details
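A minimal sketch of the centers idea, assuming the usual formulation in which a classical PCA is run on the interval midpoints and each interval is then projected onto the principal axes. The package's own implementation (for example, whether variables are standardized) may differ; the helper name `pca_centers_sketch` is hypothetical.

```r
# Hypothetical helper: centers method for interval-valued PCA.
# x is an n x p x 2 array of lower (x[, , 1]) and upper (x[, , 2]) bounds.
pca_centers_sketch <- function(x, pc.number = 2) {
  centers <- (x[, , 1] + x[, , 2]) / 2
  mu <- colMeans(centers)
  # principal axes of the covariance matrix of the interval midpoints
  axes <- eigen(cov(centers), symmetric = TRUE)$vectors[, seq_len(pc.number), drop = FALSE]
  lo <- sweep(x[, , 1], 2, mu)  # centred lower bounds
  up <- sweep(x[, , 2], 2, mu)  # centred upper bounds
  out <- array(0, dim = c(dim(x)[1], pc.number, 2))
  for (v in seq_len(pc.number)) {
    a <- sweep(lo, 2, axes[, v], `*`)
    b <- sweep(up, 2, axes[, v], `*`)
    out[, v, 1] <- rowSums(pmin(a, b))  # lowest projected coordinate
    out[, v, 2] <- rowSums(pmax(a, b))  # highest projected coordinate
  }
  out
}

# Toy data: 5 objects, 3 interval variables
set.seed(1)
lower <- matrix(rnorm(15), nrow = 5, ncol = 3)
x <- array(c(lower, lower + runif(15)), dim = c(5, 3, 2))
z <- pca_centers_sketch(x, pc.number = 2)
dim(z)
```

The result has the same simple form as the input, with pc.number components in place of the original variables.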
Data in reduced space (symbolic interval data: a 3-dimensional table)
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
PCA.mrpca.SDA, PCA.spaghetti.SDA, PCA.spca.SDA, PCA.vertices.SDA
# Example will be available in next version of package, thank You for your patience :-)
Principal component analysis for symbolic objects described by symbolic interval variables. Midpoints and radii algorithm
PCA.mrpca.SDA(t,pc.number=2)
t |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
pc.number |
number of principal components |
See file ../doc/PCA_SDA.pdf for further details
Data in reduced space (symbolic interval data: a 3-dimensional table)
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
PCA.centers.SDA, PCA.spaghetti.SDA, PCA.spca.SDA, PCA.vertices.SDA
# Example will be available in next version of package, thank You for your patience :-)
Principal component analysis for symbolic objects described by symbolic interval variables. Spaghetti algorithm
PCA.spaghetti.SDA(t,pc.number=2)
t |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
pc.number |
number of principal components |
See file ../doc/PCA_SDA.pdf for further details
Data in reduced space (symbolic interval data: a 3-dimensional table)
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
PCA.centers.SDA, PCA.mrpca.SDA, PCA.spca.SDA, PCA.vertices.SDA
# Example will be available in next version of package, thank You for your patience :-)
Principal component analysis for symbolic objects described by symbolic interval variables. 'Symbolic' PCA algorithm
PCA.spca.SDA(t,pc.number=2)
t |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
pc.number |
number of principal components |
See file ../doc/PCA_SDA.pdf for further details
Data in reduced space (symbolic interval data: a 3-dimensional table)
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
PCA.centers.SDA, PCA.mrpca.SDA, PCA.spaghetti.SDA, PCA.vertices.SDA
# Example will be available in next version of package, thank You for your patience :-)
Principal component analysis for symbolic objects described by symbolic interval variables. Vertices algorithm
PCA.vertices.SDA(t,pc.number=2)
t |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
pc.number |
number of principal components |
See file ../doc/PCA_SDA.pdf for further details
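In vertices-type methods each object is commonly represented by the 2^p corner points of its hyperrectangle before a classical PCA is applied. The sketch below shows only that expansion step in base R; the helper name `vertices_of` and the toy array are hypothetical, and how PCA.vertices.SDA proceeds internally is not shown here.

```r
# Hypothetical helper: all 2^p vertices of one object's hyperrectangle.
# bounds is a p x 2 matrix (column 1 = lower bounds, column 2 = upper bounds).
vertices_of <- function(bounds) {
  per_var <- lapply(seq_len(nrow(bounds)), function(j) bounds[j, ])
  as.matrix(expand.grid(per_var, KEEP.OUT.ATTRS = FALSE))
}

# Toy data in simple form: 2 objects, 2 interval variables
x <- array(c(0, 2, 10, 20,   # lower bounds
             1, 3, 11, 21),  # upper bounds
           dim = c(2, 2, 2))

v <- vertices_of(x[1, , ])   # 4 vertices of the first object

# stacked vertices matrix for all objects: n * 2^p rows, p columns
all_v <- do.call(rbind, lapply(seq_len(dim(x)[1]), function(i) vertices_of(x[i, , ])))
dim(all_v)
```

The number of rows doubles with every added variable, so this expansion is only practical for a moderate number of variables.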
Data in reduced space (symbolic interval data: a 3-dimensional table)
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
PCA.centers.SDA, PCA.mrpca.SDA, PCA.spaghetti.SDA, PCA.spca.SDA
# Example will be available in next version of package, thank You for your patience :-)
Random forest algorithm for optimal split based decision tree for symbolic objects
random.forest.SDA(sdt,formula,testSet, mfinal = 100,...)
sdt |
Symbolic data table |
formula |
formula as in lm function |
testSet |
a vector of integers indicating the classes to which objects are allocated in the learning set |
mfinal |
number of partial models generated |
... |
arguments passed to decisionTree.SDA function |
random.forest.SDA implements Breiman's random forest algorithm for the classification of a symbolic data set.
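The ensemble step shared by bagging and random forests (bootstrap samples for the partial models, then a majority vote) can be sketched in base R. No trees are fitted here, the vote matrix is made up for illustration, and all names are hypothetical; a full random forest in Breiman's sense also randomizes the variables considered at each split, which this sketch does not show.

```r
set.seed(42)
n <- 10        # objects in the learning set
mfinal <- 5    # number of partial models

# one bootstrap sample of object indices per partial model
boot <- replicate(mfinal, sample(n, n, replace = TRUE), simplify = FALSE)

# suppose the partial models returned these labels for 3 test objects
# (rows = models, columns = test objects)
votes <- rbind(c("a", "b", "a"),
               c("a", "b", "b"),
               c("b", "b", "a"),
               c("a", "a", "a"),
               c("a", "b", "a"))

# final class of each test object = most frequent label across the models
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
```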
Andrzej Dudek, Marcin Pełka
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
bagging.SDA, boosting.SDA, decisionTree.SDA
# Example will be available in next version of package, thank You for your patience :-)
Replication analysis for cluster validation of symbolic data
replication.SDA(table.Symbolic, u=2, method="SClust", S=10, fixedAsample=NULL, ...)
table.Symbolic |
symbolic data table |
u |
number of clusters given arbitrarily |
method |
clustering method: "SClust" (default), "DClust", "single", "complete", "average", "mcquitty", "median", "centroid", "ward", "pam", "diana" |
S |
the number of simulations used to compute average adjusted Rand index |
fixedAsample |
if NULL A sample is generated randomly, otherwise this parameter contains object numbers arbitrarily assigned to A sample |
... |
additional argument passed to |
See file ../doc/replicationSDA_details.pdf for further details
A |
3-dimensional array containing data matrices for A sample of objects in each simulation (first dimension represents simulation number, second - object number, third - variable number) |
B |
3-dimensional array containing data matrices for B sample of objects in each simulation (first dimension represents simulation number, second - object number, third - variable number) |
medoids |
3-dimensional array containing matrices of observations on u representative objects (medoids) for A sample of objects in each simulation (first dimension represents simulation number, second - cluster number, third - variable number) |
clusteringA |
2-dimensional array containing cluster numbers for A sample of objects in each simulation (first dimension represents simulation number, second - object number) |
clusteringB |
2-dimensional array containing cluster numbers for B sample of objects in each simulation (first dimension represents simulation number, second - object number) |
clusteringBB |
2-dimensional array containing cluster numbers for B sample of objects in each simulation according to step 4 of the replication analysis procedure (first dimension represents simulation number, second - object number) |
cRand |
value of average adjusted Rand index for S simulations |
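For orientation, the adjusted Rand index of Hubert and Arabie (1985) that cRand averages can be computed in base R as below. This is a sketch for illustration; the helper name `adjusted_rand` is hypothetical, and the random partitions in the last lines merely stand in for the clusterings compared in the S simulations.

```r
# Hypothetical helper: Hubert-Arabie adjusted Rand index of two partitions.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  pairs <- function(k) k * (k - 1) / 2   # number of object pairs, choose(k, 2)
  sum_cells <- sum(pairs(tab))
  sum_rows  <- sum(pairs(rowSums(tab)))
  sum_cols  <- sum(pairs(colSums(tab)))
  expected  <- sum_rows * sum_cols / pairs(length(a))
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

# the index ignores how the clusters are labelled
same <- adjusted_rand(c(1, 1, 2, 2, 3, 3), c(2, 2, 3, 3, 1, 1))

# average over S pairs of partitions
S <- 10
set.seed(123)
avg <- mean(replicate(S, adjusted_rand(sample(1:3, 30, TRUE), sample(1:3, 30, TRUE))))
```

Values near 1 indicate that the two partitions agree; values near 0 are what chance agreement alone would give.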
Andrzej Dudek, Justyna Wilk
Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Breckenridge, J.N. (2000), Validating cluster analysis: consistent replication and symmetry, "Multivariate Behavioral Research", 35 (2), 261-285. Available at: doi:10.1207/S15327906MBR3502_5.
Gordon, A.D. (1999), Classification, Chapman and Hall/CRC, London. ISBN 9781584880134.
Hubert, L., Arabie, P. (1985), Comparing partitions, "Journal of Classification", no. 1, 193-218. Available at: doi:10.1007/BF01908075.
Milligan, G.W. (1996), Clustering validation: results and implications for applied analyses, In P. Arabie, L.J. Hubert, G. de Soete (Eds.), Clustering and classification, World Scientific, Singapore, 341-375. ISBN 9789810212872.
Bock H.H., Diday E. (eds.) (2000), Analysis of Symbolic Data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
dist_SDA, SClust, DClust; hclust in stats library; pam in cluster library; replication.Mod in clusterSim library
#data("cars",package="symbolicDA")
#set.seed(123)
#w<-replication.SDA(cars, u=3, method="SClust", S=10)
#print(w)
Reads a symbolic data table from a CSV file or converts an RSDA object to a SymbolicDA object of class "symbolic"
RSDA2SymbolicDA(rsda.object=NULL,from.csv=F,file=NULL , header = TRUE, sep, dec, row.names = NULL)
rsda.object |
object of class "symb.data.table" (from the former RSDA package) |
from.csv |
logical; whether to read the data from a CSV file (see file) rather than convert rsda.object |
file |
optional, the name of the CSV file in RSDA format (see details) |
header |
As in R function read.table |
sep |
As in R function read.table |
dec |
As in R function read.table |
row.names |
As in R function read.table |
(As in the former RSDA package.) The label $C means that a continuous variable follows, $I an interval variable, $H a histogram variable and $S a set variable. In the first row each label should be followed by the name of the variable and, for histogram and set variables, by the names of the modalities (categories). In data rows, continuous variables have just one value; interval variables have the minimum and the maximum of the interval; histogram variables have the number of modalities followed by the probability of each modality; and set variables have the cardinality of the set followed by the elements of the set.
The CSV file should be formatted like:
$C F1 $I F2 F2 $H F3 M1 M2 M3 $S F4 E1 E2 E3 E4
Case1 $C 2.8 $I 1 2 $H 3 0.1 0.7 0.2 $S 4 e g k i
Case2 $C 1.4 $I 3 9 $H 3 0.6 0.3 0.1 $S 4 a b c d
Case3 $C 3.2 $I -1 4 $H 3 0.2 0.2 0.6 $S 4 2 1 b c
Case4 $C -2.1 $I 0 2 $H 3 0.9 0.0 0.1 $S 4 3 4 c a
Case5 $C -3.0 $I -4 -2 $H 3 0.6 0.0 0.4 $S 4 e i g k
The internal format is:
$N
[1] 5
$M
[1] 4
$sym.obj.names
[1] 'Case1' 'Case2' 'Case3' 'Case4' 'Case5'
$sym.var.names
[1] 'F1' 'F2' 'F3' 'F4'
$sym.var.types
[1] '$C' '$I' '$H' '$S'
$sym.var.length
[1] 1 2 3 4
$sym.var.starts
[1] 2 4 8 13
$meta
$C F1 $I F2 F2 $H F3 M1 M2 M3 $S F4 E1 E2 E3 E4
Case1 $C 2.8 $I 1 2 $H 3 0.1 0.7 0.2 $S 4 e g k i
Case2 $C 1.4 $I 3 9 $H 3 0.6 0.3 0.1 $S 4 a b c d
Case3 $C 3.2 $I -1 4 $H 3 0.2 0.2 0.6 $S 4 2 1 b c
Case4 $C -2.1 $I 0 2 $H 3 0.9 0.0 0.1 $S 4 3 4 c a
Case5 $C -3.0 $I -4 -2 $H 3 0.6 0.0 0.4 $S 4 e i g k
$data
F1 F2 F2.1 M1 M2 M3 E1 E2 E3 E4
Case1 2.8 1 2 0.1 0.7 0.2 e g k i
Case2 1.4 3 9 0.6 0.3 0.1 a b c d
Case3 3.2 -1 4 0.2 0.2 0.6 2 1 b c
Case4 -2.1 0 2 0.9 0.0 0.1 3 4 c a
Case5 -3.0 -4 -2 0.6 0.0 0.4 e i g k
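The relationship between the header row and the fields $sym.var.types, $sym.var.length and $sym.var.starts shown above can be reproduced with a few lines of base R. This is a sketch of the bookkeeping only, assuming the header has already been split into tokens; the variable names are hypothetical and this is not the parser used by RSDA2SymbolicDA.

```r
# header row of the RSDA-style CSV file, split into tokens
header <- c("$C", "F1", "$I", "F2", "F2", "$H", "F3", "M1", "M2", "M3",
            "$S", "F4", "E1", "E2", "E3", "E4")

marker_pos <- which(header %in% c("$C", "$I", "$H", "$S"))
types <- header[marker_pos]
var_names <- header[marker_pos + 1]
block_end <- c(marker_pos[-1] - 1, length(header))

# $C and $I values start right after the marker; for $H and $S the column
# after the marker holds the number of modalities or the set cardinality
has_count <- types %in% c("$H", "$S")
starts <- marker_pos + 1 + has_count
lengths <- block_end - starts + 1

data.frame(name = var_names, type = types, start = starts, length = lengths)
```

For the example header this gives starts 2, 4, 8, 13 and lengths 1, 2, 3, 4, matching the internal format listed above.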
Returns a symbolic data table in the form of a SymbolicDA object of class "symbolic".
Andrzej Dudek
With ideas from RSDA package by Oldemar Rodriguez Rojas
Bock H.H., Diday E. (eds.) (2000), Analysis of Symbolic Data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
display.sym.table
# Example will be available in next version of package, thank You for your patience :-)
Saves a symbolic data table of class 'symbolic' to an XML file (ASSO format)
save.SO(sdt,file)
sdt |
Symbolic data table |
file |
file name with extension |
See symbolic.object for the R structure representation of a symbolic data table
No value returned
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
generate.SO, subsdt.SDA, parse.SO
#data("cars",package="symbolicDA")
#save.SO(cars,file="cars_backup.xml")
Dynamical clustering of symbolic data based on symbolic data table
SClust(table.Symbolic, cl, iter=100, variableSelection=NULL, objectSelection=NULL)
table.Symbolic |
symbolic data table |
cl |
number of clusters or vector with initial prototypes of clusters |
iter |
maximum number of iterations |
variableSelection |
vector of numbers of variables to use in clustering procedure or NULL for all variables |
objectSelection |
vector of numbers of objects to use in clustering procedure or NULL for all objects |
See file ../doc/SClust_details.pdf for further details
a vector of integers indicating the cluster to which each object is allocated
Andrzej Dudek, Justyna Wilk
Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Bock, H.H., Diday, E. (eds.) (2000), Analysis of Symbolic Data. Explanatory Methods for Extracting Statistical Information from Complex Data, Springer-Verlag, Berlin.
Diday, E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester, pp. 185-191.
Verde, R. (2004), Clustering Methods in Symbolic Data Analysis, In: D. Banks, L. House, E. R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, clustering and Data mining applications, Springer-Verlag, Heidelberg, pp. 299-317.
Diday, E. (1971), La methode des Nuees dynamiques, Revue de Statistique Appliquee, Vol. 19-2, pp. 19-34.
Celeux, G., Diday, E., Govaert, G., Lechevallier, Y., Ralambondrainy, H. (1988), Classification Automatique des Donnees, Environnement Statistique et Informatique - Dunod, Gauthier-Villars, Paris.
DClust; kmeans in stats library
# LONG RUNNING - UNCOMMENT TO RUN
#data("cars",package="symbolicDA")
#sdt<-cars
#clust<-SClust(sdt, cl=3, iter=50)
#print(clust)
Change of representation of symbolic data from simple form to symbolic data table
simple2SO(x)
x |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals |
See symbolic.object for the R structure representation of a symbolic data table
Symbolic data table in full form
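The simple form described for argument x is an ordinary 3-dimensional R array. The sketch below builds one from made-up lower and upper bounds; the object and variable names are hypothetical, and whether simple2SO makes use of dimnames is not stated in this documentation.

```r
# lower and upper bounds for 3 objects described by 2 interval variables
lower <- rbind(c(1.0, 10), c(2.5, 12), c(0.5, 8))
upper <- rbind(c(2.0, 15), c(3.0, 20), c(1.5, 9))
stopifnot(all(lower <= upper))  # every interval must be well formed

# first dimension: objects, second: variables, third: lower/upper bound
x <- array(c(lower, upper),
           dim = c(nrow(lower), ncol(lower), 2),
           dimnames = list(paste0("obj", 1:3), c("v1", "v2"), c("lower", "upper")))

x["obj2", "v2", ]  # interval of variable v2 for the second object
```

An array shaped like this matches the description of the input expected by simple2SO and of the value returned by SO2Simple.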
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
SO2Simple
# Example will be available in next version of package, thank You for your patience :-)
Change of representation of symbolic data from symbolic data table to simple form
SO2Simple(sd)
sd |
Symbolic data table in full form |
See symbolic.object for the R structure representation of a symbolic data table
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
simple2SO
# Example will be available in next version of package, thank You for your patience :-)
This method creates a symbolic data table containing only the objects whose indices are given in the second argument
subsdt.SDA(sdt,objectSelection)
sdt |
Symbolic data table |
objectSelection |
vector containing symbolic object numbers, default value - all objects from sdt |
See symbolic.object for the R structure representation of a symbolic data table
Symbolic data table containing only the objects whose indices are given in the second argument. The result is of 'symbolic' class
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
# Example will be available in next version of package, thank You for your patience :-)
These are objects representing symbolic data table structure
For all fields, the symbol N.A. denotes a value that is not available.
For further details see ../doc/SDA.pdf
individuals |
data frame with one row for each row in symbolic data table with following columns:
|
variables |
data frame with one row for each column in symbolic data table with following columns:
|
detailsC |
data frame describing details of symbolic continuous (metric, single-valued) variables, with the following columns:
|
detailsIC |
data frame describing details of symbolic inter-continuous (symbolic interval) variables, with the following columns:
|
detailsN |
data frame describing symbolic nominal and multi nominal variables details with following columns:
|
detailsListNom |
data frame describing every category of symbolic nominal and multi nominal variables, with following columns:
|
detailsNM |
data frame describing details of symbolic multi nominal modif (category sets with weights) variables, with the following columns:
|
detailsListNomModif |
data frame describing every category of symbolic multi nominal modif variables, with the following columns:
|
indivIC |
array of realizations of symbolic interval variables, with dimensions nr_of_objects X nr_of_variables X 2, containing the beginnings and ends of intervals for a given object and variable. For variables of a type other than symbolic interval the array contains zeros |
indivC |
array of realizations of symbolic continuous variables, with dimensions nr_of_objects X nr_of_variables X 1, containing single values - realizations of a variable on a symbolic object. For variables of a type other than symbolic continuous the array contains zeros |
indivN |
data frame describing realizations of symbolic nominal and multi nominal variables, with the following columns:
When this data frame contains line i,j,k it means that category k belongs to set that is realization of j-th symbolic variable on i-th symbolic object. |
indivNM |
data frame describing realizations of symbolic multi nominal modif variables, with the following columns:
When this data frame contains line i,j,k,w it means that category k belongs to set that is realization of j-th symbolic variable on i-th symbolic object with weight(probability) w. |
The components listed above must be included in a legitimate symbolic object.
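The array and long-format layouts described for indivIC and indivN can be illustrated with a small base R sketch. It does not call the package: all values are made up, and the column names `i`, `j` and `k` are hypothetical stand-ins, since the actual column names are not listed here.

```r
# Toy interval array in the indivIC layout: objects x variables x 2.
# All values are invented for illustration only.
n_objects <- 3
n_variables <- 2
indivIC <- array(0, dim = c(n_objects, n_variables, 2))
indivIC[, 1, 1] <- c(1.0, 2.5, 4.0)   # lower bounds of variable 1
indivIC[, 1, 2] <- c(2.0, 3.5, 6.0)   # upper bounds of variable 1
# Variable 2 stays at zero, as described for non-interval variables.

# Interval of variable 1 on object 2, returned as c(lower, upper)
interval_2_1 <- indivIC[2, 1, ]

# Toy long-format table in the spirit of indivN: a row (i, j, k) says that
# category k belongs to the realization of variable j on object i.
# The column names i, j, k are hypothetical, not the package's own names.
indivN <- data.frame(
  i = c(1, 1, 2, 3),
  j = c(2, 2, 2, 2),
  k = c(1, 3, 2, 1)
)

# Category set of variable 2 on object 1
categories_1_2 <- indivN$k[indivN$i == 1 & indivN$j == 2]
```

Keeping intervals in a dense array and category sets in long format means that a multi-nominal realization with any number of categories needs no padding: each category is simply one more row.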
Multidimensional scaling for symbolic interval data - symScal algorithm
symscal.SDA(x,d=2,calculateDist=FALSE)
x |
symbolic interval data: a 3-dimensional table, first dimension represents object number, second dimension - variable number, and third dimension contains lower- and upper-bounds of intervals (Simple form of symbolic data table) |
d |
Dimensionality of reduced space |
calculateDist |
if TRUE, x is treated as raw data and the min-max distance matrix is calculated. See details |
SymScal, proposed by Groenen et al. (2006), is an adaptation of the well-known nonmetric multidimensional scaling to symbolic data. It is an iterative algorithm that uses an unnormalized STRESS objective function. IScal, like Interscal and SymScal, requires an interval-valued dissimilarity matrix. Such a dissimilarity matrix can be obtained from a symbolic data matrix (one that contains only interval-valued variables) or from judgements of experts or respondents. See Lechevallier Y. (2001) for details on calculating interval-valued distances. See the file ../doc/Symbolic_MDS.pdf for further details.
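As a rough illustration of what an interval-valued (min-max) dissimilarity matrix can look like, the base R sketch below computes the smallest and largest Euclidean distance between two objects treated as hyperrectangles. This is an assumption made for illustration: the function `interval_minmax_dist` is hypothetical, is not part of the package, and may differ from what symscal.SDA computes when calculateDist=TRUE.

```r
# Lower and upper Euclidean distances between interval-valued objects.
# x: array objects x variables x 2 (lower bounds in [, , 1], upper in [, , 2]).
# Illustrative sketch only; not the code used by symscal.SDA.
interval_minmax_dist <- function(x) {
  n <- dim(x)[1]
  lower <- matrix(0, n, n)
  upper <- matrix(0, n, n)
  for (a in seq_len(n)) {
    for (b in seq_len(n)) {
      if (a == b) next  # diagonal is left at 0 by convention in this sketch
      lo_a <- x[a, , 1]; up_a <- x[a, , 2]
      lo_b <- x[b, , 1]; up_b <- x[b, , 2]
      # Per-variable gap between the two intervals (0 when they overlap)
      gap <- pmax(0, lo_a - up_b, lo_b - up_a)
      # Per-variable largest separation between any two interval points
      span <- pmax(abs(up_a - lo_b), abs(up_b - lo_a))
      lower[a, b] <- sqrt(sum(gap^2))
      upper[a, b] <- sqrt(sum(span^2))
    }
  }
  list(lower = lower, upper = upper)
}

# Two toy objects described by two interval variables
x <- array(0, dim = c(2, 2, 2))
x[1, , 1] <- c(0, 0); x[1, , 2] <- c(1, 1)   # object 1: [0,1] x [0,1]
x[2, , 1] <- c(4, 0); x[2, , 2] <- c(5, 1)   # object 2: [4,5] x [0,1]
d <- interval_minmax_dist(x)
```

Both bounds separate across variables for axis-aligned rectangles, which is why the per-variable gaps and spans can be squared and summed independently.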
xprim |
coordinates of rectangles |
STRESSSym |
final STRESSSym value |
Andrzej Dudek
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/symbolicDA/
Billard L., Diday E. (eds.) (2006), Symbolic Data Analysis, Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester.
Bock H.H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday E., Noirhomme-Fraiture M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
Groenen P.J.F, Winsberg S., Rodriguez O., Diday E. (2006), I-Scal: multidimensional scaling of interval dissimilarities, Computational Statistics and Data Analysis, 51, pp. 360-378. Available at: doi:10.1016/j.csda.2006.04.003.
# Example will be available in next version of package, thank You for your patience :-)
Plot in the form of a zoom star chart for a symbolic object described by interval-valued, multivalued and modal variables
zoomStar(table.Symbolic, j, variableSelection=NULL, offset=0.2, firstTick=0.2, labelCex=.8, labelOffset=.7, tickLength=.3, histWidth=0.04, histHeight=2, rotateLabels=TRUE, variableCex=NULL)
table.Symbolic |
symbolic data table |
j |
symbolic object number in symbolic data table used to create the chart |
variableSelection |
numbers of symbolic variables describing symbolic object used to create the chart, if NULL all variables are used |
offset |
relative offset of the chart (margin size) |
firstTick |
position of the first tick (relative to the length of the axis) |
labelCex |
cex parameter of labels |
labelOffset |
relative offset of labels |
tickLength |
relative length of a single axis tick |
histWidth |
relative width of the histogram (for modal variables) |
histHeight |
relative height of the histogram (for modal variables) |
rotateLabels |
if TRUE, labels are rotated to follow the rotation of the axes |
variableCex |
cex parameter of names of variables |
Zoom star chart for a selected symbolic object, in which each axis represents a symbolic variable. Depending on the type of symbolic variable, its realizations are presented as:
a) rectangle - the interval range of an interval-valued variable,
b) circles - categories of a multinominal (or multinominal with weights) variable; coloured circles mark the categories of the variable observed for the selected symbolic object,
c) bar chart - an additional chart for a multinominal with weights variable, in which each bar represents the weight (percentage share) of a category of the variable.
Andrzej Dudek, Justyna Wilk, Department of Econometrics and Computer Science, Wroclaw University of Economics, Poland http://keii.ue.wroc.pl/symbolicDA/
Bock, H.H., Diday, E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.
Diday, E., Noirhomme-Fraiture, M. (eds.) (2008), Symbolic Data Analysis with SODAS Software, John Wiley & Sons, Chichester.
plotInterval in clusterSim
# LONG RUNNING - UNCOMMENT TO RUN
# Example 1
#data("cars",package="symbolicDA")
#sdt<-cars
#zoomStar(sdt, j=12)
# Example 2
#data("cars",package="symbolicDA")
#sdt<-cars
#variables<-as.matrix(sdt$variables)
#indivN<-as.matrix(sdt$indivN)
#dist<-as.matrix(dist_SDA(sdt))
#classes<-DClust(dist, cl=5, iter=100)
#for(i in 1:max(classes)){
#getOption("device")()
#zoomStar(sdt, .medoid2(dist, classes, i))}