SAS Training Code from Data Mining Using SAS® Enterprise Miner

 

Sample Nodes

· Sample Node: The training code that performs a random sample of the source data set.

· Sample Node (Nth): The training code that performs systematic sampling of the source data set by randomly selecting a starting point, then selecting every nth observation thereafter.

· Sample Node (Stratified): The training code that performs stratified sampling of the source data set, in which a categorical-valued variable controls the separation of the observations into non-overlapping groups and a simple random sample is then taken from each group.

· Sample Node (First N): The training code that performs sequential sampling by selecting the first n observations of the source data set.

· Sample Node (Cluster-Random): The training code that performs cluster sampling of the source data set, in which the observations are grouped into clusters and a number of those clusters is then randomly selected.

· Data Partition Node (Random Sample): The training code that partitions the initial data set by randomly sampling the source data set into separate data sets.

· Data Partition Node (Stratified Sample): The training code that partitions the initial data set by performing a stratified sample of the source data set based on the class levels of the categorical-valued input variable. A simple random sample is drawn from each non-overlapping group that is created.

· Data Partition Node (User-Defined Sample): The training code that displays the PROC SQL code that performs a user-defined sample of the source data set, creating the training and validation data sets from the class levels of a binary-valued variable, or the training, validation, and test data sets from three separate class levels of a categorical-valued variable.
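The systematic, stratified, and cluster sampling schemes described in the list above can be sketched in a few lines. This is a minimal Python illustration of the techniques only, not the SAS training code itself; the function names, the seed, and the toy `rows` data (with made-up `bad` and `region` fields) are assumptions introduced for the example.

```python
import random
from collections import defaultdict

def systematic_sample(rows, n, rng):
    """Pick a random start among the first n rows, then keep every nth row."""
    start = rng.randrange(n)
    return rows[start::n]

def stratified_sample(rows, key, fraction, rng):
    """Draw a simple random sample of the given fraction from each stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for level in sorted(strata):
        group = strata[level]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

def cluster_sample(rows, key, n_clusters, rng):
    """Randomly choose whole clusters and keep every row in the chosen ones."""
    clusters = sorted({key(row) for row in rows})
    chosen = set(rng.sample(clusters, n_clusters))
    return [row for row in rows if key(row) in chosen]

# Hypothetical toy data: 100 rows, 20% flagged "bad", four regions.
rng = random.Random(12345)
rows = [{"id": i, "bad": i % 5 == 0, "region": i % 4} for i in range(100)]

nth = systematic_sample(rows, 10, rng)
strat = stratified_sample(rows, lambda r: r["bad"], 0.2, rng)
clus = cluster_sample(rows, lambda r: r["region"], 2, rng)
```

Note that the stratified sample preserves the 20/80 split of the stratification variable, whereas the cluster sample keeps or drops entire groups.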

Explore Nodes

· Distribution Explorer Node: The training code that displays the PROC MEANS procedure, which generates the descriptive statistics across the range of values of the interval-valued axis variables of the 3-D bar chart from the HMEQ data set.

· Multiplot Node: The training code that displays the PROC GCHART procedure that creates numerous bar charts and the PROC GPLOT procedure that creates numerous scatter plots from the HMEQ data set.

· Association Node (Association Analysis): The training code that displays the PROC SORT procedure, which first sorts the data by customer; the PROC ASSOC procedure, which then generates all combinations of associated items up to 2-way; and the PROC RULEGEN procedure, which creates the evaluation criteria statistics from association analysis, i.e., support, confidence, and lift.

· Association Node (Sequence Analysis): The training code that displays the PROC SORT procedure, which first sorts the data by customer and time identifier variable, and the PROC ASSOC procedure, which then determines the best set of items that are related to each other. The PROC SEQUENCE procedure is then applied to create the rules within each sequence from the related items selected by the preceding PROC ASSOC procedure.

· Variable Selection Node (Chi-Sq): The training code that displays the DMSPLIT procedure, which performs the variable selection routine that selects the best combination of input variables for the logistic regression model based on the chi-square modeling criterion. The model fits the binary-valued target variable to predict bad creditors from the HMEQ data set. Some of the input variables are transformed with the logarithm function, and certain interval-valued input variables are binned into categorical input variables, to increase the classification performance of the model.

· Variable Selection Node (R-Square): The training code that displays the DMSPLIT procedure, which performs the variable selection routine that selects the best combination of input variables for the least-squares regression model based on the R-square modeling criterion. The model fits the interval-valued target variable to predict the debt-to-income ratio from the HMEQ data set. Some of the input variables are transformed with the logarithm function, and certain interval-valued input variables are binned into categorical input variables, to increase the predictive power of the model.
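The support, confidence, and lift statistics named in the association analysis entry above can be computed directly for 2-way rules. The following is a minimal Python sketch of those standard definitions, not the PROC ASSOC or PROC RULEGEN code; the function name `two_way_rules` and the small basket data are assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations

def two_way_rules(baskets):
    """Return {(lhs, rhs): (support, confidence, lift)} for all 2-way rules."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))      # one customer's distinct items
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), both in pair_counts.items():
        for lhs, rhs in ((a, b), (b, a)):
            support = both / n                      # P(lhs and rhs)
            confidence = both / item_counts[lhs]    # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)
            rules[(lhs, rhs)] = (support, confidence, lift)
    return rules

# Hypothetical transactions, one list per customer.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
rules = two_way_rules(baskets)
```

A lift above 1 suggests the items occur together more often than independence would predict, and a lift below 1 suggests the opposite.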

Modify Nodes

· Clustering Node: The training code that displays the PROC FASTCLUS procedure, which creates the initial cluster seeds and temporary clusters of the k-means clustering technique in grouping the 2004 Major League Baseball hitters.

· SOM/Kohonen Node (SOM): The training code that displays the PROC DMVQ procedure with the various SOM option settings, which generates the clustering assignments from the Kohonen SOM clustering procedure in grouping the 2004 Major League Baseball hitters.

· SOM/Kohonen Node (VQ): The training code that generates the clustering assignments from the Kohonen VQ clustering procedure in grouping the 2004 Major League Baseball hitters.

· Time Series Node: The training code that displays the PROC TIMESERIES procedure that creates the time series data set in preparation for time series modeling. The procedure generates the various descriptive statistics from the active training data set and the accumulated 12-month seasonal data set.

· Interactive Grouping Node: The training code that generates the various groupings of the input variables based on the binary-valued target variable.
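The k-means technique mentioned in the Clustering Node entry alternates between assigning observations to the nearest seed and recomputing each seed as the mean of its members. Below is a minimal Python sketch of that loop. It simply takes the first k points as seeds, which is a simplification and not the seed-selection rule used by PROC FASTCLUS; the hitter values (home runs, stolen bases) are made up for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic k-means: returns (cluster index per point, final seeds)."""
    seeds = [list(p) for p in points[:k]]   # assumption: first k points as seeds
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest seed.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, seeds[j]))
            for p in points
        ]
        # Update step: each seed becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                seeds[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, seeds

# Hypothetical hitters as (home runs, stolen bases).
hitters = [(40, 2), (5, 45), (38, 5), (3, 50), (42, 1), (6, 40)]
assignment, seeds = kmeans(hitters, k=2)
```

With these values the power hitters and the base stealers separate into two clusters after the first pass, and later passes leave the assignment unchanged.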

Model Nodes

· Regression Node (Least-Squares): The training code that displays the PROC DMREG procedure that generates the multiple linear regression output listings from the HMEQ data set. The procedure displays the various option settings that have been specified within the node. The data set created from the SCORE option statement contains the fitted values, which you can plot across the range of values of the interval-valued input variables to view the accuracy of the statistical model.

· Regression Node (Logistic): The training code that displays the PROC DMREG procedure that generates the logistic regression output listings from the HMEQ data set.

· Neural Network Node (interval): The training code that displays the PROC NEURAL procedure with the various option settings that have been specified from the node in order to generate the neural network estimates from the HMEQ data set by fitting the interval-valued target variable, DEBTINC. The SAS training code is compiled in the background when you execute the Neural Network node and train the neural network model, which generates the various results and scored data sets that are listed within the node.

· Neural Network Node (binary): The training code that displays the PROC NEURAL procedure by fitting the binary-valued target variable, BAD. The neural network model is one of the classification models assessed in the Assessment node.

· Princomp/Dmneural Node (Dmneural): The training code that displays the DMNEURAL procedure that generates the dmneural network modeling estimates from the HMEQ data set. The procedure displays the various option settings that have been specified within the node, such as the criterion statistic, the maximum number of principal components, the maximum number of stages, and the convergence criteria for the iterative nonlinear model.

· Princomp/Dmneural Node (Principal Components): The training code that displays the DMNEURAL procedure that generates the principal components estimates from the 2004 Major League Baseball hitters. The principal components are calculated from the correlation matrix because the input variables display a wide range of values across the various hitting categories.

· Memory-Based Reasoning Node: The training code that displays the PMBR procedure that generates the nearest neighbor modeling estimates from the HMEQ data set with the various options specified from the node, such as the smoothing constant set to 16.

· Two-Stage Model Node: The training code that generates the two-stage modeling estimates from the HMEQ data set. The PROC SPLIT procedure first generates the decision tree classification model, whose target event level becomes one of the input variables to the subsequent multiple linear regression model. The PROC DMREG procedure then fits that regression model to predict the interval-valued target variable in the second stage.
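The Princomp/Dmneural (Principal Components) entry above notes that the components are computed from the correlation matrix because the inputs span very different ranges. The short Python sketch below illustrates why: covariance changes with the units of measurement, while correlation does not. The helper functions and the at-bat and batting-average values are assumptions made up for this example, not data from the linked code.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical hitting statistics on very different scales.
at_bats = [600.0, 550.0, 480.0, 620.0, 510.0]
batting_avg = [0.310, 0.285, 0.260, 0.300, 0.270]
avg_points = [1000 * v for v in batting_avg]   # same variable, rescaled units

cov_raw = covariance(at_bats, batting_avg)
cov_scaled = covariance(at_bats, avg_points)
corr_raw = correlation(at_bats, batting_avg)
corr_scaled = correlation(at_bats, avg_points)
```

Rescaling the batting average multiplies the covariance by 1000 but leaves the correlation unchanged, so components based on the correlation matrix are not dominated by whichever variable happens to have the largest units.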

© copyright www.sasenterpriseminer.com - SAS data mining training code.