For the first experiment we used the Trajan Neural Network Simulator, by Trajan Software, Ltd.
This package can construct a vast number of neural networks
and retain only those with the best performance (usually networks with 50% performance and better).
In addition, the software has the ability to effectively search for a good subset of input attributes, discarding unimportant ones.
It is also possible to conduct a sensitivity analysis and gain some important insights into the usefulness of individual variables.
Before the search was initiated, we specified a ratio of 2:1:1 for the random selection of training, cross-verification, and testing sets.
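The 2:1:1 random split can be sketched as follows (a minimal illustration, not Trajan's actual implementation; `cases` stands for the list of training records):

```python
import random

def split_2_1_1(cases, seed=0):
    """Randomly partition cases into training, cross-verification,
    and testing sets in a 2:1:1 ratio."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n // 2    # two parts out of four
    n_verify = n // 4   # one part out of four
    train = shuffled[:n_train]
    verify = shuffled[n_train:n_train + n_verify]
    test = shuffled[n_train + n_verify:]
    return train, verify, test
```

For the 124 cases used here, this yields 62 training, 31 cross-verification, and 31 testing cases.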
After converting the training set to an appropriate format, an exhaustive search for Linear, Radial Basis Function,
and Multilayer Perceptron neural networks was performed.
After the search, Trajan retained a set of 20 networks.
The best two neural networks were Multilayer Perceptron networks with performance between 0.7096774 and 0.7419355:

| Type | Verification error | Inputs | Hidden nodes | Performance |
|------|--------------------|--------|--------------|-------------|
| MLP  | 0.4091044          | 15     | 64           | 0.7096774   |

We decided to keep the last neural network, since it had:
1. The smallest cross-verification error (0.4091044).
2. A cross-verification classification rate of 0.709677.
3. An area under the ROC curve of 0.719823.
This neural network used all 15 attributes as input nodes, 64 hidden nodes, and one output node (topology: 15-64-1).
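The retained 15-64-1 topology can be sketched in plain NumPy (a hypothetical forward pass for illustration only, not Trajan's implementation; the weights here are random placeholders, not the trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

# 15 input attributes -> 64 hidden nodes -> 1 output node
W1 = rng.normal(scale=0.1, size=(15, 64))  # input-to-hidden weights
b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(64, 1))   # hidden-to-output weights
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Forward pass for one 15-attribute case; returns a value in
    (0, 1) interpreted as the predicted class probability."""
    h = sigmoid(x @ W1 + b1)        # hidden layer activations
    return sigmoid(h @ W2 + b2)[0]  # single output node
```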
Following the automated neural network search with Trajan, we decided to apply the same topology using the backprop machine learning algorithm at the University of Montana Computer Science Department. For this particular experiment we performed the following runs (each with ten-fold cross-validation):
1. Default options for the command file (result: Test Set Fraction Correct: Total results = 60/124 = 0.484).
2. Learning rate in the command file changed to 0.25 (result: Test Set Fraction Correct: Total results = 74/124 = 0.597).
3. Percent validation in the command file changed to 0 (result: Test Set Fraction Correct: Total results = 54/124 = 0.435).
Looking at the results, it was evident that the most favorable prediction rate (0.597) was achieved using a learning rate of 0.25 and a percent_validation value of 0.1 (run #2). Below is the resulting confusion matrix:
Test Set Confusion Matrix (Total results):

|           | Class 0 | Class 1 |
|-----------|---------|---------|
| Correct   | 30      | 44      |
| Incorrect | 32      | 18      |
| Total     | 62      | 62      |

Test Set Fraction Correct: Total results = 74/124 = 0.597
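The fraction correct follows directly from the matrix; as a small check, using the per-class counts above:

```python
# Per-class counts from the test-set confusion matrix above.
correct = [30, 44]    # correctly classified cases per class
incorrect = [32, 18]  # misclassified cases per class

total = sum(correct) + sum(incorrect)    # 124 test cases
fraction_correct = sum(correct) / total  # 74 / 124
print(round(fraction_correct, 3))        # prints 0.597
```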
This result is considerably better than the other two backprop runs (an improvement of more than 10 percentage points in the correct classification rate) and is more consistent with the initial results from Trajan.
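The ten-fold cross-validation used in each backprop run can be sketched as follows (a generic illustration, not the backprop package's actual code; `cases` and `train_and_test` are hypothetical stand-ins):

```python
def ten_fold_cv(cases, train_and_test):
    """Split cases into 10 folds; each fold serves once as the test
    set while the remaining 9 form the training set. Returns the
    overall fraction of test cases classified correctly."""
    k = 10
    folds = [cases[i::k] for i in range(k)]  # simple interleaved split
    correct = total = 0
    for i in range(k):
        test = folds[i]
        train = [c for j, f in enumerate(folds) if j != i for c in f]
        # train_and_test returns the number of test cases it got right
        correct += train_and_test(train, test)
        total += len(test)
    return correct / total
```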
The last experiment with this training data set used C4.5, an improved version of the ID3 decision tree building algorithm. This experiment consisted of building the decision tree and running the data set with the following two options:
1. Information Gain Ratio (with and without pruning; default options when pruning).
2. Information Gain (with and without pruning; default options).
Below are the final ten-fold cross-validation results:
|   | Before Pruning |        | After Pruning |        |          |
|---|----------------|--------|---------------|--------|----------|
|   | Size           | Errors | Size          | Errors | % pruned |
From these results it is evident that using Gain considerably reduced the error, especially for the test runs.
Pruning the tree did not reduce size significantly.
The error increased slightly during training for both Gain and Gain Ratio.
However, Gain produced smaller %error for the training and test runs in comparison with Gain Ratio.
Although the error during training for Gain increased after pruning, the test error did not change.
The conclusion of this experiment is that Gain is the most useful option, with reduced error and smallest tree after pruning.
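The two splitting criteria compared above differ only in a normalization term; a minimal sketch of both, using the standard textbook formulas rather than C4.5's source:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, partitions):
    """Entropy reduction from splitting `labels` into `partitions`."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder

def gain_ratio(labels, partitions):
    """C4.5's criterion: gain normalized by the split's own entropy,
    penalizing attributes with many distinct values."""
    n = len(labels)
    split_info = -sum((len(p) / n) * math.log2(len(p) / n)
                      for p in partitions if p)
    g = information_gain(labels, partitions)
    return g / split_info if split_info > 0 else 0.0
```

A split that separates classes perfectly into two equal halves gives an information gain of 1 bit and, since the split information is also 1 bit, a gain ratio of 1.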
All three experiments produced roughly similar classification results.
Backprop and Trajan achieved a total correct classification rate of approximately 60% (Trajan produced a 70% classification rate for the cross-verification subset; the average classification rate across all subsets was about 60%).
In addition, the decision tree algorithm C4.5 reported an error rate just under 40% using Gain, i.e. roughly 60% correct classification, suggesting a comparable ability to generalize to new data.
These results are very encouraging considering the limited type of data in this training set and the modest number of training cases.
According to the sensitivity analysis with Trajan, the five most important attributes were:
1. Whether or not a plant is annual.
2. Number of counties in the five northwestern states reporting infestations.
3. Whether or not a plant is perennial.
4. Number of European/Asian countries where the plant is exotic.
5. The native latitudinal range in Europe and Asia.
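Sensitivity analysis of this kind can be sketched with permutation importance (a generic stand-in for Trajan's sensitivity measure, which rates an attribute by how much error grows when the attribute is made unavailable; `model`, `X`, and `y` are hypothetical names):

```python
import random

def permutation_sensitivity(model, X, y, attribute, trials=10, seed=0):
    """Estimate an attribute's importance as the increase in error
    when its column is randomly shuffled, breaking its link to the
    target labels."""
    rng = random.Random(seed)

    def error(rows):
        wrong = sum(1 for row, label in zip(rows, y)
                    if model(row) != label)
        return wrong / len(rows)

    baseline = error(X)
    total = 0.0
    for _ in range(trials):
        column = [row[attribute] for row in X]
        rng.shuffle(column)
        shuffled = [row[:attribute] + [v] + row[attribute + 1:]
                    for row, v in zip(X, column)]
        total += error(shuffled)
    return (total / trials) - baseline  # larger => more important
```

An attribute the model ignores scores near zero, while shuffling an attribute the model relies on drives the error, and hence the score, up.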
Based on the results from the
experimental work, we performed a number of refinements in
order to increase our prediction rate. These refinements
consisted of introducing additional variables. One attribute
that has shown significance in earlier studies (Reichard &
Hamilton, 1996) is whether or not a plant is known to invade
elsewhere (i.e. “Invades elsewhere: Yes/No”). Since not all
exotic plants use the same method of reproduction, Rejmanec’s
theory on seed analysis as an indicator for invasiveness could
not be applied. The pines, which Rejmanec studied, reproduce
only from seed. For many noxious weeds, vegetative
reproduction from rhizomes is an important means of population
increase and spread. A variable that addresses reproduction is
appropriate (reproduction: seed, vegetative).