Since the inception of AFIT's data analytics distance learning certificate program in 2019, 147 students representing nearly every MAJCOM, Headquarters Staff, and a variety of agencies have successfully completed all program requirements.
Data analytics is an emerging field focused on extracting key insights from data using a disciplined, process-based approach that includes the collection, organization, storage, and analysis of data, as well as the implementation of computer-based tools and technologies.
The primary goal is to apply statistical analysis and machine learning techniques to datasets to uncover trends and answer specific questions of interest. To accomplish this task, data analysts draw from a range of disciplines, including computer science, programming, and statistics, to develop the key skills essential to conducting quality data analysis.
The Air Force, in striving to be a data-driven organization – one that purposely collects, creates, shares and acts upon quality, authoritative data in and across mission areas – requires analytic talent at the unit level, where most data is generated. Unfortunately, the human capital needed to effectively analyze data for operational decision making is lacking and has the longest lead time to develop or acquire. Consequently, it is critical to empower both civilian and military Airmen with data analytics education.
In 2019, at the behest of the Air Force Chief Information Officer, AFIT enrolled the first cohort of 40 students in a newly created data analytics distance learning certificate program. The program was open to personnel of all ranks and career fields, requiring only a bachelor’s degree and a basic math background.
The program consists of a five-course sequence covering key data analytics skills such as data management and wrangling, statistics, and applied machine learning, all supported by a fundamental understanding and use of software coding.
One key objective of the data analytics certificate is to push data analysis skills to the broader Air and Space Force communities. Since inception, 399 students have enrolled in the program with 147 students successfully completing all program requirements, representing nearly every MAJCOM, Headquarters Staff, and a variety of agencies.
The success of the data analytics program has even extended beyond the Air and Space Forces, drawing the attention of organizations such as Lawrence Livermore National Laboratory and the National Geospatial-Intelligence Agency, both of which have formalized arrangements with AFIT to send students to participate in the certificate program.
Students enrolled in the data analytics program complete a capstone project demonstrating knowledge and skills gained through the program. Below are examples of recent capstone projects.
Classifying Chinese News Headlines Using Natural Language Processing
Categorizing news articles by the text of their headlines can be an efficient way to sort a large volume of articles into categories for review. In this project, a dataset of 380,000 Chinese-language articles in 14 topic areas was modeled using multiple Python-based natural language processing (NLP) approaches. Both character-based and word-based approaches to selecting the most important model inputs were tested, and the best approach was a word-based Naïve Bayes model. Because written Chinese does not separate words with spaces, the Python jieba library was used to segment the text into “words” for this NLP task. The best model was 91% accurate at classifying the news articles into the 14 categories, as measured on a dataset that was not used for training the model. Performance on the military news topic was comparable to that on all other categories. The results showed that these machine learning NLP models could effectively sort Chinese articles by the words in their titles, enabling more efficient processing of articles and avoiding the extra step of translating the headlines by hand.
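As a rough illustration of the word-based Naïve Bayes idea described above, the following minimal sketch trains a multinomial Naïve Bayes classifier on a handful of made-up headlines. It is not the project's code: the headlines, the two topic labels, and the function names are all hypothetical, and the tokens are supplied already split into words, standing in for the segmentation step the project performed with jieba.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on pre-tokenized headlines."""
    class_counts = Counter(labels)          # number of headlines per topic
    word_counts = defaultdict(Counter)      # topic -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocab": vocab,
        "alpha": alpha,                     # Laplace smoothing strength
        "n_docs": len(docs),
    }


def predict(model, tokens):
    """Return the topic with the highest log posterior for one headline."""
    best_label, best_score = None, -math.inf
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    for label, n_label in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / model["n_docs"])  # log prior
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # skip words never seen during training
            # Laplace-smoothed log likelihood of the word given the topic
            score += math.log((counts[tok] + alpha) / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up headlines, already segmented into words (hypothetical data).
train_docs = [
    ["股市", "上涨", "投资"],
    ["银行", "利率", "投资"],
    ["球队", "比赛", "冠军"],
    ["球员", "比赛", "进球"],
]
train_labels = ["finance", "finance", "sports", "sports"]

model = train_naive_bayes(train_docs, train_labels)
prediction = predict(model, ["比赛", "冠军", "球员"])
```

A real pipeline would replace the hand-split tokens with the output of a segmenter and would measure accuracy on headlines held out from training, as the project did.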
URL-based Malicious Website Prediction
One cybersecurity method to prevent connections to malicious domains is to block users from accessing websites on a blacklist of known malicious sites. However, new malicious websites are constantly being added, and a blacklist would be outdated as soon as it was created. The approach used in this work is to evaluate characteristics of the website uniform resource locator (URL), such as amazon.com or 06absence01.yolasite.com, to determine whether it is malicious. This type of model would be able to identify new malicious websites as they appear. The model was trained on a dataset of 179,000 URLs, of which 53% were normal and 47% were malicious. Using only the URL, 15 features were extracted relating to URL subcomponents, character statistics, and randomness metrics. The best model was a neural network with four hidden layers that correctly identified malicious websites 91% of the time, as measured on a 25% split of the dataset that was not used for training. This model drastically improved on a random model, which had an accuracy of 47%. With this model, cybersecurity teams could potentially identify and respond to malicious domains quickly and without needing third-party enrichment techniques.
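To give a sense of what URL-only features might look like, the sketch below computes a few values in the three families mentioned above: subcomponents, character statistics, and randomness. The specific features and function names are assumptions chosen for illustration. The article does not list the 15 features the project actually used, and Shannon entropy is only one possible randomness metric.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy in bits per character; higher means more random-looking."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def url_features(url):
    """Extract a few hypothetical features from a URL string (no network access)."""
    host = url.split("/")[0]      # keep the host name, drop any path
    labels = host.split(".")      # e.g. ["06absence01", "yolasite", "com"]
    digits = sum(ch.isdigit() for ch in host)
    return {
        "length": len(url),                                 # character statistic
        "num_labels": len(labels),                          # subcomponent count
        "num_digits": digits,                               # character statistic
        "digit_ratio": digits / len(host) if host else 0.0,
        "entropy": shannon_entropy(host),                   # randomness metric
    }


# The two example URLs quoted in the article.
benign = url_features("amazon.com")
suspect = url_features("06absence01.yolasite.com")
```

Feature dictionaries like these would then be converted to numeric vectors and passed to a classifier, such as the neural network the project describes.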
Predicting the Expected Lifetime of Satellites
An accurate prediction of a satellite’s life expectancy can potentially save millions of dollars and prevent gaps in capability. In this study, the manufacturer’s expected lifespan of 3,095 satellites was predicted using nine satellite features: type of orbit, mass, customer, and six attributes of the orbit. The lifespans ranged from ¼ year to 30 years and averaged 6.3 years. The best neural network model predicted the satellite lifespan with a mean absolute error of 0.47 years, as measured on a holdout dataset. To find the best model, hyperparameter sweeps were run over the number of neurons, the number of layers, the learning rate, and the number of epochs. In future studies, the model could potentially be improved using space weather data, as space weather events may affect overall satellite lifetime.
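The two mechanics mentioned above, a hyperparameter sweep and a mean absolute error score, can be sketched in a few lines. This is an illustration only: the grid values, the lifespans, and the predictions are made up, since the article does not report the values the study swept or any individual predictions, and no neural network is trained here.

```python
from itertools import product


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def sweep_configs(grid):
    """Enumerate every combination of hyperparameter values in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Hypothetical sweep values for the four hyperparameters named in the article.
grid = {
    "neurons": [16, 32, 64],
    "layers": [1, 2, 3],
    "learning_rate": [0.01, 0.001],
    "epochs": [50, 100],
}
configs = sweep_configs(grid)  # one dict per candidate model

# Made-up holdout lifespans in years and made-up predictions for them.
actual = [0.25, 5.0, 15.0, 30.0]
predicted = [0.75, 4.5, 15.5, 29.5]
mae = mean_absolute_error(actual, predicted)
```

In a full sweep, each configuration would be used to train a model, and the configuration with the lowest error on data not used for training would be kept.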