Implementing a Machine Learning System to Identify At-Risk Students
Why Machine Learning?
One of the many reasons data science and data analytics have exploded in recent years is our ability to take data and make predictions about the future.
For instance, we can use data to predict:
- sale price of a home
- box office revenue for a new film
- customer churn
- whether a financial transaction is fraudulent
- business revenue
- healthcare prognoses
- high school graduation rates
- birth rates
- disease spread
Now, we’re not magicians. We simply take data, analyze it, and use it to create models. These models then give us predictions about future (unseen) data.
We all make simple predictions every day. For instance, say you use public transportation. You take the same bus to school or work every morning for 4 years. It arrives at 8:15. During that time, you notice that each time it snows, the bus is at least 30 minutes late. So Monday morning you wake up and realize it started snowing overnight and there are 6 inches of snow on the ground. It’s also 20º with a wind chill that makes it feel like 5º. Are you outside at 8:15?
I’m not, and I’ll bet most people aren’t. Based on historical data, we are predicting that the bus will be at least 30 minutes late. This in turn changes our behavior.
In a business context, predictions provide businesses with insights that result in real value. For example, if we ran a model to predict customer churn among Netflix users, Netflix could then target the users who were at risk of churning with a specific campaign, maybe offering those customers a discount on services.
This blog summarizes a 2015 paper titled “A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes”. The paper aimed to predict which students were at risk of adverse academic outcomes, specifically students at risk of not graduating high school on time. The authors, Himabindu Lakkaraju, Everaldo Aguiar, Carl Shan, David Miller, Nasir Bhanpuri, Rayid Ghani, and Kecia L. Addison, presented several models that were used to develop early interventions targeting students at risk.
The link to the article can be found here:
A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes
High school graduation rates have been on an upward trend since 2010.
Overall, in 2020–2021, the high school graduation rate increased to 86.2%. Despite this, almost 750,000 students still don’t finish high school on time.
Studies have shown that not graduating high school on time impacts a student’s future career prospects immensely [1, 2]. According to the authors of this study, “students who do not graduate on time can strain school districts’ resources. To address this issue, school districts have been heavily investing in the construction and deployment of intervention programs to better support at risk students and their individual needs. The success of these individualized intervention programs depends on schools’ ability to accurately identify and prioritize students who need help.” [3]
The DATA:
The authors partnered with two large school districts:
District A
- Two cohorts of 10,884 and 10,829 students, expected to graduate in 2012 and 2013 respectively. Most of the students in these cohorts were tracked from 6th through 12th grade.
District B
- Two cohorts of 1,499 and 1,575 students, expected to graduate in 2012 and 2013 respectively. Most of the students in these cohorts were tracked from 8th through 12th grade.
The TASK:
- It is more difficult to help a student who is already “off track” (i.e., their GPA is dropping or they are repeating grades).
- Identify at-risk students as early as possible in order to make the most of limited resources.
- Identify at-risk students before they become off track.
- Build a model that can assign a risk score to each student.
The RESULTS:
The authors built several models to identify at-risk students: Random Forest, Decision Trees, Support Vector Machines, AdaBoost, and Logistic Regression. They evaluated these models with several metrics: accuracy, AUC, precision, and recall.
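To make those four metrics concrete, here is a minimal sketch in plain Python. The labels and scores are made-up toy values (not data from the paper), the 0.5 cutoff is an assumption for illustration, and the function names are my own. Here 1 means "did not graduate on time".

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = at risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all students that were classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    """Of the students flagged as at risk, the share that really were."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Of the students who really were at risk, the share that got flagged."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc(y_true, scores):
    """Chance that a random at-risk student scores above a random
    not-at-risk student (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

print("accuracy :", accuracy(y_true, y_pred))
print("precision:", round(precision(y_true, y_pred), 3))
print("recall   :", round(recall(y_true, y_pred), 3))
print("AUC      :", round(auc(y_true, scores), 3))
```

Notice that accuracy, precision, and recall depend on where you set the cutoff, while AUC only looks at how well the scores order the students.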
Most of the classification models the authors used assigned confidence/probability estimates to each of the data points. They chose to use these estimates to rank students and assign risk scores.
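As a rough illustration of ranking by probability estimates, here is a small Python sketch. The student IDs and probabilities are hypothetical, and the `top_k` helper reflects my own assumption that a school can only support a fixed number of students. This is not the authors' code.

```python
def rank_by_risk(probabilities):
    """Return student IDs ordered from highest to lowest estimated
    probability of not graduating on time."""
    return sorted(probabilities, key=lambda sid: probabilities[sid], reverse=True)


def top_k(probabilities, k):
    """Pick the k highest-risk students, e.g. when a school only has
    capacity to support k students (k is an assumed resource limit)."""
    return rank_by_risk(probabilities)[:k]


# Hypothetical model outputs: student ID -> estimated probability.
probs = {"s01": 0.12, "s02": 0.87, "s03": 0.45, "s04": 0.66, "s05": 0.08}

ranked = rank_by_risk(probs)
for position, sid in enumerate(ranked, start=1):
    print(position, sid, probs[sid])

print("Prioritize:", top_k(probs, 2))
```

Ranking means a school does not have to commit to a single yes/no cutoff. It can start at the top of the list and work down as far as its resources allow.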
Overall, the researchers settled on the strongest model, Random Forest. With it, they were able to identify at-risk students and assign each student a risk score.
Why is this relevant?
Predictions allow businesses and individuals to make changes to their behavior, or implement intervention strategies to address adverse events.
We saw two examples above. The first was an individual altering their behavior in response to a prediction (not being at the bus stop at 8:15). The second, from the paper, was a school implementing a strategy to help students.
This is important because predictions can be used across almost all industries:
- managing staffing and resources at a restaurant or store (Black Friday, holiday shopping)
- implementing healthcare interventions (if a person is at risk of developing conditions such as cancer, high blood pressure, heart disease, or diabetes, you can screen earlier, provide education, and help patients develop skills for decreasing stress)
- creating promotions to attract new members (gym)
If you’d like to read more about the models the authors developed, here’s the link: http://www.dssgfellowship.org//wp-content/uploads/2016/04/montogmery-kd2015.pdf
Citations:
- Building a Grad Nation. http://www.americaspromise.org/sites/default/files/legacy/bodyfiles/BuildingAGradNation2012.pdf
- A. J. Bowers, R. Sprott, and S. A. Taff. Do we know who will drop out?: A review of the predictors of dropping out of high school: Precision, sensitivity, and specificity. The High School Journal, 96(2):77–100, 2013.
- H. Lakkaraju, E. Aguiar, C. Shan, D. Miller, N. Bhanpuri, R. Ghani, and K. L. Addison. A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.