A case study: using agile method for big data analytics projects
By Stéphane Déprès - Let’s take the example of a big data analytics project to understand how the Agile method can be used on this kind of project.
The objective of our sample project is to use telematics data to identify each driver’s signature.
We have data from a large set of drivers’ trips: the car’s position, recorded every second.
For a subset of the trips (covering all drivers) we know who is driving the car; for the other trips, we want to identify the driver by recognizing his or her driving behavior.
Clearly we have to use machine learning techniques with supervised learning. This implies a standard machine learning process:
- Explore data and derive features that will be used by our predictive model
- Separate the data with known drivers into two subsets: a training set for the supervised learning phase, and a second set to analyze the performance of our classifier
- Train a classifier on the training set and analyze its performance using the other data set. Examples of classifiers are Naïve Bayes, Support Vector Machines, Logistic Regression, Random Forests and Deep Learning.
- If the volume of data to be analyzed is huge, we may industrialize the process and use MapReduce on Hadoop to improve performance
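The steps above can be sketched end to end. The sketch below is a minimal illustration using only NumPy and a toy nearest-centroid classifier; the synthetic data, the 70/30 split and the classifier choice are assumptions for illustration, not the project’s actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trips, n_features, n_drivers = 300, 6, 3
X = rng.normal(size=(n_trips, n_features))    # one feature vector per trip (toy data)
y = rng.integers(0, n_drivers, size=n_trips)  # known driver id per trip
X[y == 1] += 2.0                              # shift classes so the toy data is separable
X[y == 2] -= 2.0

# Step 2: split the labeled trips into a training set and a performance set
idx = rng.permutation(n_trips)
cut = int(0.7 * n_trips)
train, test = idx[:cut], idx[cut:]

# Step 3: "train" a nearest-centroid classifier and measure its accuracy
centroids = np.array([X[train][y[train] == d].mean(axis=0) for d in range(n_drivers)])
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == y[test]).mean()
print(f"accuracy on held-out trips: {accuracy:.2f}")
```

In the real project, one of the classifiers listed above (Naïve Bayes, SVM, etc.) would take the place of the nearest-centroid stand-in.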
Now let’s look in detail at the project implementation. We chose to use Scrum and to adapt it if needed.
The sprint 1 goal was to explore the data and derive the features that would be used by our predictive model:
- From the raw car-position data we derived some intermediate data: speed, acceleration and curvature
- While exploring speed and curvature data samples, we realized that some parts of the trips were in fact parking maneuvers, that other parts corresponded to highway driving, and that there were of course stops during the trips…
- From the speed, acceleration and curvature we calculated features intended to identify the driver from his or her behavior. We selected the following features in sprint 1:
- discretized distribution of trip lengths
- discretized distribution of speed in a straight line
- discretized distribution of speed after a stop (acceleration after a stop)
- discretized distribution of speed before a stop (braking strength)
- During the sprint we also explored feature data samples to analyze their relevance
During sprint 2, we applied a first classifier to the selected features and analyzed its performance using the second data set. The results were clearly poor!
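To make “clearly poor” concrete: a confusion matrix shows which drivers the classifier mixes up and lets us compare accuracy with the 1/n chance level. The labels and predictions below are made up for illustration; they are not the project’s actual results.

```python
import numpy as np

n_drivers = 3
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 2])  # actual drivers (made up)
y_pred = np.array([0, 1, 1, 2, 2, 0, 0, 1, 1, 2])  # hypothetical classifier output

conf = np.zeros((n_drivers, n_drivers), dtype=int)
for true_d, pred_d in zip(y_true, y_pred):
    conf[true_d, pred_d] += 1            # rows: true driver, columns: predicted driver

accuracy = np.trace(conf) / conf.sum()
print(conf)
print(f"accuracy: {accuracy:.2f} (chance level: {1 / n_drivers:.2f})")
```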
During sprint 3, the goal was to improve the relevance of the results by introducing new features and by changing the classifier. In particular we added the following two new features:
- “Fast Fourier Transform” of the speed, to capture periodicities of speed fluctuation due to servo control (speed regulation, e.g. cruise control)
- correlation between acceleration and curvature
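These two features can be sketched as follows, assuming a per-second speed signal. The synthetic signal (an 8-second oscillation mimicking a speed-control loop) and the toy acceleration/curvature relation are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(512)  # seconds, 1 Hz sampling
# Toy speed signal: a small 8-second oscillation (speed-control loop) plus noise
speed = 25.0 + 0.5 * np.sin(2 * np.pi * t / 8) + rng.normal(0, 0.1, size=t.size)

# Feature 1: FFT of the speed, exposing the periodic fluctuation of the control loop
spectrum = np.abs(np.fft.rfft(speed - speed.mean()))
freqs = np.fft.rfftfreq(t.size, d=1.0)    # Hz
dominant_freq = freqs[spectrum.argmax()]  # should sit near 1/8 Hz here

# Feature 2: correlation between acceleration and curvature
accel = np.diff(speed)
# Toy curvature that decreases when the driver accelerates (braking into bends)
curvature = 0.02 - 0.5 * accel + rng.normal(0, 0.005, size=accel.size)
corr = np.corrcoef(accel, curvature)[0, 1]
print(f"dominant period: {1 / dominant_freq:.1f} s, accel/curvature corr: {corr:.2f}")
```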
The relevance of the classification was still not very good…
We then performed additional iterations (introducing new features / improving the classifier) until the relevance of the results was acceptable…
From an Agile viewpoint, we have learnt the following:
- The project is necessarily iterative, as some decisions have to be taken based on the results of increments: are the chosen features relevant? What is the best classifier?
- As these kinds of decisions have to be taken frequently, it is not possible to wait for the sprint review of a two- or three-week sprint. We therefore recommend very short time-boxed sprints – not necessarily with the same planned duration from one sprint to the next, as the sprint duration is driven by the machine learning process.
- The backlog stories are also necessarily shaped by the machine learning process. For instance, some stories correspond to the construction of features: clearly not end-user-oriented stories.
- The backlog, even at epic level, cannot be completed at the beginning of the project, as the number of machine learning improvement sprints to be executed is not known when the project starts.
- During iteration 0, story mapping based on business processes may be replaced by story mapping based on the machine learning process
- There is no reason to say that the other Scrum elements are not useful: scrum master, data science product owner, development team, backlog, definition of done, daily scrum, retrospective…
It can even be said that Scrum is particularly relevant for data science projects, even if some aspects need to be adjusted (very short time-boxed sprints, not necessarily with the same planned duration, for example). In short: not pure Scrum, but ScrumBut!