
It appears to me the cross-industry standard process for data mining (CRISP-DM) is still, almost a quarter century after it was first formulated, a valuable framework to guide the management of a data science team. Start by building business understanding, then move on to understanding the data, preparing it, modelling to solve the problem, evaluating the model, and finally deploying it. The framework is iterative, and allows for back-and-forth between these steps based on what's learned later on.

CRISP-DM

It doesn't put much emphasis on scheduling the activities, but focuses on value creation.

John Boyd's Observe-Orient-Decide-Act (OODA) loop seems to be an analogous concept. Competing businesses would then be advised to speed up their cycling through the CRISP-DM loop, as that's how Boyd stated advantage is obtained - by cycling through the OODA loop more quickly than one's opponent. Most interestingly, in both loops it's a common pitfall to skip the last step - deploying the model / acting.

OODA loop

(Image by Patrick Edwin Moran - Own work, CC BY 3.0)

Posted Tue Jun 19 20:51:36 2018 Tags:

I have been asked a few times recently about my management style. First while applying for a position myself, then, less expectedly, by a member of the org I joined and by a candidate I interviewed for a position on the team.

My answer was not very concise, as I lacked the framework knowledge to be so.

Today, I believe I stumbled on a description of the style I practice (or certainly aim to practice) most often, on Adam Drake's blog. Its name? Mission Command. (The key alternative being detailed command.)

Now this is an interesting revelation for more than one reason. I consider it a positive thing that I can now articulate more clearly how I naturally tend to work as a team leader. Reviewing the key principles also makes clear what is important to me:

  • Build cohesive teams through mutual trust.
  • Create shared understanding.
  • Provide a clear commander’s intent.
  • Exercise disciplined initiative.
  • Use mission orders.
  • Accept prudent risk.

Reviewing these principles in detail makes clear this style of leadership should not be mistaken for laissez-faire. Providing clear commander's intent, creating shared understanding and using mission orders are very active principles for the leader. For the subordinate, the need to exercise disciplined initiative is clearly not a free-for-all either. The need for mutual trust for this to work cannot be emphasised enough.

Posted Wed Feb 14 15:38:33 2018 Tags:

Dries Buytaert wrote last week about intending to use social media less in 2018. As an entrepreneur developing a CMS, he has a vested interest in preventing the world from coming to see the internet as just Facebook, Instagram or Twitter (or maybe in reversing that current state). Still, I believe he is genuinely concerned about the effect of using social media on our thinking, partly because I share the observation. Despite having been an early adopter, I disabled my Facebook account a year or two ago already, and I'm currently in doubt whether I should do the same with Twitter.

I notice Twitter is actually not as good a source of news as classic news sites - headlines simply get repeated numerous times when major events happen, and other news is just as easily picked up browsing a traditional website. Fringe and mainstream thinkers alike in the space of management, R stats, computing hardware and so on are a different matter. While, as Dries notices, their micro-messages are typically not well worked out, they do make me aware of what they have blogged about - for those that actually still blog. So is it a matter of trying to increase my Nextcloud newsreader use, maybe during dedicated reading time, no longer opening the Twitter homepage on my phone at random times throughout the day, and conceding that short statements without a more worked-out bit of content behind them are not all that useful?

The above focuses on consuming the content of others. To foster conversations, which arguably is also the intent of social media, we might need something like webmentions to pick up steam.

Posted Mon Jan 8 21:04:09 2018 Tags:

The Internet Archive contains a dataset from the NYC Taxi and Limousine Commission, obtained under a FOIA request. It includes a listing of every taxi ride in 2013, with the number of passengers, the distance covered, start and stop locations and more.

The dataset is a whopping 3.9 GB compressed, or shy of 30 GB uncompressed. As such, it is quite unwieldy in R.

As I was interested in summarised data for my first analysis, I decided to load the CSV files into a SQLite database, query it using SQL and store the resulting output as CSV files again - far smaller ones though, as I only needed 2 columns for each day of the year of data.

The process went as follows.

First extract the CSV file from the 7z compressed archive.

7z e ../trip_data.7z trip_data_1.csv

and the same for the other months. (As I was running low on disk space, I could only do 2 months at a time.) Next, import it into a SQLite database.

echo -e '.mode csv \n.import trip_data_1.csv trips2013' | sqlite3 trips2013.db

Unfortunately the header row uses ", " as separator, so the imported column names start with a space. (This does not happen when importing interactively from the sqlite3 command line - why is still to be determined.) As a result, those column names need to be quoted in the query below.

Repeat this import for all the months - as mentioned, I did 2 at a time.

Save the output we need in temporary CSV files:

sqlite3 -header -csv trips2013.db 'select DATE(" pickup_datetime"), count(" passenger_count") AS rides, sum(" passenger_count") AS passengers from trips2013 GROUP BY DATE(" pickup_datetime");' > 01-02.csv

Remove the extracted files and the database, and repeat:

rm trip_data_?.csv
rm trips2013.db

Next, I moved on to the actual analysis work in R.
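To get started, the per-period summary files can be combined into one data frame. A minimal sketch, assuming the file names (01-02.csv, ...) and column order produced by the query above; the actual code may differ:

# Combine the per-period summary CSVs into one data frame.
library(tidyverse)

trips <- list.files(pattern = "[0-9]{2}-[0-9]{2}\\.csv$") %>%
  map_dfr(read_csv, col_types = "Dii") %>%      # date, ride count, passenger count
  rename(date = 1, rides = 2, passengers = 3)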

Looking at the number of trips per day on a calendar heatmap reveals something odd - the first week of August has very few rides compared to any other week. While it's known people in NY tend to leave the city in August, this drop is odd.

Calendar heatmap of trips

Deciding to ignore August altogether, and zooming in on the occupancy rate of the taxis rather than the absolute number of rides, reveals an interesting insight - people travel together far more on weekends and on public holidays!

Occupancy heatmap
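The occupancy rate itself is simply passengers divided by rides per day. A sketch of how the heatmap above could be built from the trips data frame sketched earlier; the plotting details are an approximation of the actual code:

library(lubridate)

trips %>%
  filter(month(date) != 8) %>%                  # ignore the suspect August data
  mutate(occupancy = passengers / rides,
         month = month(date, label = TRUE),
         weekday = wday(date, label = TRUE, week_start = 1),
         week = isoweek(date)) %>%
  ggplot(aes(x = week, y = weekday, fill = occupancy)) +
  geom_tile() +
  facet_wrap(~ month, scales = "free_x") +
  labs(title = "NYC taxi occupancy per day, 2013", x = "Week of the year", y = NULL)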

Just looking at the calendar heatmap it's possible to determine 1 Jan 2013 was a Tuesday, and to point out Memorial Day as the last Monday of May, Labour Day in September, Thanksgiving Day and even Black Friday at the end of November, and of course the silly season at the end of the year!

The dataset contains even more interesting information in its geo-location columns I imagine!

Posted Thu Nov 30 21:36:05 2017 Tags:

Trying to plot the income per capita in Australia on a map, I came across a perfectly good reason to use a spatial query in R.

I had to combine a shapefile of Australian SA3s - Statistical Area Level 3, a concept used under the Australian Statistical Geography Standard - with a dataset of income per postal code. I created a matrix of intersecting postal codes and SA3s, and obtained the desired income per capita by SA3 by performing a matrix multiplication, as sketched below. If the geographical areas were perfectly alignable, using a function like st_contains would have been preferred. Instead, I fell back on using st_intersects, which can result in assigning a postal code to 2 different statistical areas. Alternative approaches are welcome in the comments!
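A minimal sketch of that intersection-matrix approach, with hypothetical object names (sa3, postcodes and income); the actual code is in the repository linked below:

library(sf)
library(tidyverse)

# Logical matrix: rows are SA3s, columns are postal codes, TRUE where they intersect.
overlap <- st_intersects(sa3, postcodes, sparse = FALSE)

# Weight each SA3's intersecting postal codes equally, then aggregate
# postal-code level income up to SA3 level with a matrix multiplication.
# (Assumes the rows of income line up with the rows of postcodes.)
weights <- overlap / pmax(rowSums(overlap), 1)
sa3$income_per_capita <- as.vector(weights %*% income$mean_income)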

As Australia is so vast, and the majority of its people earn a living in a big city, a full map does not show the differences in income per area at this level of detail. Instead, I opted to map some of the key cities in a single image.

Income distribution in major AU cities

The full code is available on my git server for you to clone using git clone git://git.vanrenterghem.biz/R/project-au-taxstats.git.

Posted Thu Nov 16 15:07:59 2017 Tags:

In the Northern hemisphere, it's commonly said women prefer to give birth around summer. It would appear this does not hold for Australia. The graph below actually suggests most babies are conceived over the summer months (December to February) down under!

seasonal subseries plot Australian births by month 1996-2014

In preparing the graph above (a "seasonal subseries plot"), I could not help but notice the spike in the numbers for each month around 2005. It turns out that was real - Australia did experience a temporary increase in its fertility rate. Whether that was thanks to government policy (baby bonus, tax subsidies) or other causes is not known.
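For reference, a seasonal subseries plot can be produced along these lines - a sketch assuming a monthly series of birth counts; the actual code is in the repository below:

library(forecast)
library(ggplot2)

# 'births' is an assumed data frame of monthly birth counts, ordered by month.
births_ts <- ts(births$Births, start = c(1996, 1), frequency = 12)

ggsubseriesplot(births_ts) +
  labs(title = "Australian births by month, 1996-2014", y = "Births")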

Full R code is on my git server. Check it out - there are a few more plots in there already. I might write about these later.

Posted Thu Oct 26 13:06:27 2017 Tags:

How cool is this? A map of Western Australia with all state roads marked in only 5 lines of R!

WARoads <- st_read(dsn = "data/", layer = "RoadNetworkMRWA_514", stringsAsFactors = FALSE)
WALocalities <- st_read(dsn = "data/", layer = "WA_LOCALITY_POLYGON_shp", stringsAsFactors = FALSE)
ggplot(WALocalities) +
  geom_sf() +
  geom_sf(data = dplyr::filter(WARoads, network_ty == "State Road"), colour = "red")

Map of WA state roads

Courtesy of the development version of ggplot2 - geom_sf is not yet available in the version on CRAN.

Posted Tue Oct 17 20:35:40 2017 Tags:

Turns out it is possible, thanks to the good folks at Stamen Design, to get fairly unobtrusive maps based on the OpenStreetMap data.

Combining this with a SLIP dataset from the Western Australian government on active schools in the state provided a good opportunity to check out the recently released sf (Simple Features) package in R.

Simple features are a standardised way to encode spatial vector data. Support in R was added in November 2016. For users of the tidyverse, this makes manipulating shapefiles and the like easier, as simple features in R are data frames!
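That point is easy to illustrate. A minimal sketch under assumed file and column names; the real code, including the Stamen basemap, is in the git repository below:

library(sf)
library(tidyverse)

schools <- st_read("data/active_schools.shp", stringsAsFactors = FALSE)

# An sf object is a data frame, so the usual dplyr verbs apply directly.
secondary <- schools %>%
  filter(school_type == "Secondary")

ggplot(secondary) +
  geom_sf(aes(size = total_students), colour = "red", alpha = 0.5) +
  labs(title = "WA secondary schools by student population", size = "Students")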

Plot of secondary schools by student population

Full details in git.

Posted Thu Oct 12 21:30:51 2017 Tags:

Road fatalities in Australia

Recently inspired to do a little analysis again, I landed on a dataset from https://bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx, which I downloaded on 5 Oct 2017. Publishing datasets like this openly is a great example of how governments are moving with the times!

Trends

I started by looking at the trends - what is the approximate number of road fatalities a year, and how is it evolving over time? Are there any differences noticeable between states? Or by gender?

Overall trend line
Trend lines by Australian state
Trend lines by gender
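A plot like the trend lines by state above could be produced along these lines - a sketch; the column names follow the BITRE dataset, but the actual code may differ:

fatalities %>%
  filter(Year != 2017) %>%                 # 2017 was incomplete at download time
  count(Year, State, name = "Fatalities") %>%
  ggplot(aes(x = Year, y = Fatalities, colour = State)) +
  geom_line() +
  labs(title = "Australian road fatalities per year by state")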

What age group is most at risk in city traffic?

Next, I wondered if there were any particular ages that were more at risk in city traffic. I opted to quickly bin the data to produce a histogram.

fatalities %>%
  filter(Year != 2017, Speed_Limit <= 50) %>%   # city traffic only, excluding incomplete 2017
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Australian road fatalities by age group",
       y = "Fatalities") +
  theme_economist()                             # from the ggthemes package

## Warning: Removed 2 rows containing non-finite values (stat_bin).

histogram

Hypothesis

Based on the above, I wondered - are people above 65 more likely to die in slow-traffic areas? To make this a bit easier, I added two variables to the dataset: one splitting people into younger and older than 65, and one indicating whether the speed limit in the area of the crash was at most 50 km per hour - city traffic in Australia - or higher.

fatalities.pensioners <- fatalities %>%
  filter(Speed_Limit <= 110) %>% # drop the < 2% of records above 110 - to be investigated
  mutate(Pensioner = if_else(Age >= 65, TRUE, FALSE)) %>%
  mutate(Slow_Traffic = ifelse(Speed_Limit <= 50, TRUE, FALSE)) %>%
  filter(!is.na(Pensioner))

To answer the question, I produced a density plot and a boxplot.

Density plot
Box plot
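The two plots could be produced roughly as follows - a sketch; the actual code may differ in the details:

# Density of age, split by traffic type
fatalities.pensioners %>%
  ggplot(aes(x = Age, fill = Slow_Traffic)) +
  geom_density(alpha = 0.5) +
  labs(title = "Age distribution of road fatalities by traffic type")

# Age by traffic type as a boxplot
fatalities.pensioners %>%
  ggplot(aes(x = Slow_Traffic, y = Age)) +
  geom_boxplot() +
  labs(title = "Age of road fatalities in slow versus faster traffic")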

Some further statistical analysis does confirm the hypothesis!

# Build a contingency table and perform prop test
cont.table <- table(select(fatalities.pensioners, Slow_Traffic, Pensioner))
cont.table

##             Pensioner
## Slow_Traffic FALSE  TRUE
##        FALSE 36706  7245
##        TRUE   1985   690

prop.test(cont.table)

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  cont.table
## X-squared = 154.11, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07596463 0.11023789
## sample estimates:
##    prop 1    prop 2 
## 0.8351573 0.7420561

# Alternative approach: call prop.test with counts instead of a contingency table
pensioners <- c(nrow(filter(fatalities.pensioners, Slow_Traffic == TRUE, Pensioner == TRUE)),
                nrow(filter(fatalities.pensioners, Slow_Traffic == FALSE, Pensioner == TRUE)))
everyone <- c(nrow(filter(fatalities.pensioners, Slow_Traffic == TRUE)),
              nrow(filter(fatalities.pensioners, Slow_Traffic == FALSE)))
prop.test(pensioners, everyone)

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  pensioners out of everyone
## X-squared = 154.11, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07596463 0.11023789
## sample estimates:
##    prop 1    prop 2 
## 0.2579439 0.1648427

Conclusion

It's possible to conclude older people are over-represented among the fatalities in lower speed zones. Further ideas for investigation include understanding the impact of the driving age limit on the fatalities; the position in the car of those killed (driver or passenger) was also not yet considered in this quick look at the contents of the dataset.

quantile-quantile plot

Posted Tue Oct 10 16:56:56 2017 Tags:

Using Markov chain transition matrices to model the movement of loans from being opened (in a state of "Current") to getting closed can misinform the user at times.

To illustrate the challenge, the graph below plots the evolution, from the original state to the final state, of a group of loans over 6 periods of time.

Actual vs predicted loan vintage performance.

The solid lines are the result of applying an average transition matrix 6 times (the model's predicted outcome). The dashed lines are the actual observed results for a set of loans.
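To make the prediction step concrete, below is a minimal sketch with a made-up three-state chain (Current, Overdue, Closed) and an invented average transition matrix, applied six times to a cohort that starts out entirely Current:

# Illustrative numbers only - rows are the current state, columns the next state.
states <- c("Current", "Overdue", "Closed")
P <- matrix(c(0.90, 0.05, 0.05,
              0.40, 0.40, 0.20,
              0.00, 0.00, 1.00),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))

# Start with the whole cohort in "Current" and apply the matrix once per period.
state <- c(Current = 1, Overdue = 0, Closed = 0)
predicted <- matrix(NA, nrow = 6, ncol = 3, dimnames = list(paste("Period", 1:6), states))
for (period in 1:6) {
  state <- state %*% P
  predicted[period, ] <- state
}
round(predicted, 3)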

As can be seen, the model does not do a very good job of predicting the accounts that will end up in state "Closed" in each period. They end up in one of the states between Current and Closed (i.e. overdue) at a higher than expected rate. Why is that?

The prediction was built using an average of the transition matrices of a number of consecutive period state tables for a book of loans. That book was not homogeneous though. Most obviously, the "Current" accounts were not of the same vintage - some had already been in that state for a number of periods. The observed set of loans, by contrast, all originated in the same period. Other differences can relate to client demographics, loan characteristics or macro-economic circumstances.

Applying a transition matrix based on a group of loans of various vintages to a group of loans that all were new entrants in the book violates the often implied Markov chain assumption of time-homogeneity.

That assumption says the transition probabilities are the same in every period; together with the Markov property - the next state depends only on the current state, not on the path that led to it - it implies a loan's age in the book should not matter.

Loans, however, typically have a varying chance of becoming delinquent depending on how long they have been open already.

Higher-order Markov chains are those that depend on a number (the order) of states in the past. The question becomes - what order is the Markov chain? Put differently, how many previous periods need to be taken into account to accurately estimate the next period's state table? Controlling for the other differences suggested above, if found to be material, may be important as well.
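Whatever order is chosen, the starting point is estimating transition probabilities from observed period-to-period moves. A first-order sketch, assuming a hypothetical data frame with one row per loan and period holding the current and next state:

# 'loan_states' is an assumed data frame with columns 'state' and 'state_next'.
transitions <- table(loan_states$state, loan_states$state_next)

# Row-normalise the counts to obtain the estimated transition matrix.
P_hat <- prop.table(transitions, margin = 1)
round(P_hat, 3)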

Posted Fri Feb 17 20:07:10 2017 Tags:

This blog is powered by ikiwiki.