
Dries Buytaert wrote last week about intending to use social media less in 2018. As an entrepreneur developing a CMS, he has a vested interest in preventing the world from seeing the internet as just Facebook, Instagram or Twitter (or in reversing that current state, maybe). Still, I believe he is genuinely concerned about the effect of using social media on our thinking, partly because I share the observation. Despite having been an early adopter, I disabled my Facebook account a year or two ago already, and I'm currently in doubt whether I should not do the same with Twitter. I notice it actually is not as good a source of news as classic news sites - headlines simply get repeated numerous times when major events happen, and other news is just as easily found by browsing a traditional website. Fringe and mainstream thinkers alike in the space of management, R stats, computing hardware and so on are a different matter. While, as Dries notices, their micro-messages are typically not well worked out, they do make me aware of what they have blogged about - for those who actually still blog. So is it a matter of trying to increase my Nextcloud news reader use, perhaps during dedicated reading time, no longer opening the Twitter homepage on my phone at random times throughout the day, and conceding that short statements without a more worked-out piece of content behind them are not all that useful?

The above focuses on consuming the content of others. To foster conversations, which arguably is also the intent of social media, we might need something like webmentions to pick up steam.

Posted Mon Jan 8 21:04:09 2018 Tags:

The Internet Archive contains a dataset from the NYC Taxi and Limousine Commission, obtained under a FOIA request. It includes a listing of each taxi ride in 2013, its number of passengers, distance covered, start and stop locations and more.

The dataset is a whopping 3.9 GB compressed, or shy of 30 GB uncompressed. As such, it is quite unwieldy in R.

As I was interested in summarised data for my first analysis, I decided to load the CSV files into a SQLite database, query it using SQL and store the resulting output as CSV files again - far smaller though, as I only needed 2 columns for each day of the one year of data.

The process went as follows.

First extract the CSV file from the 7z compressed archive.

7z e ../trip_data.7z trip_data_1.csv

and the same for the other months. (As I was running low on disk space, I could only do 2 months at a time.) Next, import it into a SQLite database.

echo -e '.mode csv \n.import trip_data_1.csv trips2013' | sqlite3 trips2013.db

Unfortunately the header row uses ", " as separator, so the column names end up with a leading space. This does not happen when running .import interactively in the sqlite3 shell - to be determined why. As a result, those column names need to be quoted in the query below.

Repeat this import for all the months - as mentioned, I did 2 at time.

Save the output we need in temporary csv files:

sqlite3 -header -csv trips2013.db 'select DATE(" pickup_datetime"), count(" passenger_count") AS rides, sum(" passenger_count") AS passengers from trips2013 GROUP BY DATE(" pickup_datetime");' > 01-02.csv

Remove the extracted CSV files and the intermediate database, then repeat for the next months:

rm trip_data_?.csv
rm trips2013.db

Next, I moved on to the actual analysis work in R.
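That code is not reproduced in this post, but a minimal sketch of the idea could look like this - the file name pattern, the use of lubridate and the column handling are assumptions, not the exact code I used:

library(dplyr)
library(readr)
library(ggplot2)
library(lubridate)

# Read the two-monthly summary files (e.g. 01-02.csv) and combine them
daily <- list.files(pattern = "^[0-9]{2}-[0-9]{2}\\.csv$") %>%
  lapply(read_csv) %>%
  bind_rows() %>%
  rename(date = 1) %>%                       # first column holds the pickup date
  mutate(date = as.Date(date),
         occupancy = passengers / rides,     # used for the occupancy heatmap further down
         month = month(date, label = TRUE),
         weekday = wday(date, label = TRUE),
         week = week(date))

# Calendar-style heatmap of the number of rides per day
ggplot(daily, aes(x = week, y = weekday, fill = rides)) +
  geom_tile() +
  facet_wrap(~ month, scales = "free_x")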

Looking at the number of trips per day on a calendar heatmap reveals something odd - the first week of August has very few rides compared to any other week. While it's known people in NY tend to leave the city in August, this drop is odd.

Calendar heatmap of trips

Deciding to ignore August altogether, and zooming in on the occupancy rate of the taxis rather than the absolute number of rides, reveals an interesting insight - people travel together far more on weekends and on public holidays!

Occupancy heatmap

Just looking at the calendar heatmap, it's possible to determine that 1 Jan 2013 was a Tuesday, and to point out Memorial Day on the last Monday of May, Labour Day in September, Thanksgiving and even Black Friday at the end of November, and of course the silly season at the end of the year!

I imagine the dataset's geo-location columns hold even more interesting information!

Posted Thu Nov 30 21:36:05 2017 Tags:

Trying to plot the income per capita in Australia on a map, I came across a perfectly good reason to use a spatial query in R.

I had to combine a shapefile of Australian SA3s - Statistical Area Level 3, a concept from the Australian Statistical Geography Standard - with a dataset of income per postal code. I created a matrix of intersecting postal codes and SA3s, and obtained the desired income per capita by SA3 through a matrix multiplication. If the geographical areas aligned perfectly, a function like st_contains would have been preferred; instead I fell back on st_intersects, which can assign a postal code to more than one statistical area. Alternative approaches are welcome in the comments!
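A minimal sketch of that approach follows - sa3 and postcodes as sf data frames, and the total_income and persons column names, are assumptions on top of what is described above; the full code is in the repository linked below:

library(sf)

# Logical SA3-by-postcode matrix: TRUE where the two areas intersect
ix <- st_intersects(sa3, postcodes, sparse = FALSE)

# Matrix multiplication aggregates the postal code figures up to SA3 level
# (the logical matrix is coerced to 0/1)
total_income  <- ix %*% postcodes$total_income
total_persons <- ix %*% postcodes$persons
sa3$income_per_capita <- as.vector(total_income / total_persons)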

As Australia is so vast, and the majority of its people earn a living in a big city, a full map does not show the differences in income per area at this level of detail. Instead, I opted to map some of the key cities in a single image.

Income distribution in major AU cities

The full code is available on my git server for you to clone using git clone git://git.vanrenterghem.biz/R/project-au-taxstats.git.

Posted Thu Nov 16 15:07:59 2017 Tags:

In the Northern hemisphere, it's commonly said women prefer to give birth around summer. It would appear this does not hold for Australia. The graph below actually suggests most babies are conceived over the summer months (December to February) down under!

seasonal subseries plot Australian births by month 1996-2014

In preparing the graph above (a "seasonal subseries plot"), I could not help but notice the spike in the numbers for each month around 2005. It turns out that was real - Australia did experience a temporary increase in its fertility rate. Whether that was thanks to government policy (baby bonus, tax subsidies) or other causes is not known.
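For reference, a seasonal subseries plot like the one above can be produced in a couple of lines. This is a sketch, not the exact code from my git server, and it assumes a monthly time series births.ts built from the source data:

library(forecast)
library(ggplot2)

# births.ts is assumed to be a monthly ts, e.g. ts(births$count, start = c(1996, 1), frequency = 12)
ggsubseriesplot(births.ts) +
  labs(title = "Australian births by month, 1996-2014", y = "Births")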

Full R code is on my git server. Check it out - there are a few more plots in there already. I might write about these later.

Posted Thu Oct 26 13:06:27 2017 Tags:

How cool is this? A map of Western Australia with all state roads marked in only 5 lines of R!

# Read the shapefiles (the sf and ggplot2 packages are assumed to be loaded)
WARoads <- st_read(dsn = "data/", layer = "RoadNetworkMRWA_514", stringsAsFactors = FALSE)
WALocalities <- st_read(dsn = "data/", layer = "WA_LOCALITY_POLYGON_shp", stringsAsFactors = FALSE)
# Plot all localities, then overlay the state roads in red
ggplot(WALocalities) +
  geom_sf() +
  geom_sf(data = dplyr::filter(WARoads, network_ty == "State Road"), colour = "red")

Map of WA state roads

Courtesy of the development version of ggplot2 - geom_sf is not yet available in the version on CRAN.

Posted Tue Oct 17 20:35:40 2017 Tags:

Turns out it is possible, thanks to the good folks at Stamen Design, to get fairly unobtrusive maps based on OpenStreetMap data.

Combining this with a SLIP dataset from the Western Australian government on active schools in the state provided a good opportunity to check out the recently released sf (Simple Features) package in R.

Simple features are a standardized way to encode spatial vector data. Support in R was added in November 2016. For users of the tidyverse, this makes manipulating shapefiles and the like easier, as simple features in R are dataframes!
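A sketch of how the two can be combined - the bounding box, file name and column names below are assumptions, not the exact code behind the plot:

library(ggmap)
library(sf)
library(dplyr)

# Read the SLIP schools dataset and reproject to WGS84, which ggmap expects
schools <- st_read("data/active_schools.shp", stringsAsFactors = FALSE) %>%
  st_transform(4326)

# Unobtrusive Stamen basemap for the Perth area
basemap <- get_stamenmap(bbox = c(left = 115.6, bottom = -32.6, right = 116.2, top = -31.4),
                         zoom = 10, maptype = "toner-lite")

# Plot the schools on top, sized by (assumed) student numbers
ggmap(basemap) +
  geom_point(data = cbind(st_drop_geometry(schools), st_coordinates(schools)),
             aes(x = X, y = Y, size = total_students),
             colour = "red", alpha = 0.5)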

Plot of secondary schools by student population

Full details in git.

Posted Thu Oct 12 21:30:51 2017 Tags:

Road fatalities in Australia

Recently inspired to do a little analysis again, I landed on a dataset from https://bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx, which I downloaded on 5 Oct 2017. Publishing datasets like this openly is a great example of how governments are moving with the times!

Trends

I started by looking at the trends - what is the approximate number of road fatalities a year, and how is it evolving over time? Are there any differences noticeable between states? Or by gender?

Overall trend line

Trend lines by Australian state

Trend lines by gender
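A minimal sketch of how the trend lines by state could be produced - the State column name is an assumption about the BITRE extract:

library(dplyr)
library(ggplot2)

fatalities %>%
  filter(Year != 2017) %>%               # 2017 was incomplete at the download date
  count(Year, State) %>%
  ggplot(aes(x = Year, y = n, colour = State)) +
  geom_line() +
  labs(title = "Australian road fatalities per year by state", y = "Fatalities")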

What age group is most at risk in city traffic?

Next, I wondered if there were any particular ages that were more at risk in city traffic. I opted to quickly bin the data to produce a histogram.

fatalities %>%
  filter(Year != 2017, Speed_Limit <= 50) %>%
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Australian road fatalities by age group",
       y = "Fatalities") +
  theme_economist() # from the ggthemes package

## Warning: Removed 2 rows containing non-finite values (stat_bin).

histogram

Hypothesis

Based on the above, I wondered - are people above 65 more likely to die in slow traffic areas? To make this a bit easier, I added two variables to the dataset - one splitting people into those younger than 65 and those 65 or older, and one indicating whether the speed limit in the area of the crash was at most 50 km/h - city traffic in Australia - or higher.

fatalities.pensioners <- fatalities %>%
  filter(Speed_Limit <= 110) %>% # drops records with a speed limit above 110 (less than 2% of the data) - to be determined why they exist
  mutate(Pensioner = if_else(Age >= 65, TRUE, FALSE)) %>%
  mutate(Slow_Traffic = if_else(Speed_Limit <= 50, TRUE, FALSE)) %>%
  filter(!is.na(Pensioner))

To answer the question, I produced a density plot and a boxplot.

density plot

box plot
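These were produced along these lines (a sketch, not the exact code):

library(ggplot2)

# Age density of fatalities in slow versus faster traffic zones
ggplot(fatalities.pensioners, aes(x = Age, fill = Slow_Traffic)) +
  geom_density(alpha = 0.5)

# The same comparison as a boxplot
ggplot(fatalities.pensioners, aes(x = factor(Slow_Traffic), y = Age)) +
  geom_boxplot() +
  labs(x = "Slow traffic")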

Some further statistical analysis does confirm the hypothesis!

# Build a contingency table and perform prop test
cont.table <- table(select(fatalities.pensioners, Slow_Traffic, Pensioner))
cont.table

##             Pensioner
## Slow_Traffic FALSE  TRUE
##        FALSE 36706  7245
##        TRUE   1985   690

prop.test(cont.table)

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  cont.table
## X-squared = 154.11, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07596463 0.11023789
## sample estimates:
##    prop 1    prop 2 
## 0.8351573 0.7420561

# Alternative approach to using prop.test
pensioners <- c(nrow(filter(fatalities.pensioners, Slow_Traffic == TRUE, Pensioner == TRUE)),
                nrow(filter(fatalities.pensioners, Slow_Traffic == FALSE, Pensioner == TRUE)))
everyone <- c(nrow(filter(fatalities.pensioners, Slow_Traffic == TRUE)),
              nrow(filter(fatalities.pensioners, Slow_Traffic == FALSE)))
prop.test(pensioners, everyone)

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  pensioners out of everyone
## X-squared = 154.11, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07596463 0.11023789
## sample estimates:
##    prop 1    prop 2 
## 0.2579439 0.1648427

Conclusion

It's possible to conclude that older people are over-represented in the fatalities in lower speed zones. Further ideas for investigation include the impact of the driving age limit, and the position in the car of those killed (driver or passenger), which was not yet considered in this quick look at the contents of the dataset.

quantile-quantile plot

Posted Tue Oct 10 16:56:56 2017 Tags:

Using Markov chains' transition matrices to model the movement of loans from being opened (in a state of "Current") to getting closed can misinform the user at times.

To illustrate the challenge, the graph below plots the evolution, from the original state to the final state, of a group of loans over 6 periods of time.

Actual vs predicted loan vintage performance.

The solid lines are the result of applying an average transition matrix 6 times (the model's predicted outcome). The dashed lines are the actual observed results for a set of loans.
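To make that prediction step concrete, here is a minimal sketch - the three states and the transition probabilities below are made up for illustration, not the actual model:

states <- c("Current", "Overdue", "Closed")
P <- matrix(c(0.85, 0.10, 0.05,
              0.40, 0.40, 0.20,
              0.00, 0.00, 1.00),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))

# All loans start in "Current"; apply the average transition matrix 6 times
dist <- matrix(c(1, 0, 0), nrow = 1, dimnames = list(NULL, states))
for (period in 1:6) {
  dist <- dist %*% P
  cat("Period", period, ":", round(dist, 3), "\n")
}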

As can be seen, the model does not do a very good job of predicting the accounts that will end up in the "Closed" state in each period. Accounts end up in the states between Current and Closed (i.e. overdue) at a higher than expected rate. Why is that?

The prediction was built using an average of the transition matrices of a number of consecutive period state tables for a book of loans. That book was not homogeneous though. Most obviously, the "Current" accounts were not all of the same vintage - some had already been in that state for a number of periods. The observed set of loans, by contrast, all originated in the same period. Other differences can relate to client demographics, loan characteristics or macro-economic circumstances.

Applying a transition matrix based on a group of loans of various vintages to a group of loans that all were new entrants in the book violates the often implied Markov chain assumption of time-homogeneity.

That assumption says the transition probabilities stay the same in every period - combined with the first-order Markov property, the next state then depends only on the current state, not on how long a loan has already been in it.

Loans, however, typically have a varying chance of becoming delinquent depending on how long they have been open already.

Higher-order Markov chains are chains that depend on a number (the order) of past states. The question then becomes: what order is the Markov chain? Put differently, how many previous periods need to be taken into account to accurately estimate the next period's state table? Controlling for the other differences suggested above, if found to be material, may be important as well.

Posted Fri Feb 17 20:07:10 2017 Tags:

The strength of a predictive, machine-learning model is often evaluated by quoting the area under the curve or AUC (or, similarly, the Gini coefficient). The AUC is the area under the ROC curve, which shows the trade-off between false positives and true positives for different cutoff values. Cutoff values enable the use of a regression model for classification purposes: below the cutoff one class is predicted, above it the other. Models with a higher AUC (or a higher Gini coefficient) are considered better.

This oversimplifies the challenge facing any real-world model builder. The diagonal line from (0,0) to (1,1) is a first hint at that: it represents a model guessing at random, and with an AUC of .5 such a model is effectively worth nothing. Now assume a model with the same AUC, but whose curve veers above the diagonal for a certain range of cutoffs and below it for another.

Such a model may very well have some practical use. This can be determined by introducing an indifference line into the ROC analysis. The area of the ROC space to the upper left of that line is where the model makes economic sense to use.

The slope of the line (s) is defined mathematically as follows:

slope s = (ratio negative * (utility TN - utility FP)) / (ratio positive * (utility TP - utility FN))

Here, ratio negative is the base rate of negative outcomes, utility TN the economic value of identifying a true negative, and so on.

Many such lines can be drawn on any square space - the left-most one crossing either (0,0) or (1,1) is the one we care about.

This line represents combinations of true positive rates and false positive rates that have the same utility to the user. In the event of equal classes and equal utilities, this line is the diagonal of the random model.

ROC space plot with indifference line.

An optimal and viable cutoff is the point where the left-most line parallel to the indifference line is tangent to the ROC curve.

The code to create a graphic like the one above is shown below. Of note is the use of coord_fixed(), which ensures the plot is actually a square as intended.

library(ggplot2)
library(dplyr)
r.p <- nrow(filter(dataset, y == 'Positive')) / nrow(dataset)  # base rate of positive outcomes
r.n <- 1 - r.p                                                 # base rate of negative outcomes
uFP <- -10  # utility of a false positive
uFN <- -2   # utility of a false negative
uTP <- 20   # utility of a true positive
uTN <- 0    # utility of a true negative
s <- (r.n * (uTN - uFP)) / (r.p * (uTP - uFN)) # equals .4 for this data
ROC.plot + # start from a previous plot with the ROC space
  coord_fixed() + # fix the aspect ratio - allows converting slope to angle and is also better for the plotted data
  geom_abline(intercept = ifelse(s < 1, 1 - s, 0), slope = s, colour = "blue") +
  annotate("text", x = 0.05, y = ifelse(s < 1, 1 - s - .01, 0),
           angle = atan(s) * 180 / pi, label = "Indifference line",
           hjust = 0, colour = "blue")

Reference article

Posted Tue Jan 10 23:12:52 2017 Tags:

It is useful to apply the concepts from survival data analysis in a fintech environment. After all, there will usually be a substantial amount of time-to-event data to choose from: website visitors leaving the site, loans being repaid early, clients becoming delinquent - the options abound.

A visual analysis of such data can easily be obtained using R.

library(survminer)
library(survival)
library(KMsurv)
## Create survival curve from a survival object
#' Status is 1 if the event was observed at TimeAfterStart
#' It is set to 0 to mark the right-censored time
vintage.survival <- survfit(Surv(TimeAfterStart,Status) ~ Vintage, data = my.dataset)
## Generate cumulative incidence plot
ci.plot <- ggsurvplot(vintage.survival,
           fun = function(y) 1-y,
           censor = FALSE,
           conf.int = FALSE,
           ylab = 'Ratio event observed',
           xlab = 'Time after open',
           break.time.by = 30,
           legend = "bottom",
           legend.title = "",
           risk.table = TRUE,
           risk.table.title = 'Number of group',
           risk.table.col = "black",
           risk.table.fontsize = 4,
           risk.table.height = 0.35
           )

This produces a plot with a survival curve per group, and also includes the risk table. That table shows how many members of each group - those for whom no event has been observed yet - are still being followed at each point in time. Labelling these "at risk" stems of course from the original setting of survival analysis, where the event typically is the passing of the subject.

The fun = function(y) 1-y part actually reverses the curve, resulting in what is known as a cumulative incidence curve.

Survival/incidence curve and risk table

Underneath the plot, the risk table is added with no effort, simply by passing risk.table = TRUE as a parameter to ggsurvplot.
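For completeness, a sketch of how my.dataset could be prepared from a raw loan table - the loans object, the open_date and event_date column names, and the snapshot date are assumptions for illustration:

library(dplyr)

snapshot_date <- as.Date("2016-12-01")
my.dataset <- loans %>%
  mutate(Status = as.integer(!is.na(event_date)),                          # 1 if the event was observed
         TimeAfterStart = as.numeric(coalesce(event_date, snapshot_date) - open_date),
         Vintage = format(open_date, "%Y-%m"))                             # group loans by month of opening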

Checking the trajectory of these curves for different groups of customers (with a different treatment plan, to stick to the terminology) is an easy way to verify whether actions are having the expected result.

Posted Sat Dec 10 15:49:02 2016 Tags:

This blog is powered by ikiwiki.