blog

Using Markov chains' transition matrices to model the movement of loans from being opened (in a state of "Current") to getting closed can misinform the user at times.

To illustrate the challenge, the graph below plots the evolution, from the original state to the final state, of a group of loans over 6 periods of time.

Actual vs predicted loan vintage performance.

The solid lines are the result of applying an average transition matrix 6 times (the model's predicted outcome). The dashed lines are the actual observed results for a set of loans.

As can be seen, the model does not do a very good job of predicting which accounts will end up in the "Closed" state in each period. Accounts end up in a state between Current and Closed (i.e. overdue) at a higher-than-expected rate. Why is that?

The prediction was built using an average of the transition matrices of a number of consecutive period state tables for a book of loans. That book was not homogeneous though. Most obviously, the "Current" accounts were not all of the same vintage - some had already been in that state for a number of periods. The observed set of loans, by contrast, all originated in the same period. Other differences can relate to client demographics, loan characteristics or macro-economic circumstances.

Applying a transition matrix based on a group of loans of various vintages to a group of loans that were all new entrants in the book violates the often implied Markov chain assumption of time-homogeneity.

That assumption says the transition probabilities do not change over time; together with the Markov property, it implies the next state depends only on the current state, not on how long the loan has already been in it.

Loans typically have a varying chance of becoming delinquent as a function of how long they have been open already.

Higher-order (multi-order) Markov chains are those whose transitions depend on a number (the order) of past states. The question becomes: what order is the Markov chain? Put differently, how many previous periods need to be taken into account to accurately estimate the next period's state table? Controlling for the other differences suggested above, if found to be material, may be important as well.
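As a minimal sketch of the basic first-order, time-homogeneous setup described above - with assumed column names, not the actual model code - the average one-period transition matrix can be estimated from observed state pairs and then applied repeatedly to a vintage that starts out entirely in "Current":

# 'transitions' is assumed to hold one row per loan per period, with the state
# at the start (state_from) and the end (state_to) of that period; every state
# is assumed to occur in both columns, so the estimated matrix is square
transition.matrix <- prop.table(table(transitions$state_from,
                                      transitions$state_to), margin = 1)

states <- rownames(transition.matrix)
vintage <- setNames(as.numeric(states == "Current"), states) # 100% Current at origination
for (i in 1:6) vintage <- vintage %*% transition.matrix      # evolve over 6 periods
vintage # predicted share of the vintage in each state after 6 periods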

Posted Fri 17 Feb 2017 20:07:10 AWST Tags:

The strength of a predictive, machine-learning model is often evaluated by quoting the area under the curve or AUC (or, similarly, the Gini coefficient). The AUC represents the area under the ROC curve, which shows the trade-off between false positives and true positives for different cutoff values. Cutoff values enable the use of a regression model for classification purposes, by marking the value below and above which either of the classifier values is predicted. Models with a higher AUC (or a higher Gini coefficient) are considered better.

This oversimplifies the challenge facing any real-world model builder. The diagonal line from (0,0) to (1,1) is a first hint at that: it represents a model that guesses randomly, and with an AUC of .5 such a model is effectively worth nothing. Now assume a model with the same AUC, but whose curve veers above the diagonal for a certain range of cutoffs and below it for another.

Such a model may very well have some practical use. This can be determined by introducing an indifference line into the ROC analysis. The area of the ROC space to the upper left of that line is where using the model makes economic sense.

The slope of the line (s) is defined mathematically as follows:

slope s = (ratio negative * (utility TN - utility FP)) / (ratio positive * (utility TP - utility FN))

Here, ratio negative is the base rate of negative outcomes, utility TN is the economic value of identifying a true negative, and so on.

Many such lines can be drawn on any square space - the upper-left-most one that still crosses either (0,0) or (1,1) is the one we care about.

This line represents combinations of true positive rates and false positive rates that have the same utility to the user. In the event of equal classes and equal utilities, this line is the diagonal of the random model.

ROC space plot with indifference line.

An optimal and viable cutoff is the point where the ROC curve touches the upper-left-most line parallel to the indifference line, i.e. their tangent point.

The code to create a graphic like the above is shown below. Of note is the use of coord_fixed(), which ensures the plot is actually a square as intended.

library(ggplot2)
library(dplyr)
# 'dataset' holds the observations, with the outcome in column y ('Positive'/'Negative')
r.p <- nrow(filter(dataset, y == 'Positive')) / nrow(dataset)
r.n <- 1 - r.p
uFP <- -10
uFN <- -2
uTP <- 20
uTN <- 0
s <- (r.n * (uTN - uFP)) / (r.p * (uTP - uFN)) # equals .4 for this dataset
ROC.plot + # start from a previous plot with the ROC space
  coord_fixed() + # fix the aspect ratio - needed to convert the slope into a text angle, and better for the plotted data anyway
  geom_abline(intercept = ifelse(s < 1, 1 - s, 0), slope = s, colour = "blue") +
  annotate("text", x = 0.05, y = ifelse(s < 1, 1 - s - .01, 0),
           angle = atan(s) * 180 / pi, label = "Indifference line",
           hjust = 0, colour = "blue")

Reference article

Posted Tue 10 Jan 2017 23:12:52 AWST Tags:

It is useful to apply the concepts from survival data analysis in a fintech environment. After all, there will usually be a substantial amount of time-to-event data to choose from: website visitors leaving the site, loans being repaid early, clients becoming delinquent - the options abound.

A visual analysis of such data can easily be obtained using R.

library(survminer)
library(survival)
library(KMsurv)
## Create survival curve from a survival object
#' Status is 1 if the event was observed at TimeAfterStart
#' It is set to 0 to mark the right-censored time
vintage.survival <- survfit(Surv(TimeAfterStart,Status) ~ Vintage, data = my.dataset)
## Generate cumulative incidence plot
ci.plot <- ggsurvplot(vintage.survival,
           fun = function(y) 1-y,
           censor = FALSE,
           conf.int = FALSE,
           ylab = 'Ratio event observed',
           xlab = 'Time after open',
           break.time.by = 30,
           legend = "bottom",
           legend.title = "",
           risk.table = TRUE,
           risk.table.title = 'Number of group',
           risk.table.col = "black",
           risk.table.fontsize = 4,
           risk.table.height = 0.35
           )

This produces a plot with a survival curve per group, and also includes the risk table. This table shows how many members of the group for whom no event was observed are still being followed at each point in time. Labelling these "at risk" stems of course from the original concept of survival analysis, where the event typically is the passing of the subject.

The fun = function(y) 1-y part actually reverses the curve, resulting in what is known as a cumulative incidence curve.

Survival/incidence curve and risk table

Underneath the plot, a risk table is added with no extra effort by passing risk.table = TRUE as a parameter to ggsurvplot.

Checking the trajectory of these curves for different groups of customers (with a different treatment plan, to stick to the terminology) is an easy way to verify whether actions are having the expected result.
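For completeness, the TimeAfterStart and Status columns fed into Surv() above can be derived from raw loan dates along these lines. This is a sketch only, with hypothetical column names (open_date and event_date, the latter NA when no event was observed) and an assumed cutoff marking the end of the observation window:

library(dplyr)
cutoff <- as.Date("2016-11-30") # end of the observation window
my.dataset <- loans %>%
  mutate(Status = as.integer(!is.na(event_date) & event_date <= cutoff),
         end_date = if_else(Status == 1, event_date, cutoff),
         TimeAfterStart = as.numeric(end_date - open_date))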

Posted Sat 10 Dec 2016 15:49:02 AWST Tags:

A lot of information on knitr is centered around using it for reproducible research. I've found it to be a nice way to abstract away mundane reporting though. It is as easy as performing the necessary data extraction and manipulation in an R script, including the creation of tables and graphs.

To develop the report template, simply source the R script within an Rmd one, per the example template below:

---
title: "My report"
date: "`r Sys.time()`" 
output: pdf_document
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
source('my-report.R')
```

Include some text about your report.

## Add a title.

Some further text.

```{r, echo=FALSE}
plot(my.plot.object)
kable(my.df.or.table)
```

When you are ready to create the report, the convenience of RMarkdown is hard to beat:

R -e "rmarkdown::render('~/my-report.Rmd',output_file='~/my-report.pdf')"

Thanks to the YAML header at the start of the report template, information like the report's title and target output format doesn't need to be repeated on the command line. This command can easily be scripted a bit further to include a date-time stamp in the output filename for instance (see the sketch below), and scheduled using cron.
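For instance, a small R script - which cron could call through Rscript - might add that date-time stamp along these lines (a sketch, with a hypothetical output directory):

library(rmarkdown)
render('~/my-report.Rmd',
       output_file = paste0('my-report-', format(Sys.time(), '%Y%m%d-%H%M'), '.pdf'),
       output_dir = '~/reports')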

Posted Mon 10 Oct 2016 21:02:06 AWST Tags:

Getting used to the grammar of ggplot2 takes some time, but so far it's not been disappointing. Wanting to split a scatterplot by segment, I used facet_grid. By default that shows a label on each subplot, using the values of the variable by which the plot is faceted.

As that often isn't very descriptive in itself, there needs to be a way to re-label these subplots. That way is as_labeller, as shown in the example code below.

Example:

ggplot(outputs, aes(x = date_var, y = value_var), alpha = 0.8) +
  geom_point(aes(y = value_var, colour = colour_var)) +
  geom_smooth() +
  theme(legend.position = "none", axis.text.y = element_blank(), axis.text.x = element_blank()) +
  scale_x_date(date_breaks = '1 week') +
  labs(y = "Value", x = "Date", title = "Example") +
  scale_colour_manual("Legend", values = named_coloring_vector) +
  scale_fill_manual("", values = c("grey12")) +
  facet_grid(. ~ Segment, labeller = as_labeller(c("yes" = "Segment A",
                                                   "no" = "Segment B")))

Output: Example plot with 2 facets labelled Segment B and Segment A

Posted Wed 05 Oct 2016 21:48:11 AWST Tags:

For unknown reasons, the Music application on my Nokia N9 does not always display the album cover where expected. Instead, it displays the artist name and album title. Reports by other users of this phone suggest this isn't an uncommon issue, but they unfortunately offer no confirmed insight into the root cause of the problem.

Fortunately, the symptoms of this problem are relatively easy to fix on a one-by-one basis.

In ~/.cache/media-art on the phone, copy the album art (in a JPEG file) to a file named using the following format:

album-$(echo -n "artist name" | md5sum | cut -d ' ' -f 1)-$(echo -n "album name" | md5sum | cut -d ' ' -f 1).jpeg

Replace artist name and album name with the appropriate values for the album, in lowercase.

This follows the Media Art Storage Spec.

Luckily, in most cases the above is not necessary and it suffices to store the cover picture as cover.jpg in the album's directory in ~/MyDocs/Music.

Posted Mon 03 Oct 2016 21:33:50 AWST

Where my first R package was more a proof of concept, I have now certainly left the beginneRs group by publishing an R package to a private GitHub repository at work. I have already used that package in some R scripts performing various real-time analyses of our operations, loading it through the devtools package, specifically its install_github() function.
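As a sketch - with a hypothetical repository name, and assuming a personal access token is available in the GITHUB_PAT environment variable - installing and loading the package looks like this:

library(devtools)
# Install straight from the private repository, then load as usual
install_github("my-org/mypackage", auth_token = Sys.getenv("GITHUB_PAT"))
library(mypackage)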

Next, I will have to check out the install_url() function, as at the time of writing my initial R package I had not quite figured out how I could actually use a package in a script without manually installing it first.

The ability to script regular reporting and to publish the resulting graphs and tables in emails (opening a gateway into, for instance, Slack) or in Excel files is very empowering. To an extent, I used to do this with VBA some years ago. Doing that in an integrated way with various data sources required a lot more work though, certainly given how the MS Windows environment until not so long ago lacked decent support for scripted operations for anyone but in-the-know IT professionals.

Posted Fri 23 Sep 2016 19:47:35 AWST

I have learned the hard way it is important to be aware that

Type-handling is a rather complex issue, especially with JDBC as different databases support different data types. RJDBC attempts to simplify this issue by internally converting all data types to either character or numeric values.

Source

This matters because RODBC does not have the same behaviour.

When switching a few R scripts from using RJDBC to access a MS SQL Server database over to RODBC, I ran into some odd problems.

First, I noticed as.Date(query.output$datecolumn) resulted in what looked like 2016-06-21 becoming 2016-06-22. That's right, R started adding a day to the date. as.Date(strptime(query.output$datecolumn, "%Y-%m-%d")) put a stop to that madness.
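One plausible cause - an assumption on my part rather than a confirmed diagnosis - is that as.Date() converts a POSIXct value via UTC by default, which can move the calendar day by one when the value was meant in local time (the direction depends on how the driver reports the timestamp):

x <- as.POSIXct("2016-06-22 07:00:00", tz = "Australia/Perth")
as.Date(x)                                # "2016-06-21" - converted via UTC, the day shifts
as.Date(x, tz = "Australia/Perth")        # "2016-06-22" - respects the local timezone
as.Date(strptime(format(x), "%Y-%m-%d"))  # "2016-06-22" - the workaround used above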

Another problem had to do with an XML value being returned by a query. The application generating that XML for some reason opts not to store it as an XML data type but instead uses a varchar. That makes it very hard to use XQuery, so I had opted to do the hard work in R by pulling the whole XML value into R - despite this making the retrieval of query results almost impossible. To convert that column to an XML data type in R, I was able to do sapply(response.xml$response, xmlParse) on the output of a SQL query using RJDBC. Once the output from the RODBC connection had to be processed, this needed to become sapply(response.xml$response, xmlParse, asText = TRUE). It is interesting that this wasn't needed for the RJDBC output.

So yes, type-handling is a rather complex issue.

Posted Fri 24 Jun 2016 14:17:16 AWST

Trying to configure obnam to use one repository for 3 clients using encryption has been a bit of a search.

Initialising the first client was straightforward. I simply set it up to use a gpg key for encryption per the manual. Since that key is only used for encrypting backups from this client, giving it no passphrase seemed a good option.

For the next client, things got a bit trickier. Since the backup repository is now encrypted, that client couldn't access it. The solution I ended up with was to temporarily ensure client 2 has access to client 1's secret key too.

On client 1: gpg --export-secret-key -a LONG_KEY > client1.private.key

That file I had to copy to the other client, and import it using:

On client 2: gpg --import client1.private.key

Now I could configure this client with its own gpg key and perform an initial backup.

After this, client 1's secret key can be removed again: gpg --delete-secret-key LONG_KEY followed by gpg --delete-key LONG_KEY.

(Not removing it defeats the purpose of having a specific key per client - the workaround above doesn't seem entirely sensible from that perspective either, as the secret key needs to be shared temporarily.)

The third client should have been easy, but gpg-agent made it a bit trickier. Obnam failed to run because it couldn't find gpg-agent. Several workarounds have been documented in the past, but they all stopped working as of version 2.1 of gpg-agent. I ended up having to modify ~/.bashrc as follows:

function gpg-update() {
    GPG_PID=$(pidof gpg-agent)
    GPG_AGENT_INFO=${HOME}/.gnupg/S.gpg-agent:$GPG_PID:1
    export GPG_AGENT_INFO
}

gpg-update

Posted Thu 09 Jun 2016 20:41:06 AWST

Continuing a long tradition of announcing firsts: I wrote an R package recently, and made it available on Projects, a new section of the website. (Talking about my website on said website is also not exactly new.)

It still needs further work, as it really only supports AU public holidays for now, but it's exciting to be able to use freely available public data to make analysis a bit easier. Knowing how many working days are left in the month is fundamental to forecasting a month-end result. Extensions of the package could include simple checks, like whether today is a working day, or more complex ones, like the number of working days between two given dates.
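As a rough illustration of the idea - not the package's actual API, the function and argument names below are made up - counting the working days left in the month given a vector of public holiday dates could look like this:

working.days.left <- function(today = Sys.Date(), holidays = as.Date(character())) {
  first.of.month <- as.Date(format(today, "%Y-%m-01"))
  first.of.next <- as.Date(format(first.of.month + 31, "%Y-%m-01"))
  days <- seq(today, first.of.next - 1, by = "day")            # remaining days, incl. today
  weekdays.only <- days[!format(days, "%u") %in% c("6", "7")]  # drop Saturdays and Sundays
  sum(!weekdays.only %in% holidays)                            # drop public holidays
}

working.days.left(as.Date("2016-06-01"), holidays = as.Date("2016-06-06")) # 22 weekdays in June 2016, minus WA Day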

In other news, I received a new gadget in the mail today.

RTL-SDR is a very cheap software defined radio that uses a DVB-T TV tuner dongle based on the RTL2832U chipset. With the combined efforts of Antti Palosaari, Eric Fry and Osmocom it was found that the signal I/Q data could be accessed directly, which allowed the DVB-T TV tuner to be converted into a wideband software defined radio via a new software driver.

Posted Wed 18 May 2016 21:09:02 AWST

This blog is powered by ikiwiki.