blog

What

After reading the first part of an Rcpp tutorial that compared native R and C++ implementations of a Fibonacci sequence generator, I set out to draw the so-called Golden Spiral using R.

Details

The libraries used in this example are the following:

library(ggplot2)
library(plotrix)

In polar coordinates, this special instance of a logarithmic spiral has a functional representation that simplifies to r(t) = e^(0.30635*t). For every quarter turn, the corresponding point on the spiral is a factor of phi further from the origin (r being that distance), with phi the golden ratio - the same number obtained by dividing any two sufficiently large successive numbers of a Fibonacci sequence. That is how the golden ratio, the golden spiral and Fibonacci sequences are linked concepts!

polar_golden_spiral <- function(theta) exp(0.30635*theta)
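
A quick numerical check of the quarter-turn property described above - the ratio of radii a quarter turn apart should come out near the golden ratio, just like the ratio of sufficiently large successive Fibonacci numbers (a sketch using only base R and the function just defined):

# The radius should grow by a factor of roughly phi every quarter turn (pi/2),
# the same ratio successive Fibonacci numbers converge to.
phi <- (1 + sqrt(5)) / 2                            # 1.618034...
polar_golden_spiral(pi/2) / polar_golden_spiral(0)  # ~1.618
fib <- c(1, 1)
for (i in 3:12) fib[i] <- fib[i - 1] + fib[i - 2]
fib[12] / fib[11]                                   # 144/89 = ~1.618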

Let's do two full turns. First, I create a sequence of angle values theta. Since 2 * PI corresponds to one full circle in polar coordinates, we need distances from the origin for angle values between 0 and 4 * PI.

seq_theta <- seq(0,4*pi,by=0.05)

dist_from_origin <- sapply(seq_theta,polar_golden_spiral)

Plotting the function using coord_polar in ggplot2 does not work as intended. Unexpectedly, the x axis keeps extending instead of circling back once a full circle is reached. It turns out coord_polar might not really be intended to plot elements already supplied in polar coordinates.

ggplot(data.frame(x = seq_theta, y = dist_from_origin), aes(x,y)) +
    geom_point() +
    coord_polar(theta="x")

failed attempt plotting golden spiral

To confirm that what I was trying to do is possible, I use a specialised plotting function instead:

plotrix::radial.plot(dist_from_origin, seq_theta, rp.type = "s", point.col = "blue")

Plotrix golden spiral

With that established and the original objective of the exercise achieved, it would still be nice to be able to accomplish this using ggplot2. To do so, the sequence created above needs to be converted to Cartesian coordinates. The rectangular equivalent of the golden spiral function r(t) defined above is a(t) = (r(t) cos(t), r(t) sin(t)). It's not too hard to come up with a hack to convert one to the other.

cartesian_golden_spiral <- function(theta) {
    a <- polar_golden_spiral(theta)*cos(theta)
    b <- polar_golden_spiral(theta)*sin(theta)
    c(a,b)
}

Apply that function to the same series of angles from above and stitch the resulting coordinates into a data frame. Note I'm enclosing the first expression in parentheses, which prints the result immediately - useful when running the script interactively.

(serie <- sapply(seq_theta,cartesian_golden_spiral))
df <- data.frame(t(serie))

Result

With everything ready in the right coordinate system, it's now only a matter of setting some options to make the output look acceptable.

ggplot(df, aes(x=X1,y=X2)) +
    geom_path(color="blue") +
    theme(panel.grid.minor = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank()) +
    scale_y_continuous(breaks = seq(-20,20,by=10)) +
    scale_x_continuous(breaks = seq(-20,50,by=10)) +
    coord_fixed() +
    labs(title = "Golden spiral",
     subtitle = "Another view on the Fibonacci sequence",
     caption = "Maths from https://www.intmath.com/blog/mathematics/golden-spiral-6512\nCode errors mine.",
     x = "",
     y = "")

ggplot2 version of Golden Spiral

Note on how this post was written.

After a long hiatus, I set about using emacs, org-mode and ESS together to create this post. All code is part of an .org file, and gets exported to markdown using the org-mode conversion - C-c C-e m m.

Posted Mon 16 Sep 2019 22:03:03 AWST Tags:

The May/June 2019 issue of Foreign Affairs contains an article by Christian Brose, titled "The New Revolution in Military Affairs".

What struck me while reading the article is how much of an analogy can be drawn between what is happening to businesses worldwide, and what the author writes about the future in military technology and its trailing adoption in the United States of America's military.

The transformation he describes is about the core process concerning militaries, the so-called "kill chain". Thanks to technological advances, including artificial intelligence, that process can be rapidly accelerated, offering a competitive advantage to the owner of the technology.

The following quotes struck me in particular:

Instead of thinking systematically about buying faster, more effective kill chains that could be built now, Washington poured money into newer versions of old military platforms and prayed for technological miracles to come.

The question, accordingly, is not how new technologies can improve the U.S. military’s ability to do what it already does but how they can enable it to operate in new ways.

A military made up of small numbers of large, expensive, heavily manned, and hard-to-replace systems will not survive on future battlefields, where swarms of intelligent machines will deliver violence at a greater volume and higher velocity than ever before. Success will require a different kind of military, one built around large numbers of small, inexpensive, expendable, and highly autonomous systems.

The same could be written about so many companies that haven't taken up the strategy of competing on analytics.

Replacing the U.S. military with the banking sector, for instance: formerly very profitable and seemingly unbeatable big banks have over the past decade found their banking software to be too rigid. Instead of investing in new products and services, they continued to rely on what they had been doing for the prior hundred years. They invested in upgrading their core systems, often with little payoff. While they were doing that, small fintech firms appeared, excelling at just a small fraction of what a bank considered its playing field. In those areas, these new players innovated much more quickly, resulting in far more efficient and effective service delivery.

At the core of many of these innovations lies data. The author likens China's stockpiling of data to that of oil, but the following quote was particularly relevant in how it describes the use of that stockpile of data to inform decision-making.

Every autonomous system will be able to process and make sense of the information it gathers on its own, without relying on a command hub.

The analogy is clear - for years, organisations have been trying to ensure they knew the "single source of truth". Tightly coupling all business functions to a central ERP system was usually the answer. Just like in the military, it can now often be better to have many small functions performed on the periphery of a company's systems, accepting some duplication of data and merely directional accuracy to deliver quicker, more cost-effective results - using expendable solutions. The challenges of communicating effectively between these semi-autonomous systems are noted.

Not insignificantly, the author posits that "future militaries will be distinguished by the quality of their software, especially their artificial intelligence" - i.e. countries are competing on analytics in the military sphere too.

The article ends with some advice to government leadership - make the transformation a priority, drive the change forward, recast cultures and ensure the correct incentives are in place.

Posted Tue 11 Jun 2019 20:46:22 AWST Tags:

Mindmap on setting up an analytics practice

Ideas courtesy of Abhi Seth, Head of Data Science & Analytics at Honeywell Aerospace.

Posted Tue 09 Apr 2019 07:59:58 AWST Tags:

Paul Romer may well be the first Nobel prize winner using Jupyter notebooks in his scientific workflow. On his blog, he explains his reasoning.

My key takeaway from the article: he's having fun.

Posted Fri 12 Oct 2018 20:00:01 AWST Tags:

It started off as an attempt to analyse some data stored in Apache Kafka using R, and ended up becoming the start of an R package to interact with Confluent's REST Proxy API.

While rkafka already allows creating a producer and a consumer from R, writing some R functions interfacing with the REST Proxy API was an interesting way to learn a bit more about Kafka's inner workings, and to demonstrate how easy it is to interact with any REST API from R thanks to httr.
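
As an illustration of the approach (a minimal sketch, not the package code itself), the snippet below talks to the REST Proxy with httr, assuming the proxy runs on its default port 8082 and that a hypothetical topic called testtopic exists:

library(httr)
library(jsonlite)

base_url <- "http://localhost:8082"

# List the topics the proxy knows about (v2 API).
topics <- GET(paste0(base_url, "/topics"),
              accept("application/vnd.kafka.v2+json"))
fromJSON(content(topics, as = "text", encoding = "UTF-8"))

# Produce a single JSON message to the (hypothetical) topic "testtopic".
payload <- toJSON(list(records = list(list(value = list(greeting = "hello from R")))),
                  auto_unbox = TRUE)
resp <- POST(paste0(base_url, "/topics/testtopic"),
             body = payload,
             content_type("application/vnd.kafka.json.v2+json"),
             accept("application/vnd.kafka.v2+json"))
status_code(resp)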

The result is available to clone on my git server.

Posted Fri 14 Sep 2018 21:53:53 AWST Tags:

For anyone working in analytics these days, the concept of big data is firmly established. Smart engineers have been developing cool technology to work with it for a while now. The Apache Software Foundation has emerged as a hub for many of these projects - Ambari, Hadoop, Hive, Kafka, Nifi, Pig, Zookeeper - the list goes on.

While I'm mostly interested in improving business outcomes by applying analytics, I'm also excited to work with some of these tools to make that easier.

Over the past few weeks, I have been exploring some of these tools, installing them on my laptop or a server and giving them a spin. Thanks to Confluent, the founders of Kafka, it is super easy to try out Kafka, Zookeeper, KSQL and their REST API. They all come in a pre-compiled tarball which just works on Arch Linux. (After trying to compile some of these, this is no luxury - these apps are very interestingly built...) Once unpacked, all it takes to get started is:

./bin/confluent start

I also spun up an instance of Nifi, which I used to monitor a (JSON-ised) apache2 webserver log. Every new line added to that log goes as a message to Kafka.

Apache Nifi configuration

A processor monitoring a file (tailing it) copies every new line over to another processor, which publishes it to a Kafka topic. The TailFile processor includes options for rolling filenames and for what delineates each message. I set it up to process a custom logfile from my webserver, defined to produce JSON messages instead of the somewhat cumbersome-to-process standard logfile output (defined in apache2.conf, enabled in the webserver conf):

LogFormat "{ \"time\":\"%t\", \"remoteIP\":\"%a\", \"host\":\"%V\", \"request\":\"%U\", \"query\":\"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\", \"size\":\"%O\" }" leapache

All the hard work is being done by Nifi. (Something like

tail -F /var/log/apache2/access.log | kafka-console-producer.sh --broker-list localhost:9092 --topic accesslogapache

would probably be close to the CLI equivalent on a single-node system like my test setup, with the -F option to ensure the log rotation doesn't break things. Not sure how the message demarcator would need to be configured.)

The above results in a Kafka message stream with every request hitting my webserver in real-time available for further analysis.
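
Since each message is just a JSON string with the fields from the LogFormat above, analysing them in R is straightforward - a small sketch with a made-up sample line:

library(jsonlite)

# Sample message (values made up, field names taken from the LogFormat above).
sample_msg <- '{ "time":"[11/Sep/2018:21:00:01 +0800]", "remoteIP":"203.0.113.7",
  "host":"example.org", "request":"/index.html", "query":"", "method":"GET",
  "status":"200", "userAgent":"curl/7.61.0", "referer":"-", "size":"5120" }'

# One parsed message becomes a one-row data frame, ready for further analysis.
as.data.frame(fromJSON(sample_msg), stringsAsFactors = FALSE)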

Posted Tue 11 Sep 2018 21:09:06 AWST Tags:

It appears to me the cross-industry standard process for data mining (CRISP-DM) is still, almost a quarter century after it was first formulated, a valuable framework to guide the management of a data science team. Start with building business understanding, followed by understanding the data and preparing it; then move on to modeling to solve the problem, evaluating the model, and finally deploying it. The framework is iterative, and allows for back-and-forth between these steps based on what's learned in the later steps.

CRISP-DM

It doesn't put too great an emphasis on scheduling the activities, but focuses on value creation.

The Observe-Orient-Decide-Act (OODA) loop from John Boyd seems to be an analogous concept. Competing businesses would then be advised to speed up their cycling through the CRISP-DM loop, as that's how Boyd stated advantage is obtained - by cycling through the OODA loop more quickly than one's opponent. Most interestingly, in both loops it's a common pitfall to skip the last step - deploying the model / acting.

OODA loop

(Image by Patrick Edwin Moran - Own work, CC BY 3.0)

Posted Tue 19 Jun 2018 20:51:36 AWST Tags:

I have been asked a few times recently about my management style. First, while applying for a position myself. Next, less expectedly, by a member of the organisation I joined, as well as by a candidate I interviewed for a position in the team.

My answer was not very concise, as I lacked the framework knowledge to be so.

Today, I believe I have stumbled on a description, on Adam Drake's blog, of the style I most often practice (or certainly aim to). Its name? Mission Command. (The key alternative being detailed command.)

Now this is an interesting revelation for more than one reason. I consider it a positive thing that I can now more clearly articulate how I naturally tend to work as a team leader. Reviewing the key principles also makes clear what is important to me:

  • Build cohesive teams through mutual trust.
  • Create shared understanding.
  • Provide a clear commander’s intent.
  • Exercise disciplined initiative.
  • Use mission orders.
  • Accept prudent risk.

Reviewing these principles in detail, it becomes clear this style of leadership should not be mistaken for laissez-faire. Providing clear commander's intent, creating shared understanding and using mission orders are very active principles for the leader. For the subordinate, the need to exercise disciplined initiative is clearly not a free-for-all either. The need for mutual trust for this to work cannot be emphasised enough.

Posted Wed 14 Feb 2018 15:38:33 AWST Tags:

Dries Buytaert wrote last week about intending to use social media less in 2018. As an entrepreneur developing a CMS, he has a vested interest in preventing the world from seeing the internet as being just Facebook, Instagram or Twitter (or in reversing that current state, maybe). Still, I believe he is genuinely concerned about the effect of using social media on our thinking. That is partly because I share the observation. Despite having been an early adopter, I disabled my Facebook account a year or two ago already. I'm currently in doubt whether I shouldn't do the same with Twitter. I notice it actually is not as good a source of news as classic news sites - headlines simply get repeated numerous times when major events happen, and other news is just as easily noticed browsing a traditional website. Fringe and mainstream thinkers alike in the space of management, R stats, computing hardware etc. are a different matter. While, as Dries notices, their micro-messages are typically not well worked out, they do make me aware of what they have blogged about - for those that actually still blog. So perhaps it's a matter of using my Nextcloud news reader more, maybe during dedicated reading time, no longer opening the Twitter homepage on my phone at random times throughout the day, and conceding that short statements without a more worked-out piece of content behind them are not all that useful.

The above focuses on consuming content of others. To foster conversations, which arguably is the intent of social media too, we might need something like webmentions to pick up steam too.

Posted Mon 08 Jan 2018 21:04:09 AWST Tags:

The Internet Archive contains a dataset from the NYC Taxi and Limousine Commission, obtained under a FOIA request. It includes a listing of each taxi ride in 2013, its number of passengers, distance covered, start and stop locations and more.

The dataset is a whopping 3.9 GB compressed, or shy of 30 GB uncompressed. As such, it is quite unwieldy in R.

As I was interested in summarised data for my first analysis, I decided to load the CSV files into a SQLite database, query it using SQL and store the resulting output as a CSV file again - far smaller, though, as I only needed two columns for each day of the year of data.

The process went as follows.

First extract the CSV file from the 7z compressed archive.

7z e ../trip_data.7z trip_data_1.csv

and the same for the other months. (As I was running low on disk space, I could only do two months at a time.) Next, import it into a SQLite db.

echo -e '.mode csv \n.import trip_data_1.csv trips2013' | sqlite3 trips2013.db

Unfortunately the header row uses ", " as its separator, so the column names end up starting with a space. This does not happen when importing from the sqlite3 command line - to be determined why. As a result, those column names need to be quoted in the query below.

Repeat this import for all the months - as mentioned, I did 2 at time.

Save the output we need in temporary csv files:

sqlite3 -header -csv trips2013.db 'select DATE(" pickup_datetime"), count(" passenger_count") AS rides, sum(" passenger_count") AS passengers from trips2013 GROUP BY DATE(" pickup_datetime");' > 01-02.csv

Remove the archives and repeat:

rm trip_data_?.csv
rm trips2013.db

Next, I moved on to the actual analysis work in R.

Looking at the number of trips per day on a calendar heatmap reveals something odd - the first week of August has very few rides compared to any other week. While it's known that people in NY tend to leave the city in August, such a sharp drop is suspicious.

Calendar heatmap of trips

Deciding to ignore August altogether, and zooming in on the occupancy rate of the taxis rather than the absolute number of rides, reveals an interesting insight - people travel together far more on weekends and public holidays!

Occupancy heatmap
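
For reference, a minimal sketch of how such an occupancy view could be built from the daily summaries - not the exact code behind the plots above, and the file names, column renaming and English weekday names are assumptions:

library(ggplot2)

# Read the daily summaries produced by the sqlite3 queries (assumed file names).
files <- c("01-02.csv", "03-04.csv", "05-06.csv",
           "07-08.csv", "09-10.csv", "11-12.csv")
daily <- do.call(rbind, lapply(files, read.csv))
names(daily) <- c("date", "rides", "passengers")
daily$date <- as.Date(daily$date)

# Occupancy rate: average number of passengers per ride, per day.
daily$occupancy <- daily$passengers / daily$rides
daily$weekday <- factor(weekdays(daily$date),
                        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                   "Friday", "Saturday", "Sunday"))
daily$week <- as.integer(format(daily$date, "%W"))  # week of year, Monday-based

# Simple calendar-style heatmap: weeks along x, weekdays along y.
ggplot(daily, aes(x = week, y = weekday, fill = occupancy)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(title = "Average passengers per NYC taxi ride, 2013",
       x = "Week of year", y = "", fill = "Occupancy")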

Just looking at the calendar heatmap, it's possible to determine that 1 Jan 2013 was a Tuesday, and to point out Memorial Day as the last Monday of May, Labour Day in September, Thanksgiving Day and even Black Friday at the end of November, and of course the silly season at the end of the year!

The dataset contains even more interesting information in its geo-location columns I imagine!

Posted Thu 30 Nov 2017 21:36:05 AWST Tags:

This blog is powered by ikiwiki.