Building an (somewhat) automated analysis suite on a Mac Part 2


Update: The same techniques outlined here will also work on Windows, though the specifics may vary - same with Part 1 (it's probably even easier to setup the database on Windows) 

Following on from Part 1 - it's unsurprisingly... Part 2! So we're at a stage now where we can read data in Excel from our database... but there's no real automation going on here, right?

Well, let's begin to solve that. There are a few goals here:
  1. Keep it simple, stupid
  2. Keep it flexible
  3. Expect change
  4. New data sources should be quick to add
  5. Keep it simple, stupid
To be honest, this is the goal of most good software systems, sometimes people chuck in DRY, but I'm not going to call it out as a goal - lest I take it to extremes and waste time. Though I will add, if you ignore DRY as a hard and fast design principle as I'm going to do, you should follow The Boy Scout Rule to identify and remove any duplication over time.

So first off, we need to think about what this system is going to look like, while it's fun and some people might think it's "Agile" (it isn't) to jump straight into the code without a plan - it's always a good idea to sketch things out and have a rough idea of where we are going. (Anecdote: I once had a co-worker tell me business plans were a waste of time - he then joined a new company as some kind of director/co-founder, needless to say when I got a call from one of their recruiters, I turned down the job on the spot.)

Thankfully in this instance our system is quite a simple one and the reason I'm breaking it down quite so much is because several people I've spoken with online recently  have expressed an interest in learning how to code, and I find that understanding some of these more philosophical aspects of programming - i.e. the "Why?" help you focus better on the "How?" - blindly copying and pasting bits of code from a tutorial will not teach you how to think like a programmer or think how to break a larger project down into easy to digest pieces. (Though it will teach you some of the "rules" of programming)

The first part of our system, we already solved in Part 1 - we have a database, and we have a way of accessing that database in a way where we can manipulate the data using common tools and leveraging (well, unless you are me and still need to go through a few of these) existing toolsets and knowledge we have.

But what about the rest of the system? Well, first we need to discuss a little bit about what we want to accomplish.

  1. Take some kind of data from an online source
  2. Do some basic manipulations and tidy up on it
  3. Put it in a database
  4. Make this process repeatable
  5. Make this process expandable
  6. Avoid duplicated data
So we have some defined objectives and goals now (yeah, I know they are kind of mixed together). To cut a long story short, we are going to build a system to download data from several online sources, massage the data so that it goes into a database, and use a combination of things to ensure we don't put duplicated data into the database when we re-run our program.

The observant amongst you will have noticed that I haven't yet mentioned "automation" - well, spoilers, the automation aspect will be in Part 3 of this series, however, our goal here is to give us a framework that we can automate, so they're related ;)

So the easiest way to do this is to make use of REST API calls, in short we will basically download some JSON data from a web URL. For those not familiar with JSON, there's an example below, it's basically just some keys and some values stored in a format that is fairly easy for both machines and people to read (though it can get a bit confusing for people as the data complexity increases).

For this demo, we're going to use data about Bitcoin, the prime reason for this is that there is lots of readily available data that is free - and as can be seen from some of the data sources, there are some interesting choices available to us that would be much more work to put together for other assets (though I look forward to doing something like this in the future once I've got this system working!)

So, our system diagram looks a little bit like this:

(Diagram created on draw.io)

We download data from many third party systems, processes it, and put it in our database. Before we jump into programming lets list out some of the desired data sources - not all of these have a publicly available API, but they are all public data, in some instances we might have to do a bit of research to find out how to pull down the data which are out of the scope of this post.

The following table is directly copy pasted from a document I wrote when I first had this idea.


Name
Documentation
Notes
Desired Data
Tiingo
Gives good prices across many exchanges
Price (OHLC), Trading Volume, Number of Trades
Tiingo News
Good news feed with many sources
News volume, possibly sentiment
StockTwits
Sadly public API is quit limited, but maybe can screenscrape 
Public sentiment
CoinmarketCap
Data about the market cap and current rating of BTC
Rank, Supply, Market Cap, 24h volume, etc
Blockchain.info
Information about the state of the BTC network
TX Rate, Current Difficulty, Next Retarget, Miners Rev, etc, etc 
Google Trends
Get an idea of how much bitcoin/btc is being search for over time
bitcoin, btc, blockchain


Another idea that I had later, which I ended up putting in, was GitHub - from here we can get information about the number of times Bitcoin's code was changed and when (along with other metrics) given the uniqueness of Cryptocurrencies as an asset class, I thought this was something that could prove useful. To think broader picture - perhaps if you are looking at a company that is doing some kind of open source software as part of it's business you may wish to do something similar.

You'll notice Tiingo in there again, I tend to use Tiingo quite a lot as it's cheap, has pretty clean, useful data compared to most other sources and the founder Rishi is a really nice guy. However, you could also pull some of this same data from (i.e. the OHLC crypto prices) from the exchanges directly - however the consolidated feed from Tiingo saves me quite a bit of time. Another consideration and one that isn't yet catered to, is that the Tiingo feed provides a weighted average across all the exchanges (it's also possible to pull information about each price on each exchange) so a future expansion might be to store not only this, but also the data from each exchange individually to see if there are any Stat Arb opportunities or perhaps some other oddities in the data we can make use of.

Stop... Code Time!

So, we now have a rough idea of what we want our program to do, it's time to focus on the "How?" (with some more theory, sorry) we are going to do it.

The language I'm going to use for this is Node.js, mostly because I'm familiar with it, not really because it is superior to other languages - Python, Ruby or Go could be an equally good fit here (because all of them have pretty native support for making web requests and dealing with the responses, though really pretty much any programming language will work just fine for this task).

That being said, Node.js does give us a few advantages, out of the box it is designed to run things asynchronously - which means we can request things from multiple data sources at the same time, without waiting for the first one to complete. In the real world, what all this means is - our program will run comparatively quickly (lets be clear here - this isn't going to be capable of doing any sort of HFT, that's not our goal and frankly, it's beyond my abilities!) and the J in JSON stands for JavaScript, so our data is already in a format that Node.js can easily understand.

If we again think about our overall system, our program needs to know about a few things.
  1. The database we set up
  2. Where and how to get data from
  3. What that data is going to look like
The first two should be pretty self explanatory, but a few of you might be asking about point 3. Our program needs to be aware of what the data incoming looks like, because we are going to translate it form JSON into a Database record, effectively, we are going to map one data structure to another so in order to do so, our program needs to know what the incoming data is going to look like in order to handle it.

To assist us with this, we are going to use what's known as Object Relational-Mapping or (ORM) the main beauty is that this handles all the SQL or database queries for us, so we don't need to think about them while we are writing our program, we just focus on the data. Obviously for this to work, we do need to tell the ORM Tool (or simply ORM for short) a bit about what our data looks like.

As a human we do this by looking at the documentation for the API we want to use, and also some examples of the data.

The particular ORM I am using is one called Sequelize - again I choose this mostly because I am familiar with it, but it's also a fairly stable and well tested framework at this point, we are going to use the concept of "models" to define our data, and also put some validations on it.

So if we have some JSON data like this (from CoinmarketCap), with a little bit of documentation reading and guesswork...

We can end up with a "model" in our code like the below to represent it.

Now, this could be further optimised still, in terms of data types and validations, but again I was trying to get this up and running quite quickly.

The key point here to note is that we have defined one of our entries as a "Primary Key" this, combined with the natural power of a the database is what allows us to avoid duplication. The Primary Key must be unique (so we do have to be a bit careful when we select it!) this lets us use a very lazy method to ensure uniqueness, we just try to insert any data we get from our remote sources and if the Primary Key is the same as something in the database, it will refuse to insert it in - simple! Another tactic we could use is the UPSERT which will attempt to either UPDATE or INSERT - in short if the Primary Key matches it will update the existing data, otherwise it will simply update it.

The other thing to note is that we set a data-type for everything so the database knows what kind of data it's storing (by default JSON is effectively just treated as text, which isn't really what we want) FLOAT is generally not deemed to be the best fit for financial data, but as I'm not going to be doing any operations on it within the database
(we'll be using Excel for that), I think it will be ok.

So.. there's a representation of the data to be stored and it exists in code... The next thing we need is something to grab that data and convert it into these models, thankfully, that too is quite simple.
We do have to do a bit of magic here, as we get multiple entries from our data source to store in the database, none of these queries by default will wait for the first to complete before starting the second and later on in our program, we do need to be able to determine if we have completed all the downloads and inserts before we exit the program (otherwise we would exit too early and lose data). That's effectively what the "Promise.all" is all about in the "_storeData()" part of this program.

I won't go too much into details about each line of the code, but it's wrapped up in a Class and to do this I'm making use of JavaScript's new ES6 features. The reason for this is that I want to create quite a few of these data-grabbers, but I want them to be standardised, this means I can create a very simple bit of code that sits above these, which loads all of them in, does some configuration work, tells them to start work and waits till they are all done before exiting.

At the very top of our class, we "require" our model, this simply allows this bit of code to know about the model we defined earlier, the other main thing we bring in here is "request-promise" which is just a helper library to make web-requests super easy.

The rest of this class is pretty simple, we have one method which sets the "ticker" we are going to use (different APIs call bitcoin by different names - and remember in the future, we want to try to re-use some of this code!) - I mentioned I wouldn't follow DRY throughout this, and actually, this particular method is the same in each data grabbing class - we could use the power of inheritance to write it only once, but for so little code, I don't really see it as being worth it - though as I scale this up it will likely be one of the first things to be tidied up.

Next up we have beginImport() and _importData() - these are actually sort of the same thing, just one is intended to be used within the context of the class, and the other is for things external to it. JavaScript unfortunately has no real way of enforcing "private" methods but this is good enough for me. So in summary beginImport() is there for things outside the class to use, and _importData() is what does all the real work.

So in _importData() all we are doing is setting up the parameters for our web request, sending it off and then stating that we want _storeData() to take care of the results.

In store data we very simply bind the data from the JSON returned to our model - also note the use of the word "upsert" - so this will immediately store it in the database and let us know when it's done.

That's really it!

Finally we have a wrapper around all of these models and "data-grabbers" that just sets up some config and runs them all one after the other - it's possible to make this even more dynamic, but right now it's just as easy to do it this way. The only magic in this wrapper is the "Promise.all" statement - Promises are a bit complicated, but basically they can return at any time - so you can't rely on their order, in our case we do need to know when they have all returned so that we can exit our program (if we blindly run them then exit, our program will have shut down before we get the data back!)

The full source code for all this is available on github - I haven't yet got round to documenting exactly how to run it, but promise I will do! This will also be covered in Part 3 where I'm going to transfer the database and the code onto a Raspberry Pi and make our program run on a schedule so that whenever we open up Excel we have the latest data available to us!


Hope this was useful to some! Peace out homeslices.



Comments

Popular posts from this blog

Sentiment Analysis with Tiingo and Google

Building an (somewhat) automated analysis suite on a Mac Part 1