This is the third in a multi-part series in which we explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important. Read Part 1 here.

Part 2 provides a walk-through of setting up Keras and TensorFlow for R using either the default CPU-based configuration, or the more complex and involved (but well worth it) GPU-based configuration under the Windows environment. Read Part 2 here.

Part 3 is an introduction to the model building, training and evaluation process in Keras. We train a simple feed forward network to predict the direction of a foreign exchange market over a time horizon of one hour and assess its performance.

Now that you can train your deep learning models on a GPU, the fun can really start. By the end of this series, we’ll be building interesting and complex models that predict multiple outputs, handle the sequential and temporal aspects of time series data, and even use custom cost functions that are particularly relevant to financial data. But before we get there, we’ll start with the basics.

In this post, we’ll build our first neural network in Keras, train it, and evaluate it. This will enable us to understand the basic building blocks of Keras, which is a prerequisite for building more advanced models.

## Problem Formulation

There are numerous possible ways to formulate a market forecasting problem. For the sake of this example, we will forecast the direction of the EUR/USD exchange rate over a time horizon of one hour. That is, our model will attempt to classify the next hour’s market direction as either up or down.

### Data

Our data will consist of hourly EUR/USD exchange rate history obtained from FXCM (**IMPORTANT**: read the caveats and limitations associated with using past market data to predict the future here). Our data covers the period 2010 to 2017.

### Features

Our features will simply consist of a number of variables related to price action:

- Change in hourly closing price
- Change in hourly highest price
- Change in hourly lowest price
- Distance between the hourly high and close
- Distance between the hourly low and close
- Distance between the hourly high and low (the hourly range)

We will use several past values of these variables, as well as the current values, to predict the target. We’ll also include the hour of day as a feature in the hope of capturing intraday seasonality effects.
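To make the feature list concrete, here is a minimal base R sketch that computes the same kinds of variables from a tiny, made-up table of hourly bars. The features used in this post are actually generated by the Zorro script, so the column names, the toy prices and the sign conventions for the "distance" features are assumptions made purely for illustration.

```r
# Toy hourly bars; the prices are invented purely for illustration
bars <- data.frame(
  hour  = c(9, 10, 11, 12),
  high  = c(1.1012, 1.1020, 1.1031, 1.1027),
  low   = c(1.1001, 1.1008, 1.1015, 1.1010),
  close = c(1.1008, 1.1015, 1.1024, 1.1012)
)

features <- data.frame(
  hour       = bars$hour,                 # hour of day (unscaled here)
  d_close    = c(NA, diff(bars$close)),   # change in hourly closing price
  d_high     = c(NA, diff(bars$high)),    # change in hourly highest price
  d_low      = c(NA, diff(bars$low)),     # change in hourly lowest price
  high_close = bars$high - bars$close,    # distance between high and close
  close_low  = bars$close - bars$low,     # distance between low and close
  range      = bars$high - bars$low       # hourly range
)

# The first bar has no previous bar to difference against, so drop it
features <- features[-1, ]
features
```

Scaling and lagging would then be applied on top of raw features like these.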

### Feature scaling

Training of neural networks normally proceeds more efficiently if we scale our input features to force them into a similar range. There are various scaling strategies throughout the deep learning literature (see for example Geoffrey Hinton’s Neural Networks for Machine Learning course), but scaling remains something of an art rather than a one-size-fits-all procedure.

The standard approach to scaling involves normalizing the *entire* data set using the mean and standard deviation of each feature in the *training* set. This prevents data leakage from the test and validation sets into the training set, which can produce overly optimistic results. The problem with this approach for financial data is that it often results in scaled test or validation data that winds up being way outside the range of the training set. This is related to the problem of non-stationarity of financial data and is a significant issue. After all, if a model is asked to predict on data that is very different to its training data, it is unlikely to produce good results.

One way around this is to scale data relative to the recent past. This ensures that the test and validation data is always on the intended scale. But the downside is that we introduce an additional parameter to our model: the amount of data from the recent past that we use in our scaling function. So we end up introducing another problem to solve an existing one.
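Here is one way that idea could be sketched in base R. This is not the scaling function from the Zorro script; the function name rolling_scale, the window length of 50 and the choice to exclude the current value from the window are all assumptions made for illustration.

```r
# Scale each observation by the mean and standard deviation of the
# `window` observations that precede it. The current value is excluded
# from the window, so no future information is used.
rolling_scale <- function(x, window = 50) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n <= window) return(out)
  for (i in (window + 1):n) {
    past <- x[(i - window):(i - 1)]
    out[i] <- (x[i] - mean(past)) / sd(past)
  }
  out
}

set.seed(1)
# Toy non-stationary series: volatility doubles halfway through
x <- c(rnorm(500, sd = 1), rnorm(500, sd = 2))
z <- rolling_scale(x, window = 50)

# Compare the spread of the raw and scaled series in each half
c(raw_first = sd(x[51:500]), raw_second = sd(x[551:1000]))
c(scaled_first = sd(z[51:500]), scaled_second = sd(z[551:1000]))
```

The scaled series should stay on a broadly similar scale in both halves, whereas the raw series does not. The price is the extra window parameter, which is exactly the trade-off described above.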

Like I said, feature scaling is something of an art form, particularly when dealing with data as poorly behaved as financial data!

We’ll do our model building and experimentation in R, but first we need to generate our data. There is a Zorro script named ‘keras_data_gen.c’ for creating our targets and scaled features, and for exporting that data to a CSV file, in this download link. The script will allow you to code your own features and targets, use different scaling strategies, and generate data for different instruments. Just make the changes, then click ‘Train’ on the Zorro GUI to export the data to file. If you’d prefer to just get your hands on the data used in this post, it’s also available via the download link.

Our target is the direction of the market over a period of one hour, which implies a classification problem. The target exported in the script is the actual dollar amount made or lost by going long the market at 0.01 lots, exclusive of trading costs. We need to convert this to a factor reflecting the market’s movement either up or down. More on this below.

Let’s import our data into R and take a closer look. First, here’s a time series plot of the first ten days of our scaled features:

You can see that our features are roughly on the same scale. Notice the first feature, V1, which corresponds to the hour of the day. It has been scaled using a slightly different approach to the other variables to ensure that the cyclical nature of that variable is maintained. See the code in the download link above for details.

Next, here’s a scatterplot matrix of our variables and target (the first ten days of data only):

Now that we’ve got our data, we’ll see if we can extract any predictive information using deep learning techniques. In this post, we’ll look at fully connected feed-forward networks, which are kind of like the ‘Hello World’ example of deep learning. In later posts, we’ll explore some more interesting networks.

## Fully Connected Feed Forward Networks

A fully connected feed forward network is one in which every neuron in a particular layer is connected to every neuron in the subsequent layer, and in which information flows in one direction only, from input to output.
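To make that definition concrete, here is a minimal base R sketch of a forward pass through such a network. The layer sizes and random weights are made up, and the choice of relu for the hidden layers and sigmoid for the output simply mirrors the model we build later in this post. It is an illustration of the idea, not how Keras implements it.

```r
relu    <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass through a stack of fully connected layers.
# `weights` and `biases` are lists with one entry per layer.
forward <- function(x, weights, biases) {
  a <- x
  n_layers <- length(weights)
  for (l in seq_len(n_layers)) {
    # every input feeds every neuron: a matrix product plus a bias per neuron
    z <- sweep(a %*% weights[[l]], 2, biases[[l]], "+")
    # relu on hidden layers, sigmoid on the output layer
    a <- if (l < n_layers) relu(z) else sigmoid(z)
  }
  a
}

set.seed(42)
sizes <- c(4, 5, 5, 1)  # 4 inputs, two hidden layers of 5 neurons, 1 output
weights <- lapply(1:3, function(l) {
  matrix(rnorm(sizes[l] * sizes[l + 1]), nrow = sizes[l], ncol = sizes[l + 1])
})
biases <- lapply(1:3, function(l) rep(0, sizes[l + 1]))

x <- matrix(rnorm(3 * 4), nrow = 3)  # three observations, four features each
p <- forward(x, weights, biases)
p  # one value between 0 and 1 per observation
```

Information only ever moves from one layer to the next, which is what "feed forward" means here.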

Here’s a schematic of such a network with an input layer, two hidden layers and an output layer consisting of a single neuron (source: datasciencecentral.com):

### Input data processing

It makes sense that our network would likely benefit from using not only the features for the current time step, but also a number of prior values as well, in order to predict the target. That means that we need to create features out of lagged values of our raw feature variables.

Thankfully, that’s easily accomplished using base R’s embed() function, which returns only complete rows and so implicitly drops the first n observations, where n is the number of lags to use as features. Here’s a function which returns an expanded data set consisting of the current features as well as their lagged values. It assumes that the target is in the final column (and doesn’t embed lagged values of the target) and drops the corresponding first n values from the target column.

```r
# function for creating features from lagged variables
lag_variables_to_features <- function(data, num_lags = 1) {
  # embed() returns only complete rows, so the first num_lags rows are dropped;
  # assumes the target is in the last column
  d <- embed(data[, -ncol(data)], num_lags + 1)
  # add column for target, dropping the first num_lags values
  d <- cbind(d, data[(num_lags + 1):nrow(data), ncol(data)])
  return(d)
}
```

Let’s test the function and take a look at its output:

```r
# test lagging function
set.seed(503)
dat <- replicate(3, rnorm(10, 0, 1))
dat
#               [,1]        [,2]       [,3]
#  [1,]  0.355125070 -0.42202083  2.2040012
#  [2,] -0.778893409 -0.03744167  0.4128119
#  [3,] -0.757356957 -0.20609016  1.0322519
#  [4,]  2.329800607  2.01835389  0.7804746
#  [5,]  0.283974926 -0.60559854  2.5843431
#  [6,]  1.281025216 -0.28414168  0.2339200
#  [7,] -0.002363249  0.96044445  1.3501947
#  [8,]  1.033770690  0.74774752 -0.4097266
#  [9,] -0.431933268 -0.01286499 -0.3662180
# [10,] -0.342867464 -0.71862991 -1.0912861

dat <- lag_variables_to_features(dat, 2)
dat
#              [,1]        [,2]         [,3]        [,4]         [,5]        [,6]       [,7]
# [1,] -0.757356957 -0.20609016 -0.778893409 -0.03744167  0.355125070 -0.42202083  1.0322519
# [2,]  2.329800607  2.01835389 -0.757356957 -0.20609016 -0.778893409 -0.03744167  0.7804746
# [3,]  0.283974926 -0.60559854  2.329800607  2.01835389 -0.757356957 -0.20609016  2.5843431
# [4,]  1.281025216 -0.28414168  0.283974926 -0.60559854  2.329800607  2.01835389  0.2339200
# [5,] -0.002363249  0.96044445  1.281025216 -0.28414168  0.283974926 -0.60559854  1.3501947
# [6,]  1.033770690  0.74774752 -0.002363249  0.96044445  1.281025216 -0.28414168 -0.4097266
# [7,] -0.431933268 -0.01286499  1.033770690  0.74774752 -0.002363249  0.96044445 -0.3662180
# [8,] -0.342867464 -0.71862991 -0.431933268 -0.01286499  1.033770690  0.74774752 -1.0912861
```

You can see that the function returns a new dataset with the current features and their last two lagged values, while the target remains unchanged in the final column. Note that the first two rows, for which a full set of lags isn’t available, are automatically dropped.

Essentially, this approach makes new features out of lagged values of each feature. But here’s the thing about feed forward networks: they don’t distinguish between more recent values of our features and older values. Obviously the network differentiates between the different features that we create out of lagged values, and has the ability to discern relationships between them, but it doesn’t explicitly factor in the sequential nature of the data.

That’s one of the major limitations of fully connected feed forward networks applied to time series forecasting exercises, and one of the motivators of recurrent architectures, which we will get to soon enough.

### Introducing the Keras sequential model

Now that we can process our input data, we can start experimenting with the model building process. The best place to start is Keras’ sequential model, which is essentially a paradigm for constructing deep neural networks, one layer at a time, under the assumption that the network consists of a linear stack of layers and has only a single set of inputs and outputs. You’ll find that this assumption holds for the majority of networks that you build, and it provides a very modular and efficient method of experimenting with such networks. We’ll use the sequential model quite a lot over the coming posts before getting into some more complex models that don’t fit this paradigm.

In Keras, the model building and exploration workflow typically consists of the following steps:

- Define the input data and the target. Split the data into training, validation and test sets.
- Define a stack of layers that will be used to predict the target from the input. This is the step that defines the network architecture.
- Configure the model training process with an appropriate loss function, optimizer and various metrics to be monitored.
- Train the model by repeatedly exposing it to the training data and updating the network weights according to the loss function and optimizer chosen in the previous step.
- Evaluate the model on the test set.

Let’s go through each step.

#### Set up our data

Note that we convert our target into a binary outcome, which enables us to build a classifier.

```r
## load, process and split data ##

# load
path <- "C:/Users/Kris/Data/"
XY <- read.csv(paste0(path, 'EURUSD_L_2010_2017.csv'), header = FALSE)
XY <- as.matrix(XY)

# create lags
lags <- 7
proc <- lag_variables_to_features(XY, lags)

# split into training, validation and test sets
train_length <- floor(0.5 * nrow(proc))
val_length <- floor(0.25 * nrow(proc))

X_train <- proc[1:train_length, -ncol(proc)]
Y_train_raw <- proc[1:train_length, ncol(proc)]
Y_train <- ifelse(Y_train_raw > 0, 1, 0)

X_val <- proc[(train_length + 1):(train_length + val_length), -ncol(proc)]
Y_val_raw <- proc[(train_length + 1):(train_length + val_length), ncol(proc)]
Y_val <- ifelse(Y_val_raw > 0, 1, 0)

X_test <- proc[(train_length + val_length + 1):nrow(proc), -ncol(proc)]
Y_test_raw <- proc[(train_length + val_length + 1):nrow(proc), ncol(proc)]
Y_test <- ifelse(Y_test_raw > 0, 1, 0)
```

#### Define network architecture

Our model is a fully connected feed forward network with three hidden layers, each of which consists of 150 neurons with the rectified linear (‘relu’) activation function. If you need a refresher on activation functions, check out this post on neural network basics.

layer_dense() defines a fully connected layer – that is, one in which each input is connected to every neuron in the layer. Note that for the first layer, we need to define the input shape, which is simply the number of features in our data set. We only need to do this on the first layer; each subsequent layer gets its input shape from the output of the prior layer.

layer_dense() has many arguments in addition to the activation function that we specified here, including the weight initialization scheme and various regularization settings. We use the defaults in this example.

Keras implements many other layers, some of which we’ll explore in subsequent posts.

In this example, our network terminates with an output layer consisting of a single neuron with the sigmoid activation function. This activation function converts the output to a value between 0 and 1, which we interpret as the probability associated with the positive class in a binary classification problem (in this case, the value 1, corresponding to an up move).
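Putting that description into code, the model definition might look something like the sketch below. It is reconstructed from the text in this section and from the parameter counts in the model summary, so treat it as an approximation. In particular, the input size of 56 (7 features, each at the current hour plus 7 lags) is an inference; in practice you would pass ncol(X_train).

```r
library(keras)

# Assumed input size: 7 raw features x (7 lags + the current value)
n_features <- 56

model <- keras_model_sequential() %>%
  # first layer: the input shape must be given explicitly
  layer_dense(units = 150, activation = "relu", input_shape = c(n_features)) %>%
  # later layers infer their input shape from the previous layer
  layer_dense(units = 150, activation = "relu") %>%
  layer_dense(units = 150, activation = "relu") %>%
  # single sigmoid neuron: probability of the positive (up) class
  layer_dense(units = 1, activation = "sigmoid")
```

Because the sequential model is just a linear stack, adding or removing a hidden layer is a one-line change.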

To get an overview of the model, call summary(model) and observe the output:

```
___________________________________________________________________________________________________
Layer (type)                                 Output Shape                            Param #
===================================================================================================
dense_1 (Dense)                              (None, 150)                             8550
___________________________________________________________________________________________________
dense_2 (Dense)                              (None, 150)                             22650
___________________________________________________________________________________________________
dense_3 (Dense)                              (None, 150)                             22650
___________________________________________________________________________________________________
dense_4 (Dense)                              (None, 1)                               151
===================================================================================================
Total params: 54,001
Trainable params: 54,001
Non-trainable params: 0
___________________________________________________________________________________________________
```

This model architecture could be better described as ‘wide’ as opposed to ‘deep’ and it consists of around 54,000 trainable parameters. This is more than the number of observations in our data set, which means the network has more than enough capacity to overfit.
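If you want to check where those parameter counts come from, the arithmetic is easy to reproduce. Each dense layer has one weight per input-neuron pair plus one bias per neuron. The sketch below assumes 56 inputs to the first layer, which is what a first-layer count of 8,550 implies for 150 neurons.

```r
# Parameters in a dense layer: (inputs + 1 bias) x neurons
dense_params <- function(n_inputs, n_units) (n_inputs + 1) * n_units

layer_inputs <- c(56, 150, 150, 150)  # 56 inputs assumed: 7 features x 8 time steps
layer_units  <- c(150, 150, 150, 1)

params <- dense_params(layer_inputs, layer_units)
params       # 8550 22650 22650 151
sum(params)  # 54001
```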

#### Configure the training process

Configuration of the training process is accomplished via the keras::compile() function, in which we specify a loss function, an optimizer, and a set of metrics to monitor during training. Keras implements a suite of loss functions, optimizers and metrics out of the box, and in this example we’ll choose some sensible defaults:

```r
model %>% compile(
  loss = 'binary_crossentropy',
  # note: newer releases of the keras package name this argument learning_rate
  optimizer = optimizer_rmsprop(lr = 0.0001),
  metrics = c('accuracy')
)
```

The ‘binary_crossentropy’ loss function is standard for binary classifiers and the optimizer_rmsprop() optimizer is usually a sensible default. Here we specify a learning rate of 0.0001, but finding a sensible value typically requires some experimentation. Finally, we tell Keras to keep track of our model’s accuracy, as well as the loss during the training process.
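For intuition about what that loss measures, here is a small base R version of mean binary cross-entropy. It is a sketch for illustration rather than the Keras implementation, and the clipping constant eps is an assumption to guard against taking the log of zero.

```r
# Mean binary cross-entropy between 0/1 labels and predicted probabilities
binary_crossentropy <- function(y_true, p_pred, eps = 1e-7) {
  p <- pmin(pmax(p_pred, eps), 1 - eps)  # keep probabilities away from 0 and 1
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

y <- c(1, 0, 1, 1, 0)

# A model that always outputs 0.5 scores log(2), roughly 0.693
coin_flip <- binary_crossentropy(y, rep(0.5, 5))

# Confident, correct predictions score lower
confident <- binary_crossentropy(y, c(0.9, 0.1, 0.8, 0.7, 0.2))

c(coin_flip = coin_flip, confident = confident)
```

The coin-flip value is a handy reference point: a loss only slightly below 0.693 on a roughly balanced target means the predictions are only slightly better than always guessing 0.5.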

An important consideration regarding loss functions for financial prediction is that the standard loss functions rarely capture the realities of trading. For example, consider a regression model that predicts a price change over some time horizon trained using the mean absolute error of the predictions. Say the model predicted a price change of 20 ticks, but the actual outcome was 10 ticks. In practical trading terms, such an outcome would result in a profit of 10 ticks – not a terrible outcome at all. But that result is treated the same as a prediction of 5 ticks that resulted in an actual outcome of -5 ticks, which would result in a loss of 5 ticks in a trading model. That’s because the loss function is only concerned with the magnitude of the difference between the predicted and actual outcomes – but that doesn’t tell the full story. Clearly, we’d like to penalize the latter error more than the former. To do that, we need to implement our own custom loss functions. I’ll show you how to do that in a later post, but for now it’s important to be cognizant of the limitations of our model training process.
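The two cases from that example are easy to put side by side in R. The trading rule used here, trading one unit in the direction of the forecast, is an assumption made for illustration.

```r
predicted <- c(20, 5)    # forecast price change, in ticks
actual    <- c(10, -5)   # realized price change, in ticks

# Absolute error: both predictions contribute the same amount to the loss
abs_error <- abs(predicted - actual)   # 10 10

# Assumed trading rule: trade one unit in the direction of the forecast
trade_pnl <- sign(predicted) * actual  # 10 -5

data.frame(predicted, actual, abs_error, trade_pnl)
```

The loss sees two identical errors, while the trading account sees a winner and a loser.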

#### Train the model

We can train our model using keras::fit(), which exposes our model to successive batches of training data, updating the network’s weights after each batch. Training progresses for a specified number of epochs and performance is monitored on both the training and validation sets.

We would normally like to stop training at the number of epochs that maximize the model’s performance on the validation set. That is, at the point just before the network starts to overfit. The problem is we can’t know *a priori* how many training epochs this requires.
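As a rough sketch of what a training call can look like, the snippet below fits a small toy model on random data and saves the best model seen so far, so that it can be reloaded after training finishes. Everything here is illustrative: the toy data and model, the epoch count, the batch size and the file location are placeholders rather than the settings used in this post. It also monitors validation loss for portability, since the name of the validation accuracy metric differs between Keras versions (val_acc in older releases, val_accuracy in newer ones).

```r
library(keras)

set.seed(1)
# Toy stand-ins so the snippet runs on its own;
# in practice use X_train, Y_train, X_val and Y_val
X_tr <- matrix(rnorm(400 * 56), ncol = 56)
Y_tr <- rbinom(400, 1, 0.5)
X_va <- matrix(rnorm(100 * 56), ncol = 56)
Y_va <- rbinom(100, 1, 0.5)

toy_model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(56)) %>%
  layer_dense(units = 1, activation = "sigmoid")

toy_model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "rmsprop",
  metrics = c("accuracy")
)

# Hypothetical location for the saved model
filepath <- file.path(tempdir(), "best_model.h5")

# Overwrite the saved model only when the monitored metric improves
checkpoint <- callback_model_checkpoint(
  filepath = filepath,
  monitor = "val_loss",
  save_best_only = TRUE
)

history <- toy_model %>% fit(
  X_tr, Y_tr,
  epochs = 5,        # placeholder value
  batch_size = 32,   # placeholder value
  validation_data = list(X_va, Y_va),
  callbacks = list(checkpoint),
  verbose = 0
)
```

Calling plot(history) afterwards draws the training and validation curves by epoch. Saving only the best model sidesteps the problem of not knowing the right number of epochs in advance: train for longer than you expect to need and keep the best checkpoint.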

We can see that loss on the training set continuously decreases while accuracy almost continuously increases as training progresses. That is expected given the power of our network to overfit. But note the small decrease in validation loss and the bump in validation accuracy that we also get out to about 40 epochs before stalling.

A validation accuracy of a little under 53% is certainly not the sort of result that would turn heads in the classic applications of deep learning, like image classification. But trading is an interesting application, because we don’t necessarily need the same sort of performance to make money. But is a validation accuracy of 53% enough to give us some out of sample profits? Let’s find out by evaluating our model on the test set.

#### Evaluate the model out of sample

Here’s how to remove the fully trained model, load the model with the highest validation accuracy and evaluate it on the test set, with the output shown below the code:

```r
rm(model)
model <- keras:::keras$models$load_model(filepath)
model %>% evaluate(X_test, Y_test)

# output:
# 12004/12004 [==============================] - 2s 197us/step
# $loss
# [1] 0.691
#
# $acc
# [1] 0.523
```

We end up with a test set accuracy that is only slightly worse than our validation accuracy.

But accuracy is one thing, profitability is another. To assess the profitability of our model on the test set, we need the actual predictions on the test set. We can get the predicted classes via predict_classes(), but I prefer to look at the actual output of the sigmoid function in the final layer of the model. That enables you to use a prediction threshold in your decision making, for example only entering a long trade when the output is greater than 0.6, say.

Here’s how to get the test set predictions and implement some simple, frictionless trading logic that assigns the target as an individual trade’s profit or loss when the prediction is greater than some threshold (equivalent to a buy) and the negative of the target when the prediction is less than 1 minus the threshold (equivalent to a sell) :

```r
preds <- model %>% predict_proba(X_test)
threshold <- 0.5
trades <- ifelse(preds >= threshold, Y_test_raw,
                 ifelse(preds <= 1 - threshold, -Y_test_raw, 0))
plot(cumsum(trades), type = 'l')
```

This results in the following equity curve (the y-axis is measured in dollars of profit from buying and selling the minimum position size of 0.01 lots):

I think that’s quite an amazing equity curve that demonstrates the potential of even a very small edge. However, note that adding typical retail transaction costs would destroy this small edge, which suggests that longer holding periods are more sensible targets, or that higher accuracies are required in practice.

Also note that you might get different results depending on the initial weights used in your network, as the weights aren’t guaranteed to converge to the same values when initialized to different values. If you repeat the training and evaluation process a number of times, you’ll find that validation accuracies in the range of 52-53% occur most of the time, but while most produce profitable out of sample equity curves, the range of performance is actually quite significant. This implies that there might be benefit in combining the predictions of multiple models using ensemble methods.

## What’s next?

Before we get into advanced model architectures, in the next unit I’ll show you:

- How to fight overfitting and push your models to generalize better.
- One of the more cutting edge architectures to get the most out of a densely connected feed forward network.
- How to interrogate and visualize the training process in real time.

## Conclusions

This post demonstrated how to process multivariate time series data for use in a feed forward neural network, as well as how to construct, train and evaluate such a network using Keras’ sequential model paradigm. While we uncovered a slim edge in predicting the EUR/USD exchange rate, in practical terms, traders with access to retail spreads and commission will want to consider longer holding times to generate more profit per trade, or will need a more performant model to make money with this approach.

**Where to from here?**

- To find out **why AI is taking off in finance**, check out these insights from my days as an AI consultant to the finance industry
- If this **walk-through** was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform
- If the **technical details of neural networks** are interesting for you, you might like our introductory article
- Be sure to check out Part 1 and Part 2 of this series on deep learning applications for trading.