How Cadre Uses Machine Learning to Target Real Estate Markets
Market selection is a fundamental aspect of real estate investing — choosing the right market helps focus a real estate investor’s time and money towards the subset of transactions that may be more likely to outperform over time.
Traditionally, real estate investors select markets based on market fundamentals and use heuristics or occasionally simple statistical techniques to build selection metrics. As a result, these approaches often present a myopic view of historical and forward-looking market trends. At Cadre, we believe that by leveraging alternative datasets alongside traditional data and using cutting-edge machine learning techniques, we can gain a better understanding of markets and in turn make more informed investment decisions for our clients.
Identifying and ranking promising markets is no easy task. Ground truth data is sparse and rarely updated, the life of an asset is long, and the cost of a bad prediction is high.
In the past, our models would assess market potential by forecasting future rent prices across markets. More recently, we have shifted our models to focus on RevPAF, or revenue per available foot. Defined as the product of occupancy rate and effective rent, RevPAF is a measure tracked by real estate professionals to index markets based on earnings from all assets. Unlike rent alone, RevPAF can serve as a single score for a market's earning potential because it captures top-line revenue, which is one of the most influential drivers of NOI (Net Operating Income) growth and, in turn, of an asset's value.
Modeling RevPAF one, three, and five years ahead allows us to gain a better understanding of market growth and, in turn, narrows our search space for deals.
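To make the definition concrete, here is a minimal sketch of computing RevPAF and its forward growth with pandas. The column names, the `forward_growth` helper, and all numbers are hypothetical and only illustrate the arithmetic; they are not taken from our production data.

```python
import pandas as pd

# Hypothetical annual data for one market; values are made up for illustration.
df = pd.DataFrame({
    "year": [2010, 2011, 2012, 2013],
    "occupancy_rate": [0.90, 0.92, 0.94, 0.95],
    "effective_rent": [20.0, 21.0, 22.0, 24.0],  # rent per square foot
})

# RevPAF is the product of occupancy rate and effective rent.
df["revpaf"] = df["occupancy_rate"] * df["effective_rent"]

def forward_growth(series: pd.Series, horizon: int) -> pd.Series:
    """Growth from year t to year t + horizon, aligned to year t.

    Assumes one row per consecutive year, sorted ascending.
    """
    return series.shift(-horizon) / series - 1.0

df["revpaf_growth_1y"] = forward_growth(df["revpaf"], 1)
df["revpaf_growth_3y"] = forward_growth(df["revpaf"], 3)
```

The last rows of each growth column are empty because the future value has not been observed yet, which is exactly the quantity a forecast has to fill in.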
In this post, we will walk through how we use machine learning techniques to model future RevPAF growth. While our internal models compute RevPAF at a submarket and ZIP-code level, this post will describe models that forecast RevPAF growth at the MSA (Metropolitan Statistical Area) level.
Based on the industry knowledge of our seasoned in-house investing team, our working hypothesis is the following:
Generally, markets with strong demographics (jobs, median income, population, educational attainment, etc.) lead to strong market fundamentals (rents, occupancy, inventory, etc.) that in turn drive RevPAF growth. In other words, if we were to take a current snapshot of our dataset and rank every market by its demographic attributes, that ranking should align with the outcomes of our RevPAF growth models.
Here’s how we figured out whether our hunch was right.
In order to utilize RevPAF to identify promising markets, we want to be able to do three things:
- Model RevPAF growth 1, 3, and 5 years ahead
- Understand what drives RevPAF
- Be able to visualize a current market snapshot with predictions
The following diagram describes the steps we take in order to achieve these goals. We’ll describe each of these steps in detail:
The first step in our analysis is to consolidate a time series dataset describing market behavior. Our multi-dimensional datasets include, but are not limited to:
- Economic and demographics data
- Market fundamentals data
- Local business data
- Social data
- Alternative datasets (think schools, foot traffic, crime, businesses, etc)
Our focus as a business is investing in CRE (commercial real estate). Unfortunately, CRE data — whether transaction data or historical market trends data — tends to be sparse and often lacks an extensive history. With around seven million total multi-family and office buildings in the US, annual transaction volumes are low (in the thousands), and the MSA aggregate data isn’t large enough for models to capture patterns without overfitting. To solve for this, we have spent a lot of time assessing and acquiring alternative datasets to use alongside traditional CRE datasets as proxies to evaluate markets.
Each dataset that we ingest is sliced by geography (latitude/longitude, market, submarket, MSA) and a temporal component (monthly, quarterly, annually). We form a feature set by aggregating all lat/long, ZIP Code, and submarket data up to an MSA level. This initial dataset allows us to have a clear view on all markets and their attributes per year.
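As a rough illustration of rolling finer-grained data up to the MSA level, the sketch below aggregates hypothetical ZIP-code rows. The column names and the choice to weight occupancy by unit count are assumptions made for the example; the right aggregation differs by feature.

```python
import pandas as pd

# Hypothetical ZIP-level records; names and values are illustrative only.
zip_data = pd.DataFrame({
    "msa": ["Chicago", "Chicago", "Houston", "Houston"],
    "zip_code": ["60601", "60602", "77001", "77002"],
    "year": [2015, 2015, 2015, 2015],
    "units": [1000, 3000, 2000, 2000],
    "occupancy_rate": [0.90, 0.94, 0.88, 0.92],
})

# Counts can be summed directly; rates need a weighted average.
# Here occupancy is weighted by unit count (an assumption for this sketch).
zip_data["occupied_units"] = zip_data["units"] * zip_data["occupancy_rate"]

msa_data = zip_data.groupby(["msa", "year"], as_index=False).agg(
    units=("units", "sum"),
    occupied_units=("occupied_units", "sum"),
)
msa_data["occupancy_rate"] = msa_data["occupied_units"] / msa_data["units"]
```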
However, since many of our data sources span different years and are typically quite sparse, our combined dataset ends up being relatively large and very sparse with a fill rate of 20% on some of the features.
As a result, we conduct an exploratory analysis on our feature set to understand the missing data.
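A first pass at understanding missing data is to compute the fill rate of each feature, overall and per year. The sketch below does this on a tiny made-up table; the feature names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical MSA-year feature table with gaps.
features = pd.DataFrame({
    "msa": ["A", "A", "B", "B", "C"],
    "year": [2014, 2015, 2014, 2015, 2015],
    "median_income": [50.0, 52.0, np.nan, 61.0, 45.0],
    "foot_traffic": [np.nan, np.nan, np.nan, 1.2, np.nan],
})
value_cols = ["median_income", "foot_traffic"]

# Fill rate = share of non-missing values, sorted so the sparsest come first.
fill_rate = features[value_cols].notna().mean().sort_values()

# The same measure per year shows which years each source actually covers.
fill_rate_by_year = features[value_cols].notna().groupby(features["year"]).mean()
```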
Exploratory Analysis and Feature Generation
To test our hypothesis that demographics drive market fundamentals, which in turn drive RevPAF growth, we need to ensure our feature set includes as much history of demographics and market fundamentals as possible.
Let’s examine the depth of market fundamentals and demographics data we store:
This diagram shows the overlap of all datasets to be 2009–2015. Though seven years might be sufficient for training, considering the 200+ features and 388 MSAs, it is certainly not enough data to back-test our models.
We have a few choices to expand our dataset:
- Remove Population, Employment (BLS) and Median Income (ACS) in order to have data for every year between 2000–2016
- Impute the missing years' worth of data
Knowing the real estate market, removing population wouldn’t seem to make sense because real estate is driven by supply/demand dynamics, and population is clearly an indication of demand. But we need to evaluate that assumption.
We can test how important population and other variables are to RevPAF by trying the following approaches:
- Spearman correlations: Test the relationship between each feature and year-ahead RevPAF growth by measuring how well a monotonic function describes it.
- Recursive feature elimination: Recursively remove a feature and build a logistic regression model on the remaining features while testing the accuracy of the model. Choose the features that most affect accuracy.
- Tree-based selection: Apply a Random Forest model to the feature set and extract how much each feature reduces variance across the splits of the forest's trees.
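The three approaches above can be sketched with SciPy and scikit-learn on synthetic data. Everything here is an assumption for illustration: the feature names, the synthetic response (built so that population carries no signal), and the choice to turn growth into an above-median class so that logistic regression applies. Note also that scikit-learn's `RFE` drops the feature with the smallest coefficient at each step rather than re-testing accuracy, so it only approximates the procedure described.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "affordability": rng.normal(size=n),
    "median_income": rng.normal(size=n),
    "population": rng.normal(size=n),
})
# Synthetic response: population deliberately has no effect.
y = 2.0 * X["affordability"] + X["median_income"] + rng.normal(scale=0.1, size=n)

# 1. Spearman: strength of the monotonic relationship with the response.
def abs_spearman(feature: pd.Series) -> float:
    rho, _ = spearmanr(feature, y)
    return abs(rho)

spearman = pd.Series({c: abs_spearman(X[c]) for c in X.columns})
spearman = spearman.sort_values(ascending=False)

# 2. Recursive feature elimination with logistic regression on a
#    binarized response (above-median growth), keeping two features.
y_class = (y > y.median()).astype(int)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y_class)

# 3. Random Forest importances (mean decrease in variance across splits).
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)
```

Comparing `spearman`, `rfe.support_`, and `importance` side by side shows whether the three methods agree on which features matter least.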
To our surprise, demographic features such as affordability and median income outweigh population in terms of having predictive power for RevPAF growth.
The results are not hard to fathom: Population is highly correlated with both demand and supply side market trends, which suggests that eliminating population from our feature set will have a negligible influence on our model.
As a byproduct of this exercise, we are also able to resolve one of the challenges of using multiple data sources that may have different values for the same attribute (e.g. two sources of data might report two different population estimates for a given city in the same year). This process allows us to make a decision on which duplicated demographic source to use. For instance, Employment (3rd party) and Median Income (ACS) tested to have little significance to our models and we therefore discarded them from our feature set. Though the duplicate metrics might have properties that have predictive power, we are only concerned with how effective these variables are in predicting RevPAF growth (our response variable).
The result: We should use Employment (BLS), Median Income (3rd party), and remove Population from our feature set.
Since this dataset is relatively large and consists of many correlated variables, we try running Principal Component Analysis (PCA) to reduce dimensionality of our dataset and avoid falling victim to the Curse of Dimensionality. The idea is to conflate features by finding a linear combination of variables that effectively capture variance of the dataset.
Unfortunately, in our case, it takes 120 components (more than half the size of our feature set) to capture only ~60% of the variance in the data, so we decide to proceed without PCA.
Although we arrive at a feature set capturing as much relevant demographic and market fundamentals history as possible, the rest of our dataset still suffers from sparsity. We choose to impute these values in order to have a feature set that spans from 2000–2017.
Some of the models we wish to run require a full matrix, so we first eliminate all variables with more than 40% missing values, since imputing them would likely create noise. For the remaining features, we try various traditional imputation methods: kNN, or the mean or median of a feature in a given year.
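One way to run this check is to look at the cumulative explained variance of the principal components. The sketch below uses scikit-learn on random synthetic data, so the numbers it produces say nothing about our dataset; `components_needed` is a hypothetical helper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # synthetic stand-in for a feature matrix

# Standardize first so no feature dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

def components_needed(cumulative_variance: np.ndarray, target: float) -> int:
    """Smallest number of components whose cumulative variance reaches target."""
    return int(np.searchsorted(cumulative_variance, target) + 1)

n_for_60 = components_needed(cumulative, 0.60)
```

If `n_for_60` is a large fraction of the original feature count, PCA is not compressing much, which is the situation described above.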
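A minimal sketch of this step, assuming a hypothetical MSA-year table: drop features that are more than 40% missing, then fill the remaining gaps with the median of that feature in the same year.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table; names and values are illustrative only.
features = pd.DataFrame({
    "msa": ["A", "B", "C", "A", "B", "C"],
    "year": [2014, 2014, 2014, 2015, 2015, 2015],
    "occupancy_rate": [0.90, np.nan, 0.94, 0.91, 0.93, np.nan],
    "foot_traffic": [np.nan, np.nan, np.nan, 1.2, np.nan, 1.0],
})
value_cols = ["occupancy_rate", "foot_traffic"]

# Keep only features with at most 40% missing values.
missing_share = features[value_cols].isna().mean()
kept_cols = missing_share[missing_share <= 0.40].index.tolist()

# Fill each remaining gap with the median of that feature in the same year.
imputed = features[["msa", "year"] + kept_cols].copy()
imputed[kept_cols] = imputed.groupby("year")[kept_cols].transform(
    lambda s: s.fillna(s.median())
)
```

Swapping `median` for `mean` gives the mean variant, and scikit-learn's `KNNImputer` could be used for the kNN variant.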
We also get creative with non-traditional imputation approaches by realizing that markets in the same time frame with similar population can behave comparably. For example, if the occupancy rate for Chicago 2009 is missing, we can impute it by taking an average of Houston, Philadelphia, and Los Angeles’ 2009 occupancy.
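The peer-based idea can be sketched as follows. The population figures and occupancy values are invented, and choosing peers by smallest absolute population difference is an assumption of this example, as is the `impute_from_peers` helper.

```python
import numpy as np
import pandas as pd

# Hypothetical 2009 snapshot; populations (millions) and rates are made up.
snapshot = pd.DataFrame({
    "msa": ["Chicago", "Houston", "Philadelphia", "Los Angeles", "Boise"],
    "year": [2009] * 5,
    "population": [9.5, 5.9, 6.0, 12.8, 0.6],
    "occupancy_rate": [np.nan, 0.90, 0.92, 0.94, 0.85],
})

def impute_from_peers(year_df: pd.DataFrame, column: str, k: int = 3) -> pd.DataFrame:
    """Fill gaps in `column` with the mean of the k markets closest in population.

    Expects rows from a single year so that peers share the same time frame.
    """
    out = year_df.copy()
    known = out[out[column].notna()]
    for idx in out.index[out[column].isna()]:
        distance = (known["population"] - out.loc[idx, "population"]).abs()
        peers = distance.nsmallest(k).index
        out.loc[idx, column] = known.loc[peers, column].mean()
    return out

filled = impute_from_peers(snapshot, "occupancy_rate")
```

In this toy data, the three markets closest to Chicago in population are Los Angeles, Philadelphia, and Houston, so Chicago receives the average of their occupancy rates.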
We understand that our imputation approaches are inherently approximate. Therefore, we test which method to use for each feature by running our model exhaustively and picking the method that yields the lowest back-test error.
Though we have created our feature set, we can’t apply tree-based models in a meaningful way since there is serial correlation in our data. Models like Random Forests build trees by randomly sampling a subset of training data, which wouldn’t take into account the importance of time in our dataset. In our case, we are dealing with information where future observations are definitely affected by past values.
To solve for this, we create a new feature set that includes information from the past in each row or data point in an attempt to capture seasonality. Specifically, for each feature, we calculate one-, three-, and five-year growth and its momentum (change in growth). We also add observed RevPAF growth from one, three, and five years ahead as our response variables to train and test against.
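A minimal sketch of the growth and momentum features, assuming a hypothetical panel with one row per MSA and year. Only one feature and two horizons are shown because the toy panel is short; a five-year horizon works the same way with more history.

```python
import pandas as pd

# Hypothetical MSA-year panel; values are made up for illustration.
panel = pd.DataFrame({
    "msa": ["A"] * 5 + ["B"] * 5,
    "year": list(range(2010, 2015)) * 2,
    "employment": [100.0, 110.0, 121.0, 121.0, 133.1,
                   50.0, 50.0, 55.0, 55.0, 60.5],
}).sort_values(["msa", "year"])

by_msa = panel.groupby("msa")["employment"]
for horizon in (1, 3):
    # Growth over the past `horizon` years, computed within each MSA so
    # one market's history never leaks into another's.
    growth = panel["employment"] / by_msa.shift(horizon) - 1.0
    panel[f"employment_growth_{horizon}y"] = growth
    # Momentum is the year-over-year change in that growth rate.
    panel[f"employment_momentum_{horizon}y"] = growth.groupby(panel["msa"]).diff()
```

Each row now carries its own history, so a model that treats rows as independent samples still sees how the market has been trending.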
The new feature set looks something like:
We apply various time series and tree-based models on our feature set, measuring overall accuracy by computing a Spearman correlation between our predictions and observed results at each step in our cross validation and taking an average.
Conventional cross validation techniques of randomly sampling the dataset into training, validation, and test sets do not work with time series data. This is again because random sampling does not take into account the temporal structure of the dataset. As a workaround, we make use of training windows to cross validate.
Specifically, we test two techniques for time series cross validation. The following examples show our methodology for a one year ahead prediction for 2017 and are inspired by the works of Rob Hyndman. These methods can be extended to predict three and five years ahead:
Forecasting with rolling window: For every year before the prediction year (2017), train and test all possible 10-year windows, i.e., train 2000–2009 and predict 2010, train 2001–2010 and predict 2011, …, train 2006–2015 and predict 2016. Measure accuracy by averaging the Spearman correlation over all predictions.
Forecasting with rolling prediction origin: Train and test every year to date, starting with a 10-year window and rolling the prediction origin forward, i.e., train 2000–2009 and predict 2010, train 2000–2010 and predict 2011, …, train 2000–2015 and predict 2016. Measure accuracy by averaging the Spearman correlation over all predictions.
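The two schemes differ only in where the training window starts. The generators below are a small sketch of the year splits for a one-year-ahead forecast; the function names are hypothetical.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], int]  # (training years, test year)

def rolling_window_splits(first_year: int, last_test_year: int,
                          window: int = 10) -> Iterator[Split]:
    """Fixed-length training window that slides forward one year at a time."""
    for test_year in range(first_year + window, last_test_year + 1):
        yield list(range(test_year - window, test_year)), test_year

def rolling_origin_splits(first_year: int, last_test_year: int,
                          min_window: int = 10) -> Iterator[Split]:
    """Training window that always starts at first_year and grows each step."""
    for test_year in range(first_year + min_window, last_test_year + 1):
        yield list(range(first_year, test_year)), test_year

windows = list(rolling_window_splits(2000, 2016))
origins = list(rolling_origin_splits(2000, 2016))
```

With data starting in 2000 and 2016 as the last test year, both schemes produce seven splits. The rolling window always trains on ten years, while the rolling origin trains on ten to sixteen.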
Having a consistent methodology to train and test, we now run our models.
We first run a simple Linear Regression model to benchmark our predictions. Over 2000–2017, we achieve a weak average Spearman correlation of 0.33 over all markets. However, we know this is just a starting place.
We now apply various industry-proven machine learning regression models including Random Forests, Extremely Randomized Trees, and XGBoost. We pick the best model and associated hyperparameters by taking the following steps when training:
The model and hyperparameters for our final model with unobserved RevPAF growth are simply the ones that yield the highest average Spearman correlation over our training set.
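As an illustration of selecting hyperparameters by average Spearman correlation, the sketch below back-tests two Extremely Randomized Trees configurations over rolling 10-year windows. The data is synthetic, the `backtest` helper is hypothetical, and the two `max_depth` values are arbitrary; a real search would cover more models and parameters.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
years = np.repeat(np.arange(2000, 2017), 40)  # 40 synthetic markets per year
X = pd.DataFrame(rng.normal(size=(len(years), 3)), columns=["f1", "f2", "f3"])
y = X["f1"] - 0.5 * X["f2"] + rng.normal(scale=0.3, size=len(years))

def backtest(model, window: int = 10) -> float:
    """Average Spearman correlation over rolling-window one-year-ahead tests."""
    scores = []
    for test_year in range(int(years.min()) + window, int(years.max()) + 1):
        train = (years >= test_year - window) & (years < test_year)
        test = years == test_year
        model.fit(X[train], y[train])
        rho, _ = spearmanr(model.predict(X[test]), y[test])
        scores.append(rho)
    return float(np.mean(scores))

candidates = {
    depth: ExtraTreesRegressor(n_estimators=50, max_depth=depth, random_state=0)
    for depth in (2, 8)
}
results = {depth: backtest(model) for depth, model in candidates.items()}
best_depth = max(results, key=results.get)
```

Because Spearman correlation compares rankings rather than values, this score rewards a model for ordering markets correctly even when its growth estimates are off in magnitude.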
So what did our approach yield?
Extremely Randomized Trees (ER Trees) performed consistently better than other models for all years and the optimal hyperparameters did not vary much, which implied our approach was stable.
ER Trees is an extension of the Random Forest model where the “extra randomness” is induced by randomly choosing the threshold for each split. This results in an algorithm that generally has less variance, but at the cost of a higher bias. As a result, the algorithm is less prone to fitting to the noise in the training data and may have a better expected test (prediction) error.
We managed to forecast RevPAF a year ahead with an average Spearman correlation of 0.74 over our back-test, which suggests a very strong correlation between our predictions and observed RevPAF growth. This gives us confidence in using our model to estimate future RevPAF growth.
We also found that forecasting with a rolling window yielded better model predictions, since the CRE market is cyclical in nature with ~8–17 year cycles. Forecasting with a rolling prediction origin generally underpredicted RevPAF growth due to the inclusion of the financial crisis in 2008 for every prediction.
What We Learned
The most prominent features to predict RevPAF growth are affordability (Rents/Median Income), employment, permits, stock, and median home prices.
We are also able to perform sanity checks for our projections by visualizing a current market snapshot where MSAs are ordered by decreasing RevPAF growth predictions and features are sorted by decreasing importance. This allows us to have a clear view on the current market status of MSAs we see most potential in, and how their corresponding features rank against all other MSAs today.
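A snapshot table of this kind can be assembled with pandas. The markets, predictions, feature values, and importances below are dummy data, and ranking every feature so that a higher value earns a better rank is an assumption of this sketch.

```python
import pandas as pd

# Dummy data: three markets, a predicted growth, and two features.
snapshot = pd.DataFrame({
    "msa": ["A", "B", "C"],
    "predicted_revpaf_growth": [0.02, 0.05, 0.03],
    "employment_growth": [0.01, 0.03, 0.02],
    "median_income_growth": [0.02, 0.01, 0.03],
}).set_index("msa")

# Hypothetical feature importances, e.g. taken from a fitted tree model.
feature_importance = pd.Series(
    {"employment_growth": 0.6, "median_income_growth": 0.4}
)
ordered_features = feature_importance.sort_values(ascending=False).index.tolist()

# Rows: markets by decreasing predicted growth.
# Columns: features by decreasing importance, shown as ranks (1 = highest value).
table = snapshot.sort_values("predicted_revpaf_growth", ascending=False)
ranks = table[ordered_features].rank(ascending=False).astype(int)
ranks["average_rank"] = ranks.mean(axis=1)
```

In this dummy data the top predicted market (B) does not have the best average feature rank (C does), which mirrors the "generally, but not always" pattern one would check for.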
For example (with dummy data):
Consistent with our intuition, higher ranked markets generally, but not always, had a higher average rank across all features. This implies that our model results are somewhat in agreement with our hunch: MSAs with strong demographics today do see higher RevPAF growth predictions, but only when demand outpaces supply. After all, a strong and growing market this year is the best predictor of a strong and growing market next year. While market strength may be the best indicator of future performance, as investors we know that prior performance does not guarantee future success.
As a data team, we use the insights gleaned from our research to understand what drives the heuristics that investors often employ. It may be impossible to time the next cycle, but we can certainly use data to help our investors pinpoint market dynamics associated with growth.