Point-in-Time Equity Return Prediction Using Interpretable Machine Learning

rosy851018
March 29

Can We Predict Stock Returns with Data?

Introduction

Stock returns are often considered unpredictable, especially in the short term. Prices move quickly, and daily changes are often noisy and random.

In this project, I explore a key question:

Can we use data to predict whether stock prices will go up or down?

To answer this, I combined price data and company fundamentals, and built machine learning models to test predictability across different time horizons.

Step 1: Building a Clean Dataset

To avoid unrealistic results, I used a Point-in-Time (PIT) approach. This ensures that, for every prediction date, the model only uses information that was already available on that date.

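import pandas as pd

# merge_asof requires both frames to be sorted by their time key
merged = pd.merge_asof(
    prices.sort_values('Date'),
    fundamentals.sort_values('Report Date'),
    left_on='Date',
    right_on='Report Date',
    by='Ticker',
    direction='backward'  # attach the latest report dated on or before each price date
)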

This avoids data leakage from future information, which is a common mistake in financial modeling.
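As a rough illustration of what the backward as-of join does, here is a minimal sketch on toy data. The tickers and numbers are made up, and it treats 'Report Date' as the day the figures became public. If a dataset also has a separate publication date column, that would probably be the safer join key, because reports are usually released some time after the period they cover.

```python
import pandas as pd

# Toy inputs (hypothetical values, for illustration only)
prices = pd.DataFrame({
    "Ticker": ["AAA"] * 4,
    "Date": pd.to_datetime(["2023-01-10", "2023-02-15", "2023-05-01", "2023-05-20"]),
    "Close": [10.0, 11.0, 12.0, 12.5],
})
fundamentals = pd.DataFrame({
    "Ticker": ["AAA", "AAA"],
    "Report Date": pd.to_datetime(["2023-02-01", "2023-05-10"]),
    "Revenue": [100.0, 120.0],
})

# Both frames must be sorted by their time key before merge_asof
merged = pd.merge_asof(
    prices.sort_values("Date"),
    fundamentals.sort_values("Report Date"),
    left_on="Date",
    right_on="Report Date",
    by="Ticker",
    direction="backward",
)

# Sanity check: no price row carries a report dated after it
matched = merged.dropna(subset=["Report Date"])
leak_free = bool((matched["Report Date"] <= matched["Date"]).all())

print(merged[["Date", "Report Date", "Revenue"]])
print("leak free:", leak_free)
```

The first price row falls before any report, so its fundamentals stay empty instead of being filled from the future.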

Step 2: Turning Raw Data into Signals

Raw data alone is not enough. The key is to turn it into meaningful features.

I started with simple price movements:

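# Assumes panel is sorted by Ticker and Date, so each change is measured against the previous trading day
panel["ret_1d"] = panel.groupby("Ticker")["Close"].pct_change()
panel["vol_chg_1d"] = panel.groupby("Ticker")["Volume"].pct_change()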

Then I added momentum signals to capture trends:

for w in [5, 20, 63]:
    panel[f"momentum_{w}"] = (
        panel.groupby("Ticker")["Close"].pct_change(w)
    )

These features help the model understand how prices move over time.
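A small sketch on toy data shows why the groupby matters: the window never reaches across tickers, so the first w rows of each ticker stay empty. The prices and the window of 2 are invented for the example.

```python
import pandas as pd

# Toy panel with two tickers, already sorted by ticker and time (hypothetical prices)
panel = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Close": [10.0, 11.0, 12.1, 13.31, 50.0, 49.0, 48.0, 47.0],
})

w = 2  # short window so the toy data is enough
panel[f"momentum_{w}"] = panel.groupby("Ticker")["Close"].pct_change(w)

print(panel)
```

Without the groupby, the first BBB rows would be compared against the last AAA prices, which would produce meaningless jumps.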

Step 3: Capturing Market Trends

Markets often follow trends, so I added moving average signals:

panel["ema_12"] = panel.groupby("Ticker")["Close"].transform(
    lambda s: s.ewm(span=12).mean()
)

panel["ema_26"] = panel.groupby("Ticker")["Close"].transform(
    lambda s: s.ewm(span=26).mean()
)

panel["ema_cross"] = panel["ema_12"] - panel["ema_26"]

This captures whether the market is in an uptrend or downtrend.
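To see how the sign of the signal relates to trend direction, here is a minimal sketch on two synthetic price paths, one rising and one falling. The helper name ema_cross and the paths are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic price paths (not real data)
up = pd.Series(np.linspace(100, 150, 60))
down = pd.Series(np.linspace(150, 100, 60))

def ema_cross(close: pd.Series) -> pd.Series:
    """Fast EMA minus slow EMA, using the same spans as above."""
    return close.ewm(span=12).mean() - close.ewm(span=26).mean()

cross_up = ema_cross(up).iloc[-1]
cross_down = ema_cross(down).iloc[-1]

print(f"rising path:  {cross_up:.2f}")
print(f"falling path: {cross_down:.2f}")
```

The fast EMA reacts sooner to recent prices, so it sits above the slow EMA in a steady uptrend (positive value) and below it in a steady downtrend (negative value).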

Step 4: Adding Company Fundamentals

Stock prices are not only driven by market behavior—they also depend on company performance.

So I added fundamental features:

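panel["eps"] = panel["Net Income"] / panel["Shares (Diluted)"]
panel["profit_margin"] = panel["Net Income"] / panel["Revenue"]
# In a daily panel Revenue only changes on report dates, so compute QoQ growth on the
# quarterly fundamentals table (before the merge) rather than row by row on the daily panel
fundamentals["rev_growth_qoq"] = fundamentals.groupby("Ticker")["Revenue"].pct_change()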

This allows the model to learn from both price behavior and business fundamentals.
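One detail worth checking is where growth rates are computed. The sketch below, on an invented quarterly table, computes the ratios per report so that quarter-over-quarter growth compares consecutive reports and not consecutive trading days.

```python
import pandas as pd

# Toy quarterly fundamentals (hypothetical numbers)
fundamentals = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Report Date": pd.to_datetime(
        ["2023-03-31", "2023-06-30", "2023-09-30", "2023-03-31", "2023-06-30"]
    ),
    "Revenue": [100.0, 110.0, 99.0, 200.0, 220.0],
    "Net Income": [10.0, 11.0, 9.9, 30.0, 22.0],
})

fundamentals = fundamentals.sort_values(["Ticker", "Report Date"]).reset_index(drop=True)
fundamentals["profit_margin"] = fundamentals["Net Income"] / fundamentals["Revenue"]
# One row per report, so pct_change() really is quarter over quarter
fundamentals["rev_growth_qoq"] = fundamentals.groupby("Ticker")["Revenue"].pct_change()

print(fundamentals)
```

Once these columns exist on the quarterly table, the point-in-time merge carries them forward onto each trading day.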

Step 5: Defining the Prediction Task

Instead of predicting exact returns, I simplified the problem:

Will the stock go up or down?

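def make_label(df, h):
    # Compound return over the NEXT h days (t+1 .. t+h); df holds one ticker, sorted by date
    gross = 1 + df["ret_1d"]
    future_ret = gross.rolling(h).apply(lambda x: x.prod(), raw=True).shift(-h) - 1
    # The last h rows have no future window: keep them as NaN instead of labelling them 0
    return (future_ret > 0).astype(float).where(future_ret.notna())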

This turns the problem into a binary classification task, which is less sensitive to extreme returns than predicting exact values.
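A forward label at horizon h should depend only on prices from t+1 to t+h. One simple way to sanity-check this is to build the same label directly from prices, as in this minimal sketch with invented closes and h = 2.

```python
import pandas as pd

# Toy closing prices for a single ticker (hypothetical values)
close = pd.Series([100.0, 102.0, 101.0, 99.0, 103.0, 104.0])
h = 2

# Forward h-day return: Close[t+h] / Close[t] - 1
future_ret = close.shift(-h) / close - 1

# 1 = up, 0 = down, NaN = no future window available yet
label = (future_ret > 0).astype(float).where(future_ret.notna())

print(pd.DataFrame({"close": close, "future_ret": future_ret, "label": label}))
```

The last h rows have no label, and they should be dropped before training so they are not silently treated as "down".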

Step 6: Building the Model

I chose a simple but powerful model: logistic regression.

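from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),  # put features on a common scale so the penalty treats them equally
    ("clf", LogisticRegression(
        penalty="l1",
        solver="liblinear"  # liblinear supports the L1 penalty
    ))
])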

L1 regularization helps:

Remove unimportant features by shrinking their coefficients to exactly zero
Keep the model interpretable, because only a few features remain active
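The feature-selection effect is easy to see on synthetic data. In the sketch below only the first two of ten columns drive the label. The data and the value C=0.02 (a strong penalty) are assumptions chosen for the demonstration, not settings from this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Only columns 0 and 1 carry signal; the other eight are pure noise
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # Smaller C means a stronger L1 penalty
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", C=0.02)),
])
pipe.fit(X, y)

coef = pipe.named_steps["clf"].coef_.ravel()
n_zero = int((coef == 0).sum())

print(np.round(coef, 3))
print("coefficients set to zero:", n_zero)
```

The surviving coefficients can be read directly: the sign shows the direction of the effect, and features with a zero coefficient play no part in the prediction.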

Step 7: Comparing Models

I also tested other models:

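from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

models = {
    "LogReg": LogisticRegression(),
    "RandomForest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}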

Interestingly, all models performed similarly.

This suggests that good data and features matter more than model complexity.
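For a fair comparison on time-ordered data, each model can be scored on folds where the training window always comes before the test window. The sketch below uses scikit-learn's TimeSeriesSplit with ROC-AUC on synthetic data. The data, fold count, and model settings are assumptions, and XGBClassifier is left out only to keep the example free of extra dependencies (it can be added to the dictionary in the same way).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data: one informative feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(size=600) > 0).astype(int)

models = {
    "LogReg": LogisticRegression(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Each fold trains on earlier rows and tests on later rows
cv = TimeSeriesSplit(n_splits=4)
scores = {
    name: float(cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())
    for name, model in models.items()
}

for name, auc in scores.items():
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```

A random shuffle would let the model train on later rows and test on earlier ones, which can make any model look better than it is.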

What Did I Find?

The results show that short-term predictions (1 day) are almost random, while performance improves significantly at the 20-day horizon and becomes highly predictable over 60–120 days. As the time horizon increases, market noise decreases, trends become clearer, and fundamentals play a more important role.

Turning Predictions into Strategy

To test the model in practice, the predictions were used to build a simple trading strategy: go long when the predicted probability is high, and stay in cash otherwise. The results show higher returns, lower drawdowns, and more stable performance compared to buy-and-hold. This suggests that the model captures useful patterns that can be applied in real trading.

signal = (proba > 0.55).astype(int)
strategy_return = signal * R_test.values
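Total return and maximum drawdown can be compared with a few lines of numpy. This is a minimal sketch on invented numbers, assuming R_test holds the realized per-period returns aligned with the predictions and ignoring transaction costs. The helper max_drawdown is a hypothetical name introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy predictions and realized returns (hypothetical values)
proba = np.array([0.60, 0.50, 0.70, 0.40, 0.58])
R_test = pd.Series([0.02, -0.01, 0.03, -0.04, -0.01])

signal = (proba > 0.55).astype(int)      # 1 = long, 0 = cash
strategy_return = signal * R_test.values

def max_drawdown(returns) -> float:
    """Largest peak-to-trough loss of the compounded equity curve (starting at 1)."""
    equity = np.concatenate([[1.0], np.cumprod(1 + np.asarray(returns))])
    peak = np.maximum.accumulate(equity)
    return float((equity / peak - 1).min())

strategy_total = float(np.prod(1 + strategy_return) - 1)
buy_hold_total = float(np.prod(1 + R_test.values) - 1)

print(f"strategy: total={strategy_total:.4f}, max drawdown={max_drawdown(strategy_return):.4f}")
print(f"buy&hold: total={buy_hold_total:.4f}, max drawdown={max_drawdown(R_test):.4f}")
```

In this toy case the strategy sits out two of the three losing periods, which is where both the higher return and the smaller drawdown come from.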

Final Thoughts

Strong results don’t always require complex models. Clean data, good features, and a clear problem matter more. Even in noisy markets, data-driven methods can still uncover useful signals—especially over longer horizons.

Key Takeaways

Stock returns are hard to predict in the short term
Predictability improves over longer horizons
Technical signals dominate short-term predictions
Fundamentals matter more in the long run
Simple models can still be very powerful

Technical Skills Used

Python & Data Analysis: pandas, numpy
Machine Learning: Logistic Regression (L1/L2), Random Forest, XGBoost
Quantitative Methods: Time series modeling, return forecasting, feature selection
Feature Engineering: Momentum, volatility, EMA signals, financial ratios
Data Engineering: Point-in-Time (PIT) data pipeline, data cleaning
Model Validation: Accuracy, ROC-AUC, cross-validation, backtesting
Visualization: Matplotlib, Seaborn
