Does a change in news sentiment predict a change in the stock price?

This is the holy grail question of algorithmic trading. As an engineer moving into the AI space, I wanted to test it empirically. My initial plan was simple: build a pipeline, plug in FinBERT, the industry-standard financial BERT model, and watch the insights roll in.

But before deploying this to production, I decided to run a stress test in my personal R&D sandbox. I called it “The Reality Check.”

The results forced me to rethink my entire architecture.

The Hypothesis: “One Model Fits All”

In the world of Financial NLP, models like FinBERT (ProsusAI) are the gold standard. They are pre-trained on massive corpora of financial news, earnings calls, and analyst reports.

My hypothesis was straightforward: If a model is trained on “financial language,” it should work equally well on a Bloomberg headline and a Reddit thread.

To test this, I built a benchmarking framework in Python to pit 5 different models against two very different datasets:

  1. Formal News: Financial PhraseBank (clean, editorialized text).
  2. Social Media: Twitter Financial News (messy, sarcastic, slang-heavy).
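
For reference, here is roughly how both datasets load from the Hugging Face hub. The hub IDs below are my assumption of which hosted versions the benchmark corresponds to; swap in your own copies if they differ:

```python
from datasets import load_dataset

# Financial PhraseBank: news-style sentences with annotator labels;
# the "sentences_allagree" config keeps only unanimous annotations.
phrasebank = load_dataset("takala/financial_phrasebank", "sentences_allagree")

# Twitter financial news: short, informal, slang-and-emoji-heavy posts.
tweets = load_dataset("zeroshot/twitter-financial-news-sentiment")

print(phrasebank["train"][0])
print(tweets["train"][0])
```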

The Experiment

I used the Hugging Face transformers library to load a diverse collection of models, ranging from specialized financial experts to generalist transformers, including:

  • ProsusAI/finbert (The Banker)
  • cardiffnlp/twitter-roberta (The Socialite)
  • distilbert-base-uncased (The Generalist)
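
Loading the lineup looks roughly like this. Two assumptions to flag: the fully qualified hub IDs for “The Socialite” and “The Generalist” are my best guesses at the checkpoints behind the shorthand names, and I read the DistilBERT entry as the SST-2 fine-tuned variant, since the bare base model ships without a trained classification head:

```python
from transformers import pipeline

# Nickname -> assumed Hugging Face hub checkpoint.
MODELS = {
    "The Banker": "ProsusAI/finbert",
    "The Socialite": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "The Generalist": "distilbert-base-uncased-finetuned-sst-2-english",
}

classifiers = {
    nickname: pipeline("text-classification", model=model_id)
    for nickname, model_id in MODELS.items()
}

print(classifiers["The Banker"]("Profit rose by 5%"))
# -> something like [{'label': 'positive', 'score': 0.9...}]
```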

The challenge wasn’t just running the models; it was normalizing them. Some models output [positive, negative, neutral], while others emit opaque [LABEL_0, LABEL_1] indices. I wrote a normalization engine (sketched below) to map every output onto a single shared schema so I could compare apples to apples.
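
Here is a minimal sketch of that normalization idea. The per-model label maps are assumptions pieced together from the model cards; in a real pipeline I’d pull them from `model.config.id2label` rather than hard-coding:

```python
CANONICAL = ("positive", "negative", "neutral")

LABEL_MAPS = {
    # FinBERT already speaks the canonical schema.
    "ProsusAI/finbert": {
        "positive": "positive", "negative": "negative", "neutral": "neutral",
    },
    # Older Twitter-RoBERTa checkpoints emit anonymous LABEL_n indices.
    "cardiffnlp/twitter-roberta-base-sentiment": {
        "LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive",
    },
    # SST-2 models are binary: there is simply no neutral class to map.
    "distilbert-base-uncased-finetuned-sst-2-english": {
        "POSITIVE": "positive", "NEGATIVE": "negative",
    },
}

def normalize(model_id: str, raw_label: str) -> str:
    """Map a model-specific label onto the shared schema."""
    label = LABEL_MAPS[model_id].get(raw_label)
    if label not in CANONICAL:
        raise ValueError(f"Unmapped label {raw_label!r} from {model_id}")
    return label
```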

The Result: The Accuracy Gap

When I visualized the data using Plotly, the “One Model Fits All” hypothesis fell apart.

  • On Formal News: FinBERT was a genius. It achieved ~97% accuracy, correctly identifying that “Profit rose by 5%” is positive.
  • On Social Media: FinBERT crashed. Its accuracy dropped to ~30%.
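
For completeness, the scoring-and-charting step was roughly this shape. It assumes a `preds` DataFrame with one row per (model, dataset, example), holding the normalized prediction and the gold label; the column names are my own:

```python
import pandas as pd
import plotly.express as px

def accuracy_table(preds: pd.DataFrame) -> pd.DataFrame:
    """Mean accuracy per (model, dataset) pair."""
    preds = preds.assign(correct=preds["prediction"] == preds["gold"])
    return (
        preds.groupby(["model", "dataset"])["correct"]
        .mean()
        .rename("accuracy")
        .reset_index()
    )

def plot_gap(acc: pd.DataFrame) -> None:
    """Grouped bars make the news-vs-social gap jump out."""
    fig = px.bar(
        acc, x="model", y="accuracy", color="dataset",
        barmode="group", title="Sentiment accuracy: formal news vs. social media",
    )
    fig.show()
```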

Why? Because FinBERT doesn’t speak “Internet.”

When a user on Twitter says, “My portfolio is bleeding but I have diamond hands 💎,” a traditional financial model sees the word “bleeding” and predicts Negative. But any crypto trader knows “diamond hands” implies a stubborn, bullish conviction (Positive).
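
You can reproduce this failure mode in a few lines (exact scores will vary run to run; the point is the label):

```python
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")

tweet = "My portfolio is bleeding but I have diamond hands 💎"
# Per the argument above, "bleeding" drags the label toward negative
# even though the slang signals bullish conviction.
print(finbert(tweet))
```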

The Lesson: We Need an Ensemble

This experiment proved that in AI engineering, domain expertise is not enough; context expertise matters.

A model trained on the Wall Street Journal cannot navigate r/WallStreetBets. This “Reality Check” saved me from deploying a flawed system that would have misread roughly half of the market’s signals.

What’s Next?

The failure of the single-model approach led me to design a Multi-Agent Quorum. Instead of relying on one brain, I am now building an architecture (sketched after the list) where:

  1. “The Banker” Agent handles news.
  2. “The Socialite” Agent handles tweets.
  3. A “Meta-Agent” resolves the conflicts.
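
Here is a minimal sketch of the quorum idea, assuming the simplest possible policy (agree: done; disagree: defer to the in-domain specialist). The real Meta-Agent’s conflict resolution is still being designed, so treat this as a placeholder, not the final architecture:

```python
from transformers import pipeline

banker = pipeline("text-classification", model="ProsusAI/finbert")
socialite = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def meta_agent(text: str, source: str) -> str:
    """Query both agents; resolve conflicts by source-aware routing."""
    b = banker(text)[0]["label"].lower()      # e.g. 'positive'
    s = socialite(text)[0]["label"].lower()
    if b == s:
        return b                              # quorum reached
    return b if source == "news" else s       # conflict: trust the specialist

print(meta_agent("Profit rose by 5%", source="news"))
print(meta_agent("diamond hands 💎 to the moon", source="twitter"))
```

Routing by source is cheap and transparent; confidence-weighted voting across agents is the obvious next refinement.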

You can check out the code for this benchmark and follow the development of the Agentic Quorum (see 02_Agent_Quorum_POC.ipynb) in my GitHub repository.