Sentiment Analysis as a Trading Indicator

Alexander Özkan · March 28, 2020

Exploring the news as an indicator for trends in the stock market

I’d like to preface this by saying that I do not have a finance background in any form and this is simply an exploratory look.

This is also a code-heavy analysis, but you can skip to the bottom for the plots and the conclusion.

That aside, in order to see what impact the news has on the global stock markets we’ll need to gather some data.

We’ll start with the S&P 500 as our sample of stocks. I’ve sourced this dataset from datahub.io. It’s a little out of date, so we’ll need to verify that all of the company tickers/symbols still exist.

Let’s begin by getting some of the imports we need:

import pandas as pd
import matplotlib.pyplot as plt

The S&P data is in a .csv file, so we can import it and take a look using Pandas quite easily:

sp = pd.read_csv('constituents.csv')
print(sp.shape)
sp.head(5)

(489, 3)

	Symbol	Name	Sector
0	MMM	3M Company	Industrials
1	AOS	A.O. Smith Corp	Industrials
2	ABT	Abbott Laboratories	Health Care
3	ABBV	AbbVie Inc.	Health Care
4	ACN	Accenture plc	Information Technology

As you can see, we’ve got 489 rows and 3 columns: the symbol/ticker, the name of the company, and the sector the company operates in.

Let’s see what sectors comprise the S&P:

print(sp['Sector'].value_counts(normalize=True))

Consumer Discretionary        0.163599
Information Technology        0.141104
Financials                    0.139059
Industrials                   0.134969
Health Care                   0.122699
Real Estate                   0.065440
Consumer Staples              0.065440
Utilities                     0.057260
Energy                        0.057260
Materials                     0.047035
Telecommunication Services    0.006135
Name: Sector, dtype: float64

# Series.value_counts() sorts by count by default (pd.value_counts is deprecated in recent pandas)
sector_counts = sp['Sector'].value_counts()
sector_pie = sector_counts.plot.pie(autopct='%.0f%%')
sector_pie.set_ylabel('')

Text(0, 0.5, '')

[Figure: pie chart of S&P 500 constituents by sector]

I’ve normalised the values and we can see that consumer discretionary based companies are leading (at the time of this data), with 16% of the S&P 500 being companies in that sector.

Querying for Price and Volume

We’re going to need to get the daily price for each stock, as well as information about how many shares per day are being traded.

We can use a free API by Alpha Vantage for this. They’ll give us 500 free requests per day. Just enough for our use!

I’ve written a script to query the API for a 20 day historic list of the price the equity opened at, price it closed at, as well as the high/low and the volume for that day. This information is then inserted into an SQL database for later use.

All code is viewable here on my GitHub
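The actual script lives in the GitHub repository mentioned above. As a rough idea of what such a script has to do, here is a minimal sketch that assumes Alpha Vantage's `TIME_SERIES_DAILY` endpoint and its JSON layout. The function names, the table schema, and the choice to treat any payload without a time series as a failed ticker are assumptions for illustration, not the author's implementation. Note that a rate-limit notice would also come back without a time series, so a real script should inspect the message before dropping a ticker.

```python
import json
import sqlite3
import urllib.parse
import urllib.request

API_URL = "https://www.alphavantage.co/query"


def fetch_daily(ticker, api_key):
    """Request the daily time series for one ticker and return the decoded JSON."""
    params = urllib.parse.urlencode(
        {"function": "TIME_SERIES_DAILY", "symbol": ticker, "apikey": api_key}
    )
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as response:
        return json.load(response)


def parse_daily(ticker, payload, days=20):
    """Turn a payload into (ticker, date, open, high, low, close, volume) rows.

    Returns None when the payload has no time series, which is how an
    unknown or delisted ticker shows up (the API sends a message instead).
    """
    series = payload.get("Time Series (Daily)")
    if series is None:
        return None
    rows = []
    # ISO dates sort correctly as strings, so reverse order puts the newest day first
    for date in sorted(series, reverse=True)[:days]:
        day = series[date]
        rows.append((
            ticker, date,
            float(day["1. open"]), float(day["2. high"]),
            float(day["3. low"]), float(day["4. close"]),
            float(day["5. volume"]),
        ))
    return rows


def store_rows(conn, rows):
    """Insert parsed rows into equities_daily (hypothetical schema), creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS equities_daily "
        "(ticker TEXT, date TEXT, open REAL, high REAL, low REAL, close REAL, volume REAL)"
    )
    conn.executemany("INSERT INTO equities_daily VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
```

A driver loop would call `fetch_daily` for each symbol in `sp['Symbol']`, pass the result to `parse_daily`, and either store the rows or record the ticker as no longer available when `None` comes back.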

The script was written to account for any no longer existing tickers (via reading the API errors), and the following were identified and removed:

    Symbol                            Name                  Sector
  ANDV                        Andeavor                  Energy
   BBT                BB&T Corporation              Financials
  BF.B              Brown-Forman Corp.        Consumer Staples
 CSRA                       CSRA Inc.  Information Technology
  DPS         Dr Pepper Snapple Group        Consumer Staples
 EVHC             Envision Healthcare             Health Care
  GGP  General Growth Properties Inc.             Real Estate
  LLL     L-3 Communications Holdings             Industrials
  MON                    Monsanto Co.               Materials
  NFX         Newfield Exploration Co                  Energy
  PXD       Pioneer Natural Resources                  Energy
  TWX                Time Warner Inc.  Consumer Discretionary
  WYN               Wyndham Worldwide  Consumer Discretionary

Let’s take a look at the freshly gathered equity data:

import sqlite3

eq_db = sqlite3.connect('equities.db')
stock_info = pd.read_sql_query("SELECT * FROM equities_daily", eq_db)
stock_info = stock_info.set_index('ticker')

stock_info.head(5)

	date	open	high	low	close	volume
ticker
MMM	2020-03-26	131.79	136.38	130.61	136.18	6693932.0
MMM	2020-03-25	133.15	134.69	126.80	131.54	7740084.0
MMM	2020-03-24	122.29	133.45	121.00	132.72	9304832.0
MMM	2020-03-23	128.16	128.40	114.04	117.87	7920348.0
MMM	2020-03-20	138.07	139.24	122.71	124.89	9582251.0

Querying for News

A reliable source of news that can be queried by an application can get expensive quickly. Thankfully there are a few free/low cost solutions.

For this project I’ll be using NewsAPI.org. Similar to Alpha Vantage, we get 500 free requests a day. It allows for fairly granular search criteria and has plenty of reliable news sources.

We’re only interested in the article’s published date, url, title, and description. The same script as above on my GitHub parses the relevant data and inserts it into an SQL database for later use. For now it’s the previous 20 days’ worth of news.
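For readers who want a feel for the request without opening the repository, here is a minimal sketch. It assumes NewsAPI's `/v2/everything` endpoint with its `q`, `from` and `apiKey` parameters, and a response whose `articles` entries carry `publishedAt`, `url`, `title` and `description`. The helper names and the exact query string (the quoted ticker OR the quoted company name) are assumptions for illustration and may differ from the real script.

```python
import json
import urllib.parse
import urllib.request

NEWS_URL = "https://newsapi.org/v2/everything"


def build_query(ticker, company_name):
    """Match either the exact ticker or the exact company name."""
    return f'"{ticker}" OR "{company_name}"'


def fetch_news(ticker, company_name, api_key, from_date):
    """Request articles published since from_date (YYYY-MM-DD) and return the decoded JSON."""
    params = urllib.parse.urlencode({
        "q": build_query(ticker, company_name),
        "from": from_date,
        "apiKey": api_key,
    })
    with urllib.request.urlopen(f"{NEWS_URL}?{params}", timeout=30) as response:
        return json.load(response)


def parse_articles(ticker, payload):
    """Keep only the fields of interest: (ticker, date, url, title, description)."""
    rows = []
    for article in payload.get("articles", []):
        rows.append((
            ticker,
            article["publishedAt"][:10],        # keep only the YYYY-MM-DD part
            article["url"],
            article["title"],
            article.get("description") or "",   # description can be missing or null
        ))
    return rows
```

The rows returned by `parse_articles` have the same shape as the `news_daily` table shown below, so they could be written to SQLite with a single `executemany` call.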

Here’s a look at that news data:

news_db = sqlite3.connect('news.db')
news_info = pd.read_sql_query("SELECT * FROM news_daily", news_db)
news_info = news_info.set_index('ticker')

news_info.head(5)

	date	url	title	description
ticker
MMM	2020-03-24	http://www.marketwatch.com/story/cvs-plans-to-...	CVS plans to hire furloughed workers from its ...	CVS plans to hire furloughed workers from its ...
MMM	2020-03-18	https://www.marketwatch.com/story/biggest-make...	Biggest maker of face masks in U.S. is warning...	The biggest maker of medical face masks in the...
MMM	2020-03-19	https://news.ycombinator.com/item?id=22623807	Ask HN: How should I invest $200K in this market?	Comments
MMM	2020-03-03	https://www.fool.com/investing/2020/03/03/why-...	Why Shares of 3M Are Down Today	One of the better-performing industrials throu...
MMM	2020-03-10	https://seekingalpha.com/article/4330869-is-no...	It Is Not About 3M, It Is About You	3M is a great businesses. This makes investing...

Let’s take a closer look at something other than 3M:

news_info.loc['CSCO', 'title']

ticker
CSCO    Why Cisco Systems Stock Slumped 13.1% in February
CSCO    Cisco committing $225M to global coronavirus r...
CSCO    Why Cisco Stock Is Becoming Attractive Followi...
CSCO    Deep Dive: You can be ‘practically stealing’ q...
CSCO    Hedge Funds Have Never Been This Bullish On Ci...
CSCO    Deep Dive: These stocks may be your best choic...
CSCO    Raymond James and Cisco to Host a Tech Talk on...
CSCO    Deep Dive: These stocks soared the most after ...
CSCO    Deep Dive: Here are Thursday’s best-performing...
CSCO    Coronavirus school cancellations lead to educa...
CSCO                        Tech Is The Solution - Nasdaq
CSCO                 Tech Is The Solution - Yahoo Finance
CSCO    Deep Dive: These stocks took the biggest hit a...
CSCO    Oxbotica and Cisco to Solve Autonomous Vehicle...
CSCO    Oxbotica and Cisco to Solve Autonomous Vehicle...
CSCO    Microsoft Teams Adds 12M Customers In A Week A...
CSCO    Where Tech Stock Valuations Stand Following a ...
CSCO    Deep Dive: These U.S. stocks fell the most aft...
CSCO    Datadog Stock Finds Support Amid Coronavirus -...
CSCO                    Cisco begins new round of layoffs
Name: title, dtype: object

In case you haven’t seen the full code for how I’m sourcing this news, it’s worth noting that I’m searching for both the ticker and full company name when it comes to news.

This means that even for easily misunderstood tickers we’re still getting mostly relevant news:

news_info.loc['A', 'title']

ticker
A    You Have To Love Agilent Technologies, Inc.'s ...
A    Agilent Technologies to Adjourn Annual Meeting...
A    Agilent Technologies Announces Cash Dividend o...
A    Agilent Technologies Announces Webcasts for In...
A    Is Agilent Technologies Inc. (A) Going To Burn...
A    Kim Kardashian y Kylie Jenner dejan de vender ...
A    Agilent Introduces CrossLab Connect Services f...
A       Agilent Receives Two Scientists’ Choice Awards
A    Agilent Introduces Three New Microarrays for P...
A    Agilent and Visiopharm Co-promote Advanced Dig...
A    Bill Ackman Continues To Chip Away At Largest ...
A    Is Agilent Technologies Inc. (A) Going To Burn...
A    World Flow Cytometry Industry Outlook, 2020-20...
A    Hedge Funds Have Never Been This Bullish On Sq...
A    Worldwide Genomics Markets, 2020-2027 - Compre...
A    The global protein sequencing market is antici...
A    Asia Pacific Genomics and Proteomic Tools Mark...
A    Global Environmental Sensing and Monitoring Te...
A    Proteomics Industry Analysis, 2020-2026 - Outb...
A    Global DNA Sequencing Market (2020 to 2024) - ...
Name: title, dtype: object

Sentiment Analysis - VADER

Now that we’ve got a small bit of data compiled, we can start to explore it further and see if there are any correlations.

The first problem we’ll face is choosing how to determine the sentiment of a news article. The field of natural language processing and sentiment analysis is massive and generally complex.

I’m going to start this project off by using a simplified method of sentiment analysis - the Valence Aware Dictionary and sEntiment Reasoner (VADER) method. This is a rule-based/lexicon technique. It’s particularly good for social media related content, but we’ll test it out on news headlines now.

Normally for sentiment analysis you would remove stopwords (filler words of no value) from the text you are processing, but in the case of VADER it is advantageous to leave them in.

Let’s define our sentiment analysis function:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

# VADER needs its lexicon, which is a one-off download
nltk.download('vader_lexicon', quiet=True)

def determineSentiment(title):
    """
    Classifies the sentiment of a piece of text
    Returns a string: 'Positive', 'Neutral' or 'Negative'
    """

    # VADER polarity scores: 'neg', 'neu', 'pos' and the normalised 'compound'
    sia = SIA()
    compound = sia.polarity_scores(title)['compound']

    ## Check the compound result of the analysis
    ## Tolerances are near mirrors of those used in academia (though more sensitive to negativity)
    if compound >= 0.05:
        return 'Positive'
    elif compound >= -0.02:
        return 'Neutral'
    else:
        return 'Negative'

Now let’s pass in some titles and see how it classifies them:

# .iloc selects by position; plain [0] on a ticker-indexed Series is deprecated in recent pandas
positive_title = news_info.loc['A', 'title'].iloc[0]
print("Title: \"{}\" has sentiment: {} \n".format(positive_title, determineSentiment(positive_title)))

neutral_title = news_info.loc['F', 'title'].iloc[0]
print("Title: \"{}\" has sentiment: {} \n".format(neutral_title, determineSentiment(neutral_title)))

negative_title = news_info.loc['GPS', 'title'].iloc[2]
print("Title: \"{}\" has sentiment: {} \n".format(negative_title, determineSentiment(negative_title)))

Title: "You Have To Love Agilent Technologies, Inc.'s (NYSE:A) Dividend" has sentiment: Positive 

Title: "Factbox: Ford and General Motors' electric vehicle plans" has sentiment: Neutral 

Title: "The Ratings Game: Gap, Banana Republic at risk as coronavirus gives shoppers one more reason to avoid the mall" has sentiment: Negative 

As you can see from the three above examples, it is capable of identifying particularly clear sentiment. However, it isn’t too good with the nuances of financial language as it is not context aware:

unclear_title = news_info.loc['GPS', 'title'].iloc[7]
print("Title: \"{}\" has sentiment: {}".format(unclear_title, determineSentiment(unclear_title)))

Title: "The Gap Inc. (GPS): These Hedge Funds Caught Flat-Footed" has sentiment: Neutral

Now let’s modify our sentiment analysis function to begin checking batches of news:

# Build the analyser once; constructing it on every call reloads the lexicon
sia = SIA()

def determineSentiment(title):
    """
    Calculates the sentiment of a piece of text.
    
    Returns a float 'compound' between -1 (most negative) and +1 (most positive),
    which is the aggregate score of positive, negative and neutral sentiment
    """
    return sia.polarity_scores(title)['compound']

And let’s create a function to compile the sentiment of all news for a given ticker:

def generateSentimentDict(ticker):
    """
    Generates a dictionary with a key of YYYY-MM-DD 
    and a value of the aggregate (summed) news sentiment for that day
    """
    daily_sentiment = {}

    # Take the date from the same row as the title. Looking the date up by
    # title alone can match another ticker or a duplicated headline.
    articles = news_info.loc[[ticker], ['date', 'title']]
    for date, title in articles.itertuples(index=False):
        sentiment = determineSentiment(title)

        # Add the calculated sentiment to the dict of dates
        daily_sentiment[date] = daily_sentiment.get(date, 0.0) + sentiment

    return daily_sentiment
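One design choice worth spelling out: the function above sums the compound scores per day, so a day with many mildly positive headlines can outscore a day with one strongly positive headline, and the daily value is no longer bounded by VADER's -1 to +1 range. A daily mean is one alternative. The sketch below shows both aggregates side by side. The helper name and the example scores are made up purely for illustration.

```python
from collections import defaultdict


def daily_sum_and_mean(scored_articles):
    """Aggregate (date, compound score) pairs into a per-day sum, mean and count."""
    scores_by_day = defaultdict(list)
    for date, score in scored_articles:
        scores_by_day[date].append(score)
    return {
        date: {
            "sum": sum(scores),
            "mean": sum(scores) / len(scores),
            "count": len(scores),
        }
        for date, scores in sorted(scores_by_day.items())
    }


# Made-up scores: three identical headlines on one day, a single one on the next
example = [
    ("2020-03-02", 0.5), ("2020-03-02", 0.5), ("2020-03-02", 0.5),
    ("2020-03-03", 0.5),
]
for date, stats in daily_sum_and_mean(example).items():
    print(date, stats)
# 2020-03-02 {'sum': 1.5, 'mean': 0.5, 'count': 3}
# 2020-03-03 {'sum': 0.5, 'mean': 0.5, 'count': 1}
```

Which aggregate is preferable depends on whether news volume should itself count as signal: the sum rewards busy days, the mean ignores how many articles there were.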

Investigating the Trends

Now it’s time to gather all of our combined data and begin to plot and investigate things further.

First let’s generate a dictionary of the last month’s news sentiment for the Boeing Company (BA):

boeing_sentiment = generateSentimentDict('BA')
print(boeing_sentiment)

{'2020-03-10': 0.5574, '2020-03-17': -0.3182, '2020-03-23': -1.0513, '2020-03-11': 0.5994, '2020-03-06': 0.0, '2020-03-12': -0.296, '2020-03-18': -0.6249, '2020-03-21': 0.6597, '2020-03-03': 0.296, '2020-03-20': 0.296, '2020-03-15': 0.0, '2020-03-16': 0.0, '2020-03-24': 0.0, '2020-03-04': -0.0258, '2020-03-05': 0.0, '2020-03-26': 0.34, '2020-03-25': 0.1796, '2020-02-27': 0.0}

We can now plot the sentiment as a time series:

# Sort items from dict in ascending time and plot
date,sentiment = zip(*sorted(boeing_sentiment.items()))
plt.xlabel('Date')
plt.ylabel('Sentiment Score')
plt.xticks(rotation=45)
plt.plot(date,sentiment)

[<matplotlib.lines.Line2D at 0x7f73eb37d7d0>]

[Figure: Boeing daily news sentiment score over time]

Let’s plot Boeing’s share price:

from datetime import datetime

# Pull in our 20 day pricing information earlier compiled
stock_info = pd.read_sql_query("SELECT date, close FROM equities_daily WHERE ticker == 'BA'", eq_db)

# Convert the date to the datetime format for plotting (dates are stored as YYYY-MM-DD)
stock_info['date'] = pd.to_datetime(stock_info['date'], format="%Y-%m-%d")

date = stock_info['date']
price = stock_info['close']
plt.xlabel('Date')
plt.ylabel('Share Price')
plt.xticks(rotation=45)
plt.plot(date, price, color='red')

[<matplotlib.lines.Line2D at 0x7f73eb228710>]

[Figure: Boeing closing share price over time]

Now let’s overlay the two time series:

# Pull in the 20 day pricing information compiled earlier
date = pd.to_datetime(stock_info['date'])
price = stock_info['close']

# Convert sentiment from unordered dict to a date-ordered dataframe.
# News is also published on non-trading days, so sentiment is plotted
# against its own dates rather than against the price dates.
sentiment_df = pd.DataFrame(list(boeing_sentiment.items()), columns=['date', 'sentiment'])
sentiment_df['date'] = pd.to_datetime(sentiment_df['date'])
sentiment_df = sentiment_df.sort_values(by='date')

fig, ax1 = plt.subplots()

color = 'tab:red'
ax1.set_xlabel('Date')
ax1.set_ylabel('Share Price', color=color)
ax1.plot(date, price, color=color)
ax1.tick_params(axis='y', labelcolor=color)
ax1.tick_params(axis='x', labelrotation=45)

# Second y-axis sharing the same date axis
ax2 = ax1.twinx()

color = 'tab:blue'
ax2.set_ylabel('News Sentiment', color=color)
ax2.plot(sentiment_df['date'], sentiment_df['sentiment'], color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()
plt.show()

[Figure: Boeing share price (red) overlaid with news sentiment (blue)]

Interestingly, there seems to be a common trend between the two data sets.

However, there seems to be a strange spike in sentiment between 2020-03-17 and 2020-03-21.

After delving into the sentiment values generated in the dictionary above we can see that on 2020-03-21 the sentiment was 0.6597 and on 2020-03-20 it was 0.296.

Let’s find those titles:

news_info_ba = news_info.loc['BA']
print(news_info_ba.loc[news_info_ba['date'] == '2020-03-20', 'title'].iloc[0])
print(news_info_ba.loc[news_info_ba['date'] == '2020-03-21', 'title'].iloc[0])

Boeing halts dividend, share buyback program amid coronavirus pandemic - MarketWatch
Outside the Box: Coronavirus bailouts are coming: Here’s the smart way to help businesses and workers

“Boeing halts dividend, share buyback program amid coronavirus pandemic - MarketWatch”

VADER marked this as positive, which is questionable. In reality I’d argue it is negative, as halting dividends indicates a poor cash situation.

“Outside the Box: Coronavirus bailouts are coming: Here’s the smart way to help businesses and workers”

This seems like an opinion piece, and after further investigation it is. The sentiment is positive, but the article is not clearly relevant to Boeing.
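Judging the agreement between the two lines by eye is subjective, so it can help to put a number on it. The following is a minimal sketch, not part of the original analysis: it computes the Pearson correlation between daily sentiment and the closing price, optionally shifting the price by a few trading days to probe whether news leads or lags. The function names and the `lag` parameter are assumptions for illustration. With only about 20 trading days, any coefficient it produces is weak evidence at best.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def sentiment_price_correlation(sentiment_by_day, close_by_day, lag=0):
    """Correlate sentiment on a trading day with the close `lag` trading days later.

    Both arguments map 'YYYY-MM-DD' strings to floats. Only trading days that
    also have news are used, so weekend articles are ignored.
    """
    trading_days = sorted(close_by_day)
    pairs = []
    for i, day in enumerate(trading_days):
        j = i + lag
        if day in sentiment_by_day and 0 <= j < len(trading_days):
            pairs.append((sentiment_by_day[day], close_by_day[trading_days[j]]))
    if len(pairs) < 3:
        raise ValueError("need at least three overlapping days")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Possible usage with the objects built earlier in this post:
# close_by_day = dict(zip(stock_info['date'].astype(str).str[:10], stock_info['close']))
# print(sentiment_price_correlation(boeing_sentiment, close_by_day, lag=0))
```

Correlating price levels is the simplest option. Correlating sentiment with daily returns instead would be a reasonable variation, since price levels tend to trend.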

Conclusion

It would seem that there is indeed use in analysing news as a trading indicator. However, I do not think it should be the only factor when considering a trade.

If those two articles were removed from the dataset, we’d end up with a nearly perfect overlap of our lines. However, intervening manually would be a poor choice, as it would be impractical in a production system. This problem should really be fixed by using a better sentiment analysis method.

There are several caveats here, each of which points to a way to improve on this:

The analysis covered one of the most volatile markets we’ve had in years
The analysis covered a very short period (about one month)
VADER is optimized for the sentiment of social media text such as Tweets, not financial news headlines
News generally lags the price, and markets may not always respond to the news how we think they might

Ideally I’d like to have access to more news, and I’d certainly consider training something like a Naïve Bayes model for my sentiment determiner.

Training such a model would require a considerable amount of labeled data, which would take a long time to compile and hand label.
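To make the idea concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written from scratch so that it has no dependencies. The class name, the whitespace tokenisation and, above all, the training labels are assumptions for illustration: the headlines are taken from the output above, but the labels are hand-assigned guesses, and six examples are nowhere near enough for a usable model.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesHeadlines:
    """Multinomial Naive Bayes over lower-cased, whitespace-split words."""

    def fit(self, headlines, labels):
        self.label_counts = Counter(labels)        # label -> number of headlines
        self.word_counts = defaultdict(Counter)    # label -> word -> count
        for text, label in zip(headlines, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)   # add-one smoothing
            score = math.log(doc_count / total_docs)         # log prior
            for word in text.lower().split():
                if word in self.vocab:                       # skip unseen words
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand labels for a few headlines quoted earlier in this post
training = [
    ("You Have To Love Agilent Technologies, Inc.'s (NYSE:A) Dividend", "Positive"),
    ("Agilent Receives Two Scientists' Choice Awards", "Positive"),
    ("Why Cisco Systems Stock Slumped 13.1% in February", "Negative"),
    ("Cisco begins new round of layoffs", "Negative"),
    ("Why Shares of 3M Are Down Today", "Negative"),
    ("Boeing halts dividend, share buyback program amid coronavirus pandemic - MarketWatch", "Negative"),
]
model = NaiveBayesHeadlines().fit(*zip(*training))
print(model.predict("Boeing begins layoffs"))
# Negative
```

Unlike VADER's fixed lexicon, a model like this learns from the labels that "halts" or "layoffs" are bad news in a financial context, which is exactly where the headline examples above went wrong.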

As well as that, I think this strategy could work if paired with other signals to determine a longer term sentiment. Perhaps a modified turtle strategy that uses news as one of its indicators, only trading when the sentiment is positive for multiple consecutive days (or vice versa).
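The "positive for multiple consecutive days" rule can be sketched in a few lines. This is only an illustration of the filter, not a trading strategy: the function name, the `min_days` and `threshold` parameters and the 'long'/'short' labels are assumptions, and days without any news are simply skipped rather than breaking a streak. The data is the Boeing sentiment dictionary printed earlier.

```python
def consecutive_day_signals(sentiment_by_day, min_days=3, threshold=0.05):
    """Return (date, 'long' | 'short') for each day on which sentiment has
    held the same sign for at least min_days entries in a row.

    Scores within +/- threshold count as neutral and reset the streak.
    """
    signals = []
    streak_sign, streak_len = 0, 0
    for date in sorted(sentiment_by_day):
        score = sentiment_by_day[date]
        sign = 1 if score >= threshold else -1 if score <= -threshold else 0
        if sign != 0 and sign == streak_sign:
            streak_len += 1
        else:
            streak_sign, streak_len = sign, (1 if sign != 0 else 0)
        if streak_len >= min_days:
            signals.append((date, "long" if sign > 0 else "short"))
    return signals


# Daily Boeing sentiment as printed earlier in this post
boeing_sentiment = {
    '2020-03-10': 0.5574, '2020-03-17': -0.3182, '2020-03-23': -1.0513,
    '2020-03-11': 0.5994, '2020-03-06': 0.0, '2020-03-12': -0.296,
    '2020-03-18': -0.6249, '2020-03-21': 0.6597, '2020-03-03': 0.296,
    '2020-03-20': 0.296, '2020-03-15': 0.0, '2020-03-16': 0.0,
    '2020-03-24': 0.0, '2020-03-04': -0.0258, '2020-03-05': 0.0,
    '2020-03-26': 0.34, '2020-03-25': 0.1796, '2020-02-27': 0.0,
}

print(consecutive_day_signals(boeing_sentiment, min_days=2))
# [('2020-03-11', 'long'), ('2020-03-18', 'short'), ('2020-03-21', 'long'), ('2020-03-26', 'long')]
print(consecutive_day_signals(boeing_sentiment, min_days=3))
# []
```

Note that the 'long' on 2020-03-21 is produced by the two headlines questioned above, which shows how directly a misjudged headline would feed into a signal like this.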

Thanks for reading!
