Can data science capture key insights in news articles?

Itua Etiobhio, Riyad Khan and Steve Blaxland

The volume of information available to supervisors from public sources has grown enormously over the past few years, including unstructured text data from traditional news outlets, news aggregators, and social media. This presents an opportunity to use data science techniques to gain valuable insights. By utilising sophisticated analytical tools, can supervisors identify hidden patterns, detect emerging events and gauge public sentiment to better understand risks to the safety and soundness of banks and insurance firms? This article explores how data science could help central bank supervisors discover significant events, capture public trends and ultimately supervise more effectively.

Using news articles as a source of data

In this article, we investigate whether we can identify events of interest, public opinion and other useful insights relating to banks. News articles are a valuable and timely source of varied information, including events such as mergers and acquisitions, economists’ opinions about firms’ business performance, and even emerging threats like bank runs. This makes them a valuable data set to which data science techniques can be applied to extract key information.

Our data source is Factiva Analytics, a credible news aggregator with sources including The Times, The Telegraph and SNL Financial, housing over 32,000 major global newspapers, industry publications, reports, and magazines. By using an aggregator with credible sources, supervisors can filter out fake news and access reliable information. With trustworthy news stories at their disposal, they can be alerted to potential problems that may require their attention, without making decisions based solely on these stories.

Using Factiva, we extracted news articles about 25 regulated banks of different sizes over the period 1 January 2022 to 21 March 2023, resulting in a data set containing 175,000 articles. Many of these were near-duplicates, published across multiple distribution channels with only slight textual differences. Using FinBERT, a language model trained on financial text, we calculated the degree of similarity between different financial articles and generated a similarity matrix. The algorithm treats each article as a vector in a multi-dimensional vector space. The closeness of two vectors is measured using cosine similarity, which represents the similarity between news articles: the smaller the angle between two vectors, the higher the score and the more similar the articles. The pairs with the highest scores are the most similar in the data set. An example of a single day’s output is shown below.
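To make the similarity step concrete, here is a minimal sketch of a cosine similarity matrix. It assumes each article has already been turned into a numeric vector (in our case that would be done by FinBERT); the three short vectors and the function names below are purely illustrative and are not taken from our pipeline.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty vectors (an assumption)
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity between every pair of article vectors."""
    n = len(vectors)
    return [
        [cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy vectors standing in for article embeddings.
article_vectors = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],  # points in the same direction as the first vector
    [0.0, 0.0, 3.0],  # orthogonal to the first two
]

matrix = similarity_matrix(article_vectors)
```

In this toy example the first two vectors score approximately 1 against each other, while the third scores 0 against both, which is the pattern the similarity matrix is designed to surface.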

Chart 1: The cumulative total number of articles that have a similarity score above a threshold for a single day of articles (3 October 2022)

Five articles have a similarity of 1, meaning they are identical, while 130 others have a similarity score of 0.99. Such high similarity between news articles demonstrates why it would be inefficient (as well as unrealistic) for supervisors to try consuming all such data. By setting the similarity score threshold at 0.99, we removed highly similar articles from the data set. Applying this method, along with filtering out regulatory articles, news summaries and local news, we reduced the total number of articles by 45%, ensuring supervisors can use their time more effectively by focusing only on unique articles related to their firms.
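One simple way to apply such a threshold is sketched below. It is an illustration rather than a description of our exact procedure: it assumes articles are processed in order, that the first article in any group of near-duplicates is the one retained, and that a score at or above the threshold counts as a duplicate. The vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_near_duplicates(vectors, threshold=0.99):
    """Return indices of articles to keep.

    An article is kept only if its similarity to every article already
    kept is below the threshold (greedy, order-dependent: an assumption).
    """
    kept = []
    for idx, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Hypothetical toy vectors standing in for article embeddings.
article_vectors = [
    [1.0, 0.0],
    [1.0, 0.01],  # almost the same direction as article 0
    [0.0, 1.0],   # unrelated article
    [1.0, 0.0],   # exact copy of article 0
]

kept_indices = drop_near_duplicates(article_vectors)
```

Here only articles 0 and 2 survive. A greedy pass like this avoids comparing every pair once duplicates start to be discarded, but the choice of which copy to keep is arbitrary.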

Credit Suisse case study

To test our approach, we looked at Credit Suisse, a firm with a large corpus of news data that had gone through a turbulent period over the last few years. The test was carried out in hindsight. In practice, we would expect any such analysis to be carried out in real time.

UBS announced it would acquire Credit Suisse on 19 March 2023, ahead of which there was a cascade of rumours and information communicated through traditional news outlets and social media. To understand this, we used network analysis, PageRank and keyword analysis to identify and analyse any events of interest over a 15-month period.

Network analysis

The use of network analysis provides a way to explore the interconnectedness of banks through global media. The primary assumption is that the co-appearance of banks in news articles reveals a connection between them. Each news article forms the root of a directed acyclic graph (DAG), with nodes created for every other bank mentioned within the same article. A visualisation of a network with Credit Suisse at the heart of the analysis is shown below.

Figure 1: Network analysis on Credit Suisse

In Figure 1, the strength of the link between any two banks is determined by the number of news articles in which both banks are mentioned, while the direction of the arrow represents the direction of the narrative flow. For example, the arrow pointing from Credit Suisse towards UBS indicates that Credit Suisse has been identified as the primary subject in the corpus of articles, with the topic being its acquisition by UBS.
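The link-counting logic described above can be sketched as follows. This is a simplified illustration: it assumes the primary subject of each article and the other banks mentioned have already been extracted, which is a separate step not shown here. The field names and "Bank A" are hypothetical.

```python
from collections import Counter


def build_edges(articles):
    """Count directed links from each article's primary bank to every
    other bank mentioned in the same article."""
    edges = Counter()
    for article in articles:
        source = article["primary"]
        # Use a set so a bank mentioned twice in one article counts once.
        for target in set(article["mentions"]):
            if target != source:
                edges[(source, target)] += 1
    return edges


# Hypothetical, hand-made input records.
articles = [
    {"primary": "Credit Suisse", "mentions": ["UBS"]},
    {"primary": "Credit Suisse", "mentions": ["UBS", "Bank A"]},
    {"primary": "UBS", "mentions": ["Credit Suisse"]},
]

edges = build_edges(articles)
```

The resulting count for each (source, target) pair plays the role of link strength, and the ordering of the pair gives the arrow direction.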

We conducted sentiment analysis on each news article to measure overall positive or negative sentiment towards the banks involved. The sentiment value is then attributed to the corresponding link in the network, represented by the colour of the connection, with red being negative and blue positive sentiment. For example, the above diagram shows that Credit Suisse and UBS have a strong connection with a negative sentiment.
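As a rough illustration of attributing sentiment to links, the sketch below averages article-level scores for each link and maps the result to a colour. The averaging rule, the score range of -1 to 1, the treatment of exactly zero as blue, and the example scores are all assumptions made for the sake of the example.

```python
from collections import defaultdict


def link_sentiment(scored_articles):
    """Average the sentiment score of all articles behind each link.

    scored_articles: iterable of (source_bank, target_bank, score).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for source, target, score in scored_articles:
        totals[(source, target)] += score
        counts[(source, target)] += 1
    return {edge: totals[edge] / counts[edge] for edge in totals}


def link_colour(mean_score):
    """Red for negative sentiment, blue otherwise (zero is an edge case)."""
    return "red" if mean_score < 0 else "blue"


# Hypothetical scores; "Bank A" is a placeholder name.
scored_articles = [
    ("Credit Suisse", "UBS", -0.8),
    ("Credit Suisse", "UBS", -0.4),
    ("UBS", "Bank A", 0.5),
]

sentiment = link_sentiment(scored_articles)
colours = {edge: link_colour(score) for edge, score in sentiment.items()}
```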

This method, leveraging Artificial Intelligence (AI) to create a network of connections and sentiments, can provide value to supervisors. This technique enables us to understand the patterns of interconnectivity between banks and how this changes over time, as a way of tracking and understanding unfolding events, and potential knock-on consequences from counterparty risk. Additionally, sentiment analysis can act as an early warning indicator, with shifts in sentiment often indicating significant market events.

Keyword analysis

Using keyword analysis, we tagged articles with themes that are of interest to us to produce a themed timeline. Spikes in the volume of articles can indicate an event of interest. Manually reading a subset of news articles showed that two themes occurred frequently:

Change in management.
Change in credit rating.

We used a list of keywords we created to show the volume of articles related to these themes. A sample of key events is tagged in the charts below.

Chart 2: Credit Suisse timeline – change in management

Notes: Chart shows the number of articles per week from 1 January 2022 to 21 March 2023. Colours represent number of articles related to a keyword.

Chart 3: Credit Suisse timeline – credit rating

Chart 3 shows how we can identify news articles and events that could indicate financial stress. Supervisors can spot spikes in the timeline and decide to investigate further. Spikes in the volume of such articles can be used to gauge the scale of the event. The more news articles discussing the same topic, the bigger the event.
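The themed timeline described above can be sketched in a few lines. The keyword lists here are hypothetical stand-ins (our actual lists are not reproduced in this post), matching is by simple substring search, and the headlines are invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical keyword lists, for illustration only.
THEME_KEYWORDS = {
    "change in management": ["resigns", "steps down", "appointed", "new ceo"],
    "change in credit rating": ["downgrade", "upgrade", "credit rating", "outlook"],
}


def tag_themes(text):
    """Return every theme with at least one keyword present in the text."""
    lowered = text.lower()
    return [
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def weekly_counts(articles):
    """Count tagged articles per (ISO year, ISO week, theme)."""
    counts = Counter()
    for published, text in articles:
        iso_year, iso_week, _ = published.isocalendar()
        for theme in tag_themes(text):
            counts[(iso_year, iso_week, theme)] += 1
    return counts


# Invented headlines about a placeholder firm.
articles = [
    (date(2022, 10, 3), "Bank A chairman steps down"),
    (date(2022, 10, 5), "Agency issues downgrade for Bank A"),
    (date(2022, 10, 6), "Bank A hit by downgrade as outlook cut"),
    (date(2022, 10, 12), "Bank A opens new branch"),
]

counts = weekly_counts(articles)
```

Plotting these counts by week gives a timeline of the kind shown in Charts 2 and 3. An article matching several keywords within one theme is counted once for that theme.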

Identifying key news titles

As a complement to the above indicators, it can be helpful to identify the key news titles within the corpus of documents being analysed. PageRank is an unsupervised algorithm based on graph theory, originally designed for ranking web pages, that has been adapted for identifying important sentences in text based on their semantic similarity to the rest of the document. The algorithm treats each news title as a node in a graph and uses cosine similarity to weight the links between nodes. The higher the similarity, the more closely related the titles, and the titles with the highest PageRank scores are considered the most important and representative in the data set.
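A minimal sketch of this ranking idea is given below. To keep it self-contained, titles are represented by simple word counts rather than by a language model, which is a deliberate simplification and not the representation used in our analysis. The damping factor of 0.85, the fixed number of iterations and the example titles are also assumptions.

```python
import math
from collections import Counter


def bag_of_words(title):
    """Very simple stand-in for a semantic vector: lower-cased word counts."""
    return Counter(title.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def rank_titles(titles, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a title-similarity graph."""
    n = len(titles)
    vectors = [bag_of_words(t) for t in titles]
    # Edge weight = cosine similarity; no self-links.
    weights = [
        [0.0 if i == j else cosine_similarity(vectors[i], vectors[j])
         for j in range(n)]
        for i in range(n)
    ]
    out_strength = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Titles with no similar neighbours pass on no score.
            incoming = sum(
                weights[j][i] / out_strength[j] * scores[j]
                for j in range(n)
                if out_strength[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)


# Invented titles for illustration.
titles = [
    "bank shares fall after loss",
    "bank reports quarterly loss",
    "shares fall as bank reports loss",
    "weather forecast for weekend",
]

ranked = rank_titles(titles)
```

In this toy example the title that overlaps most with the others ranks first, and the unrelated title ranks last.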

Table A: Key news titles on Credit Suisse in 2022

Table A illustrates that in 2022 Q3 and Q4, news flow around Credit Suisse showed a handful of major themes, including losses, management, and decreases in its share price, which were not apparent in Q1 and Q2.

This approach can enable supervisors to quickly zero in on the most significant information in news articles, saving time and effort compared to manually reading and summarising each article. The extracted key titles can be used for various purposes, including monitoring news coverage and tracking market sentiment.

Conclusion

Leveraging data science techniques to identify event-driven insights from news articles can be a valuable input to judgement-based supervision.

In this article, we showed how network analysis and complementary methods can identify events of interest and a handful of key themes relating to a single firm, Credit Suisse. The power of such analysis is its scalability, ie similar analysis can be applied regularly to multiple firms and across industries and jurisdictions, supporting efficient and effective supervision. However, there are limitations and challenges, including incorporating insights from articles written in multiple languages. In our sample, 60% of the articles from Factiva are non-English and these are not included in our analysis here. Factiva does not currently provide translations of articles.

Rapid developments in other AI fields, such as natural language models, could provide further valuable insights. For example:

Text-summarising models, such as Large Language Models (LLMs) and cloud-based summarisation tools from Microsoft Azure, Google and AWS, can extract key information from documents, enabling supervisors to read key points rather than whole articles.
Translation models can convert non-English articles to English, allowing further insights to be gathered.

With data science methods improving along with powerful cloud computing, these techniques have the potential to perform these complex tasks with increased accuracy.

This post was written while Itua Etiobhio was working in the Bank’s RegTech, Data & Innovation division. Riyad Khan and Steve Blaxland work in the Bank’s RegTech, Data & Innovation division.

If you want to get in touch, please email us at [email protected] or leave a comment below.

Comments will only appear once approved by a moderator, and are only published where a full name is supplied. Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.
