
Risk and Reward: The Reliability of Alternative Data

Alternative data, especially sentiment-based analytics, is taking off, but the question remains: How reliable is it really?

20 Aug 2020 – Market Insights
Vast quantities of unstructured data make it difficult for algorithms to separate signal from noise.

The alternative data boom

As economic uncertainty from the COVID-19 crisis sends companies hunting for dynamic, real-time updates, many have turned to alternative data for an information edge, and the market is only expected to grow.

The power of harnessing alternative data lies in timely and granular results that can construct a holistic view of a company’s performance from multi-dimensional information sources, such as social media and news updates, consumer credit card purchase data, auto insurance logs, AWS usage, and more. As social media, news, and blogs become an increasingly integral part of our everyday life and business activities, visionary companies and funds are leveraging sentiment analysis to innovate their investment models.

Social media sentiment analysis is rooted in natural language processing and machine learning algorithms. It can process millions of social media posts a day, assigning each post a score to indicate its polarity: positive, negative, or neutral. The results are then consolidated into a sentiment score that yields the general social sentiment on a particular object such as a company or brand.
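To make the consolidation step concrete, here is a minimal sketch of one common way to turn per-post polarity labels into a single score. The function name `consolidated_sentiment` and the net-sentiment formula (positive minus negative, divided by total posts) are assumptions made for illustration, not a description of any particular vendor's scoring method.

```python
from collections import Counter


def consolidated_sentiment(labels):
    """Return a net sentiment score in [-1, 1] for a list of polarity labels.

    Hypothetical formula: (positive - negative) / total posts.
    Neutral posts count toward the total, so they pull the score toward 0.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no posts, no signal
    return (counts["positive"] - counts["negative"]) / total


# Four posts about one company: two positive, one negative, one neutral.
posts = ["positive", "positive", "negative", "neutral"]
score = consolidated_sentiment(posts)
```

A score near 1 would indicate overwhelmingly positive chatter, a score near -1 overwhelmingly negative, and a score near 0 either a balanced or a mostly neutral conversation.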

But despite the competitive information advantage sentiment analysis can provide, it comes with its fair share of risk.

The big risk of alternative data — and how to maximize reward

Let’s look at the challenges of getting high-quality social media sentiment analysis.

Compared to traditional data, social media sentiment data is considered more granular, but it is also harder to parse and to draw useful signals from. The sheer volume of data generated online every day is enormous: Twitter alone sees 500 million tweets a day, and Reddit has over 2 million daily comments. All of this data is real-time and dynamic, meaning the public sentiment it reflects can change from one hour to the next.

Because raw data is useless until interpreted and analyzed accurately, data processing systems have to sift through huge quantities of unstructured and heterogeneous data (e.g. information embedded in text, images, or video), which is notoriously difficult to handle.

Semantic ambiguity is another problem: for example, when people talk about bananas (the fruit) and Banana Republic (the brand), it can be difficult for a machine to recognize which one is relevant. The circulation of spammers, fake accounts, and unreliable news poses another challenge, while an isolated set of data points means a lack of comprehensive information, making it difficult to draw cohesive insights.

Altogether, these vast quantities of unstructured data make it difficult for algorithms to separate signal (desired information) from noise (background or misleading information). Noisy data can easily lead to illusory trends and false analyses.

So, how can we effectively separate useful signals from large amounts of noise, and ensure data reliability?

First, we employ a deep-learning multi-category classification system to sort and categorize named entities. We use semantic context to train our classification algorithms, so they can distinguish entities based on their meaning; for example, our system can differentiate banana the fruit from Banana Republic the brand, or Canada goose the animal from Canada Goose the apparel company.
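The idea of using surrounding context to tell entities apart can be illustrated with a toy example. The sketch below is not a deep-learning classifier: it swaps in a simple keyword-overlap score, and the `CONTEXT_HINTS` vocabularies and the `disambiguate` function are hypothetical names invented for this illustration. A trained model would learn such contextual signals from data instead of relying on hand-picked word lists.

```python
import re

# Hypothetical context vocabularies, hand-picked for illustration only.
CONTEXT_HINTS = {
    "Banana Republic (brand)": {"republic", "store", "jacket", "sale", "outfit", "shopping"},
    "banana (fruit)": {"eat", "smoothie", "ripe", "peel", "fruit", "breakfast"},
}


def disambiguate(text, hints=CONTEXT_HINTS):
    """Pick the sense whose context words overlap most with the text.

    Returns None when no context word matches, i.e. the text is ambiguous.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(tokens & words) for sense, words in hints.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


brand = disambiguate("Picked up a jacket at the Banana Republic sale")
fruit = disambiguate("I eat a ripe banana for breakfast")
```

Here the first post resolves to the brand and the second to the fruit, while a bare mention with no surrounding context stays unresolved.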

Second, we remove unreliable sources of information, such as spammers and repetitive texts. The key is developing models that ensure data accuracy; otherwise, the results would be “garbage in, garbage out.” Our machine learning models remove spammers and bots, as well as redundant information, such as news stories or posts that have been reported or reposted dozens of times.
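Removing reposted content can be sketched in a few lines. The example below is a simplified, rule-based stand-in for the machine learning models mentioned above: it normalizes each post and keeps only the first copy of each normalized text. The `normalize` and `drop_reposts` helpers, and the specific cleanup rules, are assumptions made for illustration.

```python
import re


def normalize(text):
    """Reduce a post to a comparable form: lowercase, no links, no punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)    # drop links
    text = re.sub(r"^rt\s+@\w+:\s*", "", text)  # drop a leading retweet prefix
    text = re.sub(r"[^a-z0-9\s]", "", text)     # drop punctuation
    return " ".join(text.split())               # collapse whitespace


def drop_reposts(posts):
    """Keep the first occurrence of each normalized post, in original order.

    Posts that normalize to an empty string (e.g. a bare link) are dropped.
    """
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post)
        if key and key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


posts = [
    "Acme beats estimates!",
    "RT @trader: Acme beats estimates! https://t.co/abc",
    "acme  beats estimates",
    "Acme misses estimates",
]
unique_posts = drop_reposts(posts)
```

Exact matching after normalization only catches near-verbatim copies; detecting paraphrased reposts or bot accounts would need fuzzier similarity measures and account-level features.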

Third, we recognize that industry experts and influencers know more about the industry and thus exert more impact on the market than average users, meaning their opinions can carry stronger signals than others. We identify and rank the most relevant experts and influencers among millions of users using a network metric computation model and semantic content analysis. Their insights can generate more granular and accurate analytics and signals for market players.
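One well-known network metric for ranking accounts is PageRank, computed over a graph of who mentions or reposts whom. The article does not say which metric is used, so treat the sketch below purely as an assumed example; the `pagerank` function, the edge direction (source mentions target), and the sample accounts are all hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Rank accounts by PageRank using plain power iteration.

    edges: list of (source, target) pairs, read as "source mentions target".
    Returns a dict mapping each account to a score; scores sum to 1.
    """
    nodes = sorted({name for edge in edges for name in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                # Split this account's rank among the accounts it mentions.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Account mentions no one: spread its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Three users mention the same analyst, who mentions one of them back.
edges = [("a", "analyst"), ("b", "analyst"), ("c", "analyst"), ("analyst", "a")]
ranks = pagerank(edges)
top_account = max(ranks, key=ranks.get)
```

In this small graph the analyst account comes out on top because it attracts the most mentions. A graph score like this would be only one input; combining it with semantic analysis of what each account actually posts is what separates relevant experts from merely popular accounts.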

As an investor, the most effective way to integrate alternative data is to marry it with traditional data. When used to enhance traditional sources of data, alternative data can help investors improve efficiency and productivity, unearth hidden insights from online data, and predict performance ahead of quarterly reports and analysts’ estimates. It can also identify early signs of market changes and alert investors to them, so they can dynamically adjust their investment strategies.

Wissee’s knowledge graph technology comes into play here, helping us connect multi-dimensional data points into actionable insights that investors can use to gain an alpha advantage and outperform competitors, especially in uncertain environments such as the COVID-19 crisis.

The age of alternative data is here. How will you leverage its advantages?

Authors:

Dr. Qingqing Han, Co-Founder, Wissee

Dr. Huina Mao, Founder, Wissee

*With thanks to Maggie Yu for her contributions