Synthetic data: The future of business decision-making?

Synthetic data offers a compelling solution to the challenges of data collection and analysis. But can AI-generated C-suite insights truly transform how companies make crucial decisions? B2B International's Amanda Chew explores.

by Amanda Chew | 07/09/2024

Advancements in Natural Language Processing (NLP) models and generative artificial intelligence (GAI) models have fundamentally changed the way that we think of human interaction—think AI chatbots and smart assistants. With large language models (LLMs) like OpenAI’s GPT-4 passing the Turing test, we are more inclined to attribute human qualities to data than ever before.

Enter synthetic data. Synthetic data refers to data generated by AI based on real data sets. Originally developed to create more training data for machine learning algorithms, the ability to shape the created data allowed data scientists to mitigate weaknesses inherent to training algorithms on real datasets. With the widespread adoption of synthetic data, Gartner expects that synthetic data will completely overshadow real data in AI models by 2030.

The value of synthetic data in business decision-making

Increasingly, synthetic data is touted as the solution for businesses when it comes to data collection for audience research. Typically generated through LLMs, its applications include predicting the behaviour of audiences that were previously difficult to reach, querying data on sensitive audiences without data privacy concerns, and simulating business scenarios without requiring advanced analytical models.

Imagine the value of getting the opinions of 10 AI-generated C-suite stakeholders to help guide a business decision at minimal cost; synthetic data presents a compelling way to make the problems associated with data collection and analysis go away.

However, how reliably can synthetic data simulate how stakeholders behave? And specifically, is it reliable enough for businesses to use it to guide important decisions?

Capturing nuances in preferences and attitudes

Let’s address the elephant in the room here—experiments conducted by various organisations show results as promising as a 95% match between synthetic data and survey results. This suggests that synthetic data is capable of strongly mirroring real survey results. However, we need to remember that synthetic data is, at the heart of it, an approximation of reality, albeit a close one.

Emporia Research conducted an experiment comparing its survey data with two sets of synthetic responses generated by GPT-4: AI-generated IT decision-maker personas, and personas generated from the LinkedIn profile data of identified decision makers. While the data showed a similar directional trend, both synthetic data models were generally overly optimistic compared to the survey data, and individual synthetic responses fell within a narrower range than human responses. This perhaps indicates not only a loss of nuance but, more importantly, a loss of the variance that allows us to dive deeper into audiences and understand the underlying reasons driving their behaviour.
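To make the two effects described above concrete, here is a minimal sketch of how an optimism gap and a loss of variance could be measured. The ratings are invented purely for illustration (they are not Emporia Research's data), and the names `summarise`, `optimism_gap` and `variance_ratio` are hypothetical.

```python
from statistics import mean, pstdev


def summarise(responses):
    """Return (mean, population standard deviation, range width) for a list of ratings."""
    return mean(responses), pstdev(responses), max(responses) - min(responses)


# Hypothetical 1-5 agreement ratings, invented for illustration only.
survey = [1, 2, 2, 3, 3, 4, 4, 5, 5, 3]
synthetic = [3, 4, 4, 4, 4, 4, 5, 4, 4, 4]

survey_mean, survey_sd, survey_range = summarise(survey)
synth_mean, synth_sd, synth_range = summarise(synthetic)

# A positive gap means the synthetic responses are more optimistic on average.
optimism_gap = synth_mean - survey_mean

# A ratio below 1 means the synthetic responses are less spread out than the survey.
variance_ratio = synth_sd / survey_sd

print(f"Optimism gap: {optimism_gap:.2f} points")
print(f"Spread ratio (synthetic / survey): {variance_ratio:.2f}")
print(f"Range width: survey {survey_range}, synthetic {synth_range}")
```

In this made-up example the two data sets could share the same directional trend while the synthetic set still sits higher on the scale and covers a narrower band of it, which is the pattern a simple comparison of averages would miss.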

For example, primary research we’ve conducted with early-stage startups about their data protection practices allowed us to size the frequency and magnitude of the challenges they face, which models skewed towards positive responses would likely not highlight. These findings allowed the client to optimise their product proposition at a relatively early stage and bring the right message to the right audience.

Replicating or challenging pre-conceived notions?

Beyond the similarities in outputs produced by synthetic data compared to real datasets, another essential issue to consider is the availability of real and accurate data to generate synthetic data in the first place.

Part of our fascination with LLMs is how they appear to understand and generate informed knowledge on seemingly infinite topics. They are trained on vast quantities of text data; GPT-4, the most familiar to us, is trained on publicly available information and licensed third-party data. With a seemingly infinite trove of data, we almost forget that LLMs are not all-knowing; in fact, they are only as good as the quantity of good-quality data they are trained with.

Deriving insights among difficult-to-reach populations is a benefit frequently raised by advocates for synthetic data. However, this means there is also likely to be limited data available on these groups. When there is insufficient information within a dataset, we risk amplifying incomplete information about under-represented populations.

Researchers have found that AI language models are “not able to model judgements of people whose cultures are not represented in their dataset”, with specific models having their own biases based on how they were built. Referencing the popular GPT models, the study highlighted that the views of the liberal, higher-income, and highly educated were overrepresented. This means that we risk repeating biases based on certain worldviews in decision-making.

On the other hand, the boots-on-the-ground approach in primary research ensures that we continue to challenge held perceptions. For example, we recently ran a qualitative research programme in the agriculture industry, where the demographics and behavioural patterns we identified shifted the understanding of how the farming community engages with commercial organisations.

The algorithm speaks?

In essence, a large language model is an algorithm that emulates and processes human language to show “statistically likely continuations of word sequences” based on all the data it was trained on. A study by researchers from Princeton and Yale highlighted that LLMs “are trained on many people, but then tend to collapse the diversity of judgments into a single modal opinion”. This leads to an interesting question: would asking an LLM for the opinions of a hundred different people be the same as asking one person for their opinion a hundred times?
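One way to put a number on that question is to measure how diverse a set of answers is. The sketch below uses Shannon entropy as the diversity measure, which is our own choice for illustration and not the method used in the study quoted above. The answer lists are hypothetical, as is the function name `response_entropy`.

```python
from collections import Counter
from math import log2


def response_entropy(answers):
    """Shannon entropy (in bits) of the distribution of answers.

    Higher values mean opinions are spread across more options;
    a value near zero means almost everyone gave the same answer.
    """
    counts = Counter(answers)
    total = len(answers)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# Hypothetical answers to one attitude question, invented for illustration only.
hundred_people = ["agree"] * 40 + ["neutral"] * 35 + ["disagree"] * 25
one_person_repeated = ["agree"] * 95 + ["neutral"] * 5

diverse = response_entropy(hundred_people)
collapsed = response_entropy(one_person_repeated)

print(f"Hundred different people: {diverse:.2f} bits")
print(f"One modal opinion repeated: {collapsed:.2f} bits")
```

If a hundred synthetic personas score closer to the second list than the first, the model is behaving more like one respondent asked a hundred times than like a hundred distinct respondents.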

In its 2023 experiment on synthetic data, Kantar compared verbatim responses around attitudes towards technology from human data with GPT-4’s repeated responses. In addition to the overly positive responses discussed earlier, it discovered that the responses generated by GPT-4 repeated stereotypes and lacked variance or nuance. However, it is precisely these variances and nuances that highlight the complexities in human decision-making and bring an added dimension to building credible, relatable personas.

In our own experience, we ran a B2B research programme reaching a niche segment of medical specialists. Given the small niche, we would have expected their needs to be broadly similar. By interrogating the collected qualitative interviews and survey data, we developed and sized seven distinct personas driven by personal motivations and varied life stages. This allowed us to create varied strategies to reach those whose motivations and needs best corresponded to the client's product offering. More importantly, the personal narratives shared with us highlighted the human tensions they faced in their decision-making, and these brought the personas to life.

Putting the human at the centre of it

The launch of ChatGPT in late 2022 rocketed AI into the collective consciousness with new uses and constant technological improvements. With the significant strides achieved in the last year, the case for synthetic data will only get stronger over time. For now, however, the challenges around replacing data collection with synthetic data in the research process suggest that it may be premature to make a complete pivot.

This is not to say that synthetic data does not have a place in the corporate boardroom: the large pool of accessible information remains invaluable for developing hypotheses and refining ideas. However, crucial business decisions should continue to be guided by human data, which captures the intersection of logic and emotion that makes up preferences and behaviour and accounts for the human element in making better business decisions.

Amanda Chew is a senior research executive at Merkle's B2B International.