Note: This article is a less technical version of an article recently published on the Zappi Tech blog. If you're a developer, or want a more detailed behind-the-scenes look at Zappi tech, check out the original post here.
Market research automation is, by now, a familiar concept within the market research community. It uses technology to make market research quicker, more cost-effective, and more efficient.
A question that often arises about market research automation is whether it's truly "better". If the approach is templated and machine-driven, are the results actually as good as those from traditional market research firms?
The truth? As things stand, a process that was entirely automated, with no human insight, probably wouldn't be. There are some issues automation still doesn't address, and these must be handled or filtered out manually. This post looks at how developers are using machine learning and other methods to make market research results better than ever before.
This post goes behind the scenes: it looks at what those issues are, compares four machine learning methods developers could use to address a big one, and covers what humans are doing in the meantime. For more on artificial intelligence in market research, look here.
Current technical developments in market research automation
Here are some technological developments currently being worked on in market research:
- Recognizing topics in open-ended survey questions. For the sake of this article, we'll focus on an important first step: recognizing gibberish. We'll compare some new machine learning solutions as well as some more classic algorithms.
- Centralizing multiple market research reports and cross-analyzing data to determine what’s useful and drawing insights for you.
- Determining sentiment in free-form available data such as tweets or reviews.
- Leveraging cutting-edge technology to provide new ways of collecting data. One example is our partner Affectiva, which helps us provide facial coding by scanning survey participants' faces to gauge their emotional response while they watch adverts.
An immediate technical update being actively worked on by our developers is recognizing gibberish in open-ended questions. This will let automated reports arrive quicker and cleaner, as they won't need human intervention to fix or edit responses. The next section walks you through what's being done on the back end to solve this problem.
Why gibberish should be removed automatically (from a Zappi developer)
At Zappi, we collect a lot of survey data. While most of the survey questions require choosing from a list of options (like radio buttons), we also collect open-ended answers in text boxes. An issue with online open-ended questions is that some respondents don’t take the survey seriously and mash the keyboard to get to the next question.
Aside from not being in the spirit of the survey (respondents are paid), it compromises the analysis further down the line, for example in word clouds.
While we used a stop-word list to remove less relevant words, we had no way of identifying gibberish.
The task essentially boils down to a classification problem. We have two classes that we want to distinguish between: gibberish and non-gibberish. Any algorithm that can do this can be called a classifier.
One of the standard ways of assessing classifier performance in machine learning is the Receiver Operating Characteristic, or ROC, curve. Most classifiers for a binary true/false case predict a numerical score, where higher means more likely to be true.
If the classifier separates the two classes well, such that there is no overlap between the scores given to the known true and known false cases, then a threshold can be placed in the middle, giving a perfect result. However, most of the time the classifier doesn't separate the two entirely, and the overlap causes some misclassifications regardless of where the threshold is placed.
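To make the threshold idea concrete, here is a minimal sketch of how a ROC curve's points are produced by sweeping a threshold over classifier scores. The scores, labels, and function name are made-up illustrations, not our real data or code:

```python
def roc_points(scores, labels):
    """One (false-positive-rate, true-positive-rate) pair per candidate threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and l for p, l in zip(predicted, labels))          # true positives
        fp = sum(p and not l for p, l in zip(predicted, labels))      # false positives
        points.append((fp / neg, tp / pos))
    return points

# Made-up classifier scores (higher = more likely genuine) and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, True, False, True, False]
print(roc_points(scores, labels))
```

Because the true and false cases overlap in score here, no single threshold classifies everything correctly, which is exactly the situation described above.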
Accuracy has some problems in our case, though. It's far more important for us to avoid misclassifying genuine data as gibberish, and thereby removing valuable information, than to let some gibberish slip through. With accuracy alone, we have no way of knowing what type of errors our classifier makes. Because of this, we judged candidates by a different metric: positive predictive value (PPV), the fraction of responses flagged as gibberish that really are gibberish. Here are four machine learning methods we tested in an attempt to roll gibberish detection out.
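As a small illustration of the error-sensitive metric described above, here is a sketch with made-up counts (the function name and numbers are ours, purely for illustration):

```python
def positive_predictive_value(true_positives, false_positives):
    """Of everything flagged as gibberish, what fraction really was gibberish?"""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0

# Made-up counts: 95 real gibberish responses flagged, 5 genuine answers wrongly flagged.
print(positive_predictive_value(95, 5))  # 0.95
```

Unlike overall accuracy, this number drops directly as genuine answers get wrongly removed, which is the error we most want to avoid.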
Markov chain approach

A quick Google search shows several prior implementations of gibberish detectors, mostly based on an algorithm by Rob Renald. It works by splitting a sentence into character transitions and measuring the probability of each transition appearing.
The advantage of this approach is that it's inherently unsupervised. The algorithm works out what English looks like by itself, without anyone labelling sentences as English or not. This means the large training text can easily be swapped for one in another language, and the algorithm should still work.
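The character-transition idea can be sketched roughly as follows. The tiny training corpus, the floor probability for unseen transitions, and the function names are all illustrative assumptions, not the actual implementation:

```python
import math
from collections import defaultdict

def train_transitions(corpus):
    """Count character-to-character transitions, then convert to log-probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        text = text.lower()
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    log_probs = {}
    for a, following in counts.items():
        total = sum(following.values())
        log_probs[a] = {b: math.log(c / total) for b, c in following.items()}
    return log_probs

def avg_log_prob(text, log_probs, floor=math.log(1e-6)):
    """Average transition log-probability of a string; low values suggest gibberish."""
    text = text.lower()
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return floor
    return sum(log_probs.get(a, {}).get(b, floor) for a, b in pairs) / len(pairs)

# A toy corpus; the real algorithm trains on a large body of English text.
corpus = ["the product tastes great", "i would buy this again",
          "great taste and value", "this advert made me laugh"]
model = train_transitions(corpus)
print(avg_log_prob("great taste", model) > avg_log_prob("xqzjw vbnk", model))
```

English-like text scores well because its character pairs were seen in training, while keyboard-mashing hits the unseen-transition floor.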
One of the big issues with Markov chains is that they are extremely easy to run backward to generate what they consider ideal input. In this case, that gives us gibberish the algorithm thinks is English, examples being "sh youndis of mand" and "the the the the the".
Given this problem, we wanted to explore other options to see if a more robust solution was possible. After playing around with several variations of the Markov chain solution, we realized it had several disqualifying disadvantages. Ultimately, we decided to try another model.
Bag of words/vector model
At this point we decided it might be beneficial to move away from Markov chains and attempt a more natural-language-processing approach: Bag of Words (BOW). This method creates a vector representing each open-ended answer. Sentence vectors can also be constructed in a more involved way, known as TF-IDF vectorization, which weights a word lower the more answers it appears in.
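A minimal sketch of the bag-of-words step, with illustrative answers and helper names (a real pipeline would use a library vectorizer, likely with the TF-IDF weighting mentioned above):

```python
from collections import Counter

def build_vocabulary(answers):
    """Sorted list of every distinct word across all answers."""
    return sorted({word for answer in answers for word in answer.lower().split()})

def to_vector(answer, vocab):
    """Represent a single answer as counts of each vocabulary word."""
    counts = Counter(answer.lower().split())
    return [counts[word] for word in vocab]

# Toy open-ended answers, including one gibberish response.
answers = ["great advert", "great taste great value", "asdf qwer"]
vocab = build_vocabulary(answers)
print(vocab)
print(to_vector("great taste great value", vocab))
```

Each answer becomes a fixed-length vector over the shared vocabulary, which a standard classifier can then consume.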
Although the BOW approach outperforms the Markov model, it has similar pitfalls to the Markov method and fails on a few similar edge cases. It performs particularly badly on gibberish sentences that are long and have interspersed vowels, e.g. 'sectiong i har savanywhis'.
We then moved to a neural net approach.
Neural net approach
Inspired by Jason Brownlee's blog, we used a similar network architecture to classify character sequences as gibberish. However, because open-ended comments are typically short, instead of whole words we used groups of three letters.
We realized our problem came down to the fact that neural networks are very sensitive to what's known as "class imbalance". In the training set, gibberish responses were by far in the minority, which led the network to prefer non-gibberish. To rectify this, we used a technique called Synthetic Minority Over-sampling Technique, or SMOTE. This artificially creates new gibberish data points for us, based on the existing gibberish.
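The core SMOTE idea, synthesizing new minority samples by interpolating between existing ones, can be sketched on toy 2-D points like this. Real work would use a library such as imbalanced-learn, and real SMOTE interpolates toward one of the k nearest neighbours rather than only the single nearest:

```python
import random

def smote_sketch(minority_points, n_new, seed=0):
    """Synthesize n_new points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority_points)
        # nearest other minority point (brute force; real SMOTE uses k nearest neighbours)
        b = min((p for p in minority_points if p is not a),
                key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
        t = rng.random()  # random position along the segment between a and b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Toy 2-D feature vectors standing in for the minority (gibberish) class.
gibberish_points = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.95)]
new_points = smote_sketch(gibberish_points, n_new=5)
print(new_points)
```

Because each synthetic point lies between two real gibberish points, the new data stays inside the minority class's region of feature space rather than being random noise.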
However, this approach still suffered from the same weaknesses as the Markov and BOW models. In hindsight, as they all rely solely on sequences of characters, this should have been expected.
We believe that with additional training data this might be overcome, especially if we supplement with additional features. Sadly, we simply do not have the resources to obtain labeled training data for all the languages we would need to support in large enough quantities.
Gibberish response survey approach
An idea for another approach came out of a discussion about how we could exclude brand names from gibberish detection. The idea was to look across an entire survey and compare responses: it is extremely unlikely that the same gibberish response will be repeated by several respondents. If we add a dictionary of common words, then by flagging responses where a threshold percentage of words appear in neither the dictionary nor other responses, we should be able to detect most, if not all, gibberish.
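A rough sketch of this survey-wide filter, with a toy dictionary, made-up responses, and an illustrative 50% threshold (none of which are our production values):

```python
from collections import Counter

# Toy stand-in for a real dictionary of common words.
COMMON_WORDS = {"the", "i", "it", "great", "taste", "good", "like", "really"}

def is_gibberish(response, survey_word_counts, threshold=0.5):
    """Flag a response when too few of its words are 'known'.

    A word is known if it is in the common-word dictionary, or if it appears in
    more than one response across the survey (which lets brand names through)."""
    words = response.lower().split()
    if not words:
        return True  # treat empty responses as unusable
    known = sum(w in COMMON_WORDS or survey_word_counts[w] > 1 for w in words)
    return known / len(words) < threshold

# "zappico" plays the role of a brand name absent from the dictionary.
responses = ["i really like the taste", "zappico taste is great",
             "asdfjkl qweqwe", "zappico is good"]
survey_word_counts = Counter(w for r in responses for w in r.lower().split())
print([is_gibberish(r, survey_word_counts) for r in responses])
```

Note how "zappico" survives despite not being a dictionary word, because several respondents typed it, while the one-off keyboard mash is flagged.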
With an accuracy of ~98%, it's hard to argue with its results as a filter. What's more, all of its false positives were misspelt single-word responses, where even a human would struggle to extract meaning.
The major flaw of this method is that it requires the full set of responses. This means we cannot reject people writing gibberish while they are taking the survey; we can only filter them out after the fact.
However, despite its flaws, it holds its own against the other methods. With dictionaries for internationalization readily available, and with the method hitting the target PPV, we decided to go with it as a first version.
While more advanced machine learning techniques showed promise, ultimately the overhead of obtaining training data, and the specific context of our problem, lend themselves to simpler solutions. By using the unique context of our problem rather than throwing generic machine learning tools at it, we were able to provide a solution that could be reasoned about easily and didn't require labelled training data, which would have been time-consuming and costly to obtain.
Although the chosen method only offers an after-the-fact solution to removing gibberish, in future we can use it to tag historical data, which could overcome the current constraints of the machine learning methods by providing more training data. This leaves the door open for more powerful methods while providing value to our partners and clients immediately.
Other benefits machine learning will have on market research automation in the future
As tech advances, machine learning is becoming a huge part of market research. Here are some other thoughts on how machine learning will benefit market research in years to come.
- Using data from previously hard-to-assess and disparate sources, such as tweets or reviews.
- Removing menial tasks from market research, such as code framing.
- Allowing more frequent validation of assumptions by making testing cheap and easily available.
Learn more about how tech can power effective product messaging all the way from ideation to creative production with our guide here.