
The 4th Annual Text Analytics Summit


By Frank Diana - Posted on 19 June 2008

I had the opportunity to attend the 4th Annual Text Analytics Summit this week in Boston. It was well attended, with representation from the vendor community, users, and industry analysts. Seth Grimes did a great job moderating the session and providing insight into the world of text analytics. I’d say my biggest take-away from the Summit is that text analytics is moving mainstream.

A couple of other observations: “Voice” is a dominant theme – whether it’s voice of the customer or voice of the market, there is a great deal of focus on understanding that voice. I would personally like to see more activity in other applications of text analytics. There are so many other business problems to solve with this technology. One big one that I expect to see more activity around is fraud detection in the insurance industry. Sentiment analysis was the capability that was discussed the most. There are obviously many ways to apply sentiment analysis to create value, and Lexalytics and Jodange had very impressive products in this space.

There were a number of great products on display at the Summit. Mainstays like Attensity, IBM, Clarabridge and others showed off their capabilities. The exhibit area was crowded for the full two days. The presentations were packed with information and very engaging. As I mentioned above, I’m seeing a greater awareness of text analytics in the insurance industry as it relates to fraud detection. The ROI associated with finding fraud earlier in the process is very compelling. I’d be very interested in your thoughts regarding your uses of text analytics.

Here’s a list of possible solution areas that we’re seeing: Voice of the Employee, Customer, Community and Market; Patient Safety, Drug Discovery, Clinical Analysis; Law Enforcement, Intelligence Analysis; Litigation Support / eDiscovery; SOX Compliance, Corporate Governance; Investment Analysis; Marketing Campaign Analysis, Advertising Analysis; Claims Analysis, Warranty Analysis; Product Innovation; Reputation Management; and Intelligent Messaging. Let me know what you are seeing and if you have any experiences with any of the above that you can share with us.

Interesting discussion! I wanted to make two quick comments regarding the two issues raised thus far.

For any business there is nothing as important as understanding customers. Of course there are subtle differences between text analytics and data mining, but relying on statistics is one common theme, if not the only common theme. My take on sarcasm’s impact on sentiment analysis is that it could be important if, and I believe only if, one segment of customers uses sarcasm more than other segments: for example, if customers who are unhappy about the speed of a service are more sarcastic than happy customers! I firmly believe the way humans communicate is more of a personal characteristic than it is a response to external factors (such as the speed of service they receive from company x). If statistics is the basis for analyzing customer feedback, then the same percentage of sarcastic voices will exist in all customer segments, regardless of how customers are segmented, and if you were to incorrectly capture sarcasm it should not skew results!
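The argument above can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: the segment names, the true positive shares, and the sarcasm rates are all hypothetical, and it assumes every sarcastic comment has its polarity read backwards by the tool.

```python
def observed_positive_rate(true_positive_rate, flip_rate):
    """Share of comments scored positive when a fraction `flip_rate`
    of all comments (e.g. the sarcastic ones) is read with the wrong polarity."""
    read_correctly = true_positive_rate * (1 - flip_rate)    # positives kept as positive
    misread_negatives = (1 - true_positive_rate) * flip_rate  # negatives scored as positive
    return read_correctly + misread_negatives


# Hypothetical segments and their true share of positive comments.
segments = {"fast_service": 0.80, "slow_service": 0.40}

# Case 1: the same sarcasm rate (10%) in every segment.
uniform = {name: observed_positive_rate(p, 0.10) for name, p in segments.items()}

# Case 2: the unhappy segment is more sarcastic (30% versus 10%).
uneven = {
    "fast_service": observed_positive_rate(0.80, 0.10),
    "slow_service": observed_positive_rate(0.40, 0.30),
}

for label, result in (("uniform", uniform), ("uneven", uneven)):
    gap = result["fast_service"] - result["slow_service"]
    print(f"{label}: {result} gap={gap:.2f}")
```

Under this simple model, a uniform sarcasm rate keeps the segments in the same order but compresses the gap between them (here from 0.40 to 0.32), while an uneven rate distorts the comparison further (the gap falls to 0.28). That is consistent with the point that sarcasm matters most when one segment uses it more than another.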

The second issue raised was entity extraction, and I’ll combine it with the issue of combining unstructured data (text) with structured data (the rest of the customer record). In my opinion no text analytics solution is complete without taking advantage of both kinds of data! For decades we IT practitioners have had to rearrange data around the relational model, which is great for the efficiency of storage and retrieval. But is it that great for analysis as well? I dare to say NO! Most vendors I know provide solutions that convert (abstract) textual data into structured data (triples in most cases), and hence lose a great deal of insight. How about a reverse, and more natural, approach of converting everything into text? We need solutions to analyze data in context without the artificially imposed abstraction.

Interesting discussion. Providing structure to unstructured data has the side effect of taking information out of context. Analysis is dynamic and changes as the user interacts with the system. Thus keeping data where it is and analyzing it as it is will get us a higher degree of accuracy in the end. Of course, for some applications, extraction would still be needed.

As a matter of fact, if we analyze unstructured data as it is, we can also include structured data along with unstructured data for analysis, i.e., we do the reverse of the traditional technique. A big advantage is getting away from writing SQL and moving to user-friendly English and search-like syntax. Another advantage is data fusion, where we can combine data from multiple sources for the purposes of analysis without writing complicated SQL.

This approach works very well for fraud and subrogation, where most of the data needed to detect fraud is in notes, documents, and third-party databases.

While it is true that state-of-the-art sentiment analysis may not be able to detect sarcasm...who cares?

In a set of several thousand or million comments where you are trying to aggregate the “voice”, how many of them will include sarcasm? One percent? Ten percent?

This type of analysis is about identifying general trends and/or big problems, not making sure each and every individual comment is extracted perfectly.

If you worry about dealing with sarcasm, you are focused on the technology NOT the business problems people are trying to solve. The technology for text analytics still outpaces the market’s and companies’ willingness, ability, and desire to adopt the technology. Plenty of really important and impactful insights can be found w/o dealing with sarcasm.

The fact that people seem myopically focused on some fringe case shows that this market is in fact NOT moving into the mainstream. Instead I think text analytics (a technology that’s been around for years and could be valuable to organizations) will continue to mature but remain niche, just like data mining.

By asking what percentage of comments are problematic, your reply implies that you don't know how big the problem is. I would say that being concerned with it is not myopic. If you are going to make decisions based upon this information, you had better be sure you know what the problems are with these tools and what the scope of error is. I won't say if it's 1% or 50% because I have no idea and, it seems, neither do you. That is not a slam, just the reality of it. Without a measure of how well a system is picking up sentiment, what are customers really buying when they get a sentiment analysis system? This goes beyond sarcasm to measuring all types of sentiment.

The business problem is determining the general direction of sentiment with a breakdown of possible drivers. However, the professional who wants to rely upon this information needs to understand more about how this direction is derived and what the likely confidence is in the system's results. I would say back at you that ignoring that is myopic and irresponsible.

When tools can prove their worth with strong, measurable quality of analytics then the market will become mainstream. The big win in this field that I saw at the Text Analytics Summit was the Gaylord Entertainment presentation. That was a phenomenal demonstration of how these tools should work from data input to business decision to final determination of the effect of the decision.

The systems I build for the military all require specific measurement of their quality. While the military doesn't always know how to properly measure the results, they at least make the attempt to do so. Sentiment analysis is critical in today's world, where our wars are mainly about winning hearts and minds. To get it wrong would be a huge mistake that would cost lives and treasure.

As the vendor in the space that most relies on sentiment analysis as the basis of its business, I get to see this discussion thread quite often and in many different places.
If you're a PR firm, the truth is that you care about both the big trends and every little post, because your customers will often hit you with very specific examples where they're receiving poor coverage, and if an automated tool misdiagnoses the sentiment it matters. Other applications can take the long view. We built a travel system a couple of years ago that took hotel reviews to build a "consensus" opinion of each hotel based on the comments about that hotel. In this case a single mistake really didn't affect the general opinion about the hotel.
All this being said, Lexalytics and the other vendors are putting a lot of energy and effort into measuring sentiment at the entity level and into dealing with sarcasm. The technology doesn't yet handle sarcasm, but don't be surprised if it does within a year.
A general observation, which I'll discuss more in our blog, is that sentiment is still a bit mystical to many folks, and they generally focus on the wrong piece. It's entity sentiment that matters; the tone of a document rarely has much business value (there are exceptions). It's the people, products and companies discussed in a story or post that matter, and that's the spot where you need to measure the sentiment.
Anyway, it was a great show with a lot of energy. I'm glad we were there.
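
The hotel-review point above, that a single misread review barely moves a consensus opinion, can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the +1/-1 scoring scheme, the review counts, and the function name `consensus_score` are hypothetical and do not describe the actual travel system.

```python
def consensus_score(review_scores):
    """Mean of per-review sentiment scores (+1 positive, -1 negative)."""
    if not review_scores:
        raise ValueError("at least one review is required")
    return sum(review_scores) / len(review_scores)


# Hypothetical hotel: 100 reviews, 80 positive and 20 negative.
reviews = [1] * 80 + [-1] * 20
true_score = consensus_score(reviews)

# The tool misreads one positive review as negative.
misread = list(reviews)
misread[0] = -1
observed_score = consensus_score(misread)

print(f"true={true_score:.2f} observed={observed_score:.2f}")
```

With this scoring, one flipped review moves the mean by at most 2 divided by the number of reviews, so the effect shrinks as the volume of reviews grows. A PR firm looking at one specific post gets no such protection, which is the contrast drawn above.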

Voice or sentiment analysis was definitely a hot topic, but it is also a very difficult problem to solve. With most vendors only offering basic entity extraction with RDBMS storage, I am not sure the scoring algorithms I saw were really all that robust or accurate. This theme was picked up by the Text Technologies blog. With idiomatic expressions (pretty ugly), sarcasm (I just LOVE slow websites) and the fact that words have multiple meanings, you start to see the problem.

The real question I have is: can anyone come up with a quality of analytics (QoA) measure to see how well the various solutions do at detecting sentiment, weeding out non-credible comments (such as spam attacks on online comment pages) and understanding sarcasm? I think if we start to develop a series of QoA tools then the whole market will be improved.
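
One possible starting point for such a measure, offered only as a sketch, is to compare a tool's output with a set of comments labeled by human reviewers and report precision, recall and F1 for each sentiment label. The function name `sentiment_quality`, the three labels, and the sample data below are hypothetical; a real evaluation would also need a much larger labeled sample and some agreement between reviewers on the labels.

```python
from collections import Counter


def sentiment_quality(gold, predicted, labels=("positive", "negative", "neutral")):
    """Per-label precision, recall and F1 of predicted labels against human labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # tool claimed label p wrongly
            false_neg[g] += 1  # tool missed the real label g
    report = {}
    for label in labels:
        claimed = true_pos[label] + false_pos[label]
        actual = true_pos[label] + false_neg[label]
        precision = true_pos[label] / claimed if claimed else 0.0
        recall = true_pos[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical sample: the third comment is sarcastic ("I just LOVE slow
# websites"), labeled negative by a human but scored positive by the tool.
gold = ["positive", "negative", "negative", "neutral", "positive", "negative"]
pred = ["positive", "negative", "positive", "neutral", "positive", "negative"]

report = sentiment_quality(gold, pred)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Reporting the scores per label, rather than a single accuracy figure, shows where a tool goes wrong: in this sample the one sarcastic comment lowers precision for "positive" and recall for "negative".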

Overall the summit was very good and I enjoyed the Enherent booth. It was nice meeting you folks.
