Category Archives: Enhanced Voice

Surveys and Mediocrity

In January, Kathy Sierra wrote a blog entry about how companies fail by choosing not to innovate. The supposition is that companies that are overly risk-averse eventually fail. Though the article wasn’t necessarily groundbreaking, one graphic immediately caught my eye. It showed a continuum running from love to hate, with a grey area in the middle. The point was that a company can be loved or hated and still be successful, but those stuck in the middle, the “zone of mediocrity”, are, to use her word, “screwed”.

While architecting and implementing IVR and other voice applications, one of the most common applications I was asked to design was the post-call automated survey. These applications were triggered after an agent hung up and either transferred the caller directly to the survey or placed an outbound call back to the original caller. The idea was that surveying the caller immediately after the call ended would accurately capture the caller’s experience.

Most of the time, I argued against implementing these surveys.

Contact center managers feel that these surveys can be used to track an agent’s performance: with the looming presence of an automated survey, the agent would always have to perform admirably to keep from getting low marks. The problem is that the results rarely represent real life.

That same graphic also represents the reality of survey respondents. People who have a polarizing experience, either excellent or horrible, tend to respond to surveys. Those who felt the experience was merely sufficient don’t feel the need to respond. Call center managers believe that you can use incentives to get responses, but doing that skews the responses towards the positive side.
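To make that skew concrete, here is a purely illustrative simulation; the score mix and response rates below are invented for the example, not drawn from any real call center data.

```python
import random
from collections import Counter

random.seed(0)

# Assumed mix of how calls actually go (1 = horrible ... 5 = excellent) and the
# assumed likelihood that a caller with that experience bothers to take the survey.
TRUE_MIX = {1: 0.05, 2: 0.10, 3: 0.60, 4: 0.20, 5: 0.05}
RESPONSE_RATE = {1: 0.60, 2: 0.20, 3: 0.05, 4: 0.20, 5: 0.60}

calls = random.choices(list(TRUE_MIX), weights=list(TRUE_MIX.values()), k=10_000)
responses = [score for score in calls if random.random() < RESPONSE_RATE[score]]

all_counts, resp_counts = Counter(calls), Counter(responses)
for score in sorted(TRUE_MIX):
    print(f"score {score}: {all_counts[score] / len(calls):6.1%} of calls, "
          f"{resp_counts[score] / len(responses):6.1%} of survey responses")
# The "it was fine" middle dominates the real call volume but nearly vanishes
# from the survey; add a response incentive and the skew shifts further positive.
```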

The other problem is that a caller’s happiness is rarely based on the agent and is more a result of the call’s disposition. If I call to get a rental car and there are none available, it doesn’t matter how nice the agent on the phone is; I won’t be satisfied. At the other extreme, an unremarkable agent who, by luck, finds that elusive rental car after I’ve struck out at three other agencies will get a very high rating.

That’s why, when a call center wants to implement post-call surveys, it’s critical to understand the motivation behind them. If they’re looking to do trending, comparing results before and after implementing a new process or program, that’s fine. If it’s part of an overall call center “health check”, that’s OK as well. But if the surveys are being used for employee monitoring, training or “compliance”, then I’d have to advise against it.

Strong Auth Drives Conversational Access

When I’m wearing my analyst hat, I’m constantly asked if “this is the year for…” Is it the year for VoiceXML? The year for Speech Recognition? The year for speaker verification/voice biometrics? The year for VoIP? For the past year, I’ve answered every question the same way: “2007 should be a big year,” because the robustness of the technology, combined with the maturity of the vendors in the Conversational Access Technologies (CAT) arena, points toward the adoption of all of these technologies.

I still believe that 2007 is the year when we turn that corner, hit the end of the runway and take off, cross the chasm, and live up to every other business cliché that describes what happens when the latent need for solutions breaks through the fear of being an early adopter and sales start to ramp up. However, next year’s growth is not due to the technology or the vendors, or even cost avoidance. Next year’s growth will be driven by federal mandates such as the FFIEC guidance.

The first generation of Conversational Access Technologies was found in the financial industry, which brought us the first widespread use of IVRs for credit card self-service. It’s the financial industry that will also drive adoption of the next generation of CAT.

The trigger, as mentioned in previous reports and advisories authored by (and for) Opus Research, is the FFIEC guidance. The guidance stated that in 2006, financial institutions needed to implement multi-factor authentication for the web. In 2007, this extends into telephony channels as well.

Early implementers of multi-factor security at banks primarily went down one of two paths: one-time-password (OTP) generating tokens and shared secrets.

One-time-password generating tokens were an obvious choice for many banks, as they have been used internally for years to restrict access to secure platforms. Solutions such as RSA’s SecurID and Verisign’s VIP generate a new numeric PIN every 60 seconds. A user logs into a website with a UserID and password, then enters the generated PIN to get access. It’s a very straightforward solution, though it has been considered expensive since each user needs their own token to display the one-time password. RSA is considered the market leader in hardware-based OTP technology.
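To make the mechanics concrete, here is a minimal sketch of a generic time-based one-time password in the spirit of the open HOTP/TOTP approach; it is not RSA’s or Verisign’s proprietary algorithm, and the shared key in the example is hypothetical.

```python
import hmac
import hashlib
import struct
import time

def one_time_pin(secret_key: bytes, period: int = 60, digits: int = 6) -> str:
    """Derive a short numeric PIN from a shared secret and the current time slot."""
    counter = int(time.time()) // period              # same value on token and server
    msg = struct.pack(">Q", counter)                  # 8-byte big-endian counter
    digest = hmac.new(secret_key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                        # dynamic truncation (as in HOTP)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

# The server recomputes the PIN for the current (and adjacent) time slots and
# compares it with whatever the user typed on the keypad or spoke into the IVR.
shared_secret = b"per-user-secret-provisioned-with-the-token"   # hypothetical key
print(one_time_pin(shared_secret))
```

The key never travels with the login; only the short-lived PIN does, which is what makes the token a second factor.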

Shared “secret” information makes up the other predominant solution for handling verification. There are three major categories of shared secrets (a minimal sketch of the basic challenge-and-verify flow follows the list):

  • Self-Supplied Secrets: at the point of registration, the system asks you to answer a number of questions (What city were you born in? What is your favorite color?), and at login you will be asked to answer one or more of them.
  • Historical Data: the system uses historical information, ranging from “what was the amount of your last deposit” to “when did you pay off your car loan” to “what was your address in January 2001”, gleaned from a number of public and internal databases. You don’t pre-answer any questions.
  • Photo Preferences: also pushed to market by RSA as a result of its PassMark acquisition, this method has you pick a preferred photo out of a selection of up to thousands; at login, you have to select that same photo again.
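As promised above, here is a minimal sketch of the self-supplied-secret flow, under the assumption that answers are normalized and stored as salted hashes; the question text, helper names and storage format are illustrative, not any vendor’s implementation.

```python
import hashlib
import os
import secrets

def _normalize(answer: str) -> str:
    """Trim, lowercase and collapse whitespace so 'Spring Field ' matches 'spring field'."""
    return " ".join(answer.strip().lower().split())

def enroll(answer: str) -> tuple[bytes, bytes]:
    """At registration: keep only a random salt and a hash of the normalized answer."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + _normalize(answer).encode()).digest()
    return salt, digest

def verify(stored: tuple[bytes, bytes], attempt: str) -> bool:
    """At login: re-ask the question and compare hashes in constant time."""
    salt, digest = stored
    candidate = hashlib.sha256(salt + _normalize(attempt).encode()).digest()
    return secrets.compare_digest(candidate, digest)

# One of possibly several enrolled question/answer pairs ("What city were you born in?").
record = enroll("Springfield")
assert verify(record, "  springfield ")     # tolerant of case and spacing
assert not verify(record, "Shelbyville")
```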

The failure of shared “secret” information is that it is rarely secret, and more importantly, the more this “secret” information is used, the less secure it becomes.

For example, when it comes to self-supplied secrets, the most common questions are easily answered from the web or from publicly accessible databases: birthdate, mother’s maiden name, pet’s name, and so on. How did Paris Hilton’s Sidekick get “hacked”? Someone figured out that she used her dog’s name. More important is the fact that most websites and services ask for the same information: birthdate, street where you grew up, mother’s maiden name, favorite pet. That means your information is more and more out in the open. Historical data is also challenging: when I went online to request a copy of my credit report, it took me five minutes to figure out whether I had ever had a student loan from a specific bank, because the lender had changed multiple times through consolidations and one bank selling the loan to another.

However, the largest challenge with shared “secret” information is that it is really only applicable to the web. Securing a phone transaction with a picture is ineffective, and speaking freeform text to answer a historical or shared-secret question isn’t technologically feasible. The only option would be to present multiple choices for the user to answer, but best practices and common sense rule out any security method where a potential answer is given at the time of the challenge.

This is why 2007 becomes the year of the CAT.

The mandate for banks to implement multi-factor authentication for the web left the field wide open for vendors to propose “creative” solutions to achieve FFIEC compliance. However, once voice is thrown into the mix, the list drops dramatically. Over the phone, there are only two input modes available: touchtone and speech. That leaves two methods for strong authentication: speaker verification and one-time PINs.

Now, though I am the CTO of a voice biometrics firm, from a functionality perspective both solutions solve the problem. A one-time numeric PIN generated by a hardware token and a voiceprint both satisfy the requirements for multi-factor authentication.

More importantly, both solutions can be easily implemented for web and voice, assuming that the bank has a strategy for implementing a well-thought-out CAT infrastructure.

Implementing one-time numeric PINs for voice follows the web model in a CAT environment. The voice application asks the user for the OTP, passes it to the appropriate authentication system, and gets back a response indicating whether the user passed or failed the authentication request. Since the OTP system is already integrated with a web process (typically a web service/SOAP call), the voice application can make the same call (simplified with the use of a VoiceXML 2.1 data request) and parse the same response to grant access.
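Here is a rough sketch of that shared call from the application side; the endpoint URL, parameter names and JSON response shape are assumptions for illustration, standing in for whatever validation interface the bank’s OTP vendor actually exposes.

```python
import json
import urllib.parse
import urllib.request

AUTH_SERVICE_URL = "https://auth.example-bank.com/otp/validate"   # hypothetical endpoint

def validate_otp(user_id: str, otp: str) -> bool:
    """POST the user's ID and one-time PIN; return True if the back end accepts them."""
    payload = urllib.parse.urlencode({"userId": user_id, "otp": otp}).encode()
    with urllib.request.urlopen(AUTH_SERVICE_URL, data=payload, timeout=5) as resp:
        result = json.load(resp)                       # e.g. {"authenticated": true}
    return bool(result.get("authenticated"))

# The web login handler and the voice application (which collected the PIN via
# touchtone or speech) both funnel through this same service, so the bank keeps
# one authentication policy for both channels.
```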

Implementing voice biometrics for a web process, however, is more challenging, though still easily handled. The typical process, as shown by vendors ranging from Authentify to VxV Solutions (my company), has the web user, after starting the login process, call a phone number and authenticate his or her voice, then either receive a one-time PIN (also called a soft OTP) to type in or get redirected back to the web application after passing the biometric claim.
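A minimal sketch of how a web session might be tied to that out-of-band voice verification follows; the in-memory session store, function names and polling approach are simplifying assumptions, not a description of Authentify’s or VxV’s actual products.

```python
import secrets

# Stand-in for a shared session store (a database or cache in a real deployment).
pending_sessions: dict[str, dict] = {}

def start_web_login(user_id: str) -> str:
    """Web side: create a pending session and show the caller a number to dial."""
    session_id = secrets.token_urlsafe(16)
    pending_sessions[session_id] = {"user": user_id, "verified": False}
    return session_id

def voice_channel_result(session_id: str, voiceprint_passed: bool) -> None:
    """IVR side: invoked after the speaker-verification engine accepts the claim."""
    if voiceprint_passed and session_id in pending_sessions:
        pending_sessions[session_id]["verified"] = True

def web_poll(session_id: str) -> bool:
    """Web side: poll (or get pushed) the verification result to finish the login."""
    return pending_sessions.get(session_id, {}).get("verified", False)

# Flow: the web page shows a session reference, the caller confirms it over the
# phone, the biometric engine verifies the voice, and the web session is released.
# Reading a spoken soft OTP back to the caller to type into the page is an
# equivalent variant.
sid = start_web_login("alice")
voice_channel_result(sid, voiceprint_passed=True)
assert web_poll(sid)
```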

In both cases, the key is that the bank can now standardize its processes for handling both web and voice transactions. However, standardizing the processes doesn’t necessarily mean standardizing the method. It is expected that banks, and enterprises in general, will support multiple authentication methods based on the user’s needs and status. For example, shipping an OTP token with the bank’s name engraved on the back may cost upwards of $30 per user, but for key clients the cost may be mitigated by the fact that it is a very fast way to log in. Conversely, a voice biometric solution is typically much cheaper, though less convenient for web users because it requires a phone call to enable the web session.

What we should expect is the growth of a new range of multi-factor brokerage services, such as Ping Identity’s PingLogin, designed to let a user select the preferred method of providing multiple factors. Consider a preferred bank customer: he (or she) may have an OTP token provided by the bank and a fingerprint scanner at home, and the bank may also have enrolled a voiceprint. When the user logs into the website from work, he could use the voiceprint or the OTP; when calling in, he could use the same voiceprint or OTP; but when logging in from home, all three methods could be used, with the fingerprint most likely being the fastest and least obtrusive.

Instead of integrating each of these solutions into the voice and web applications, and requiring separate dedicated logic, the authentication broker would simply determine which methods are available, which can be used based on the mode, and then allow the user to select the method of his or her choice.
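A toy sketch of that brokering decision is below; the channel labels, enrolled-factor names and support matrix are invented for illustration and have nothing to do with PingLogin’s actual interfaces.

```python
# What this (hypothetical) user has enrolled with the bank.
ENROLLED = {"otp_token", "voiceprint", "fingerprint"}

# Which factors are usable on which channel: a photo can't secure a phone call,
# and the fingerprint reader only exists on the home PC.
CHANNEL_SUPPORT = {
    "web_work": {"otp_token", "voiceprint"},
    "voice":    {"otp_token", "voiceprint"},
    "web_home": {"otp_token", "voiceprint", "fingerprint"},
}

def available_methods(channel: str, enrolled: set[str] = ENROLLED) -> set[str]:
    """The broker intersects what the user has enrolled with what the channel
    supports, then lets the user pick from the result."""
    return enrolled & CHANNEL_SUPPORT.get(channel, set())

print(available_methods("web_home"))   # all three factors are offered
print(available_methods("voice"))      # OTP or voiceprint only
```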

Again, the benefit of this type of broker increases dramatically with the implementation of CAT. Common SOAP interfaces and easy integration into voice and web applications allow for this kind of flexible, multi-factor authentication.

If 2007 is the year that CAT turns the corner, or crosses the chasm, or whatever we’re calling it these days, then I’m looking to 2008 to be the year of federated security. You can’t have all of these banks investing in strong, multi-factor authentication without someone finding a way to monetize those implementations, and leveraging these internal identity databases and authentication methods points toward FFIEC-compliant banks becoming independent, trusted Identity Providers (IdPs). The currently blog-centric OpenID movement shows the beginnings of a decentralized security model in which a user could use an identity at their bank to get into their healthcare account, or into their cable provider’s site to see the latest bill. Adding trusted Identity Providers helps move the focus of OpenID from blogs to transactional accounts such as banking and finance.

The Killer Smartphone App

The rumor mill is burning the midnight oil regarding the eternally impending release of Apple’s smartphone offering: the iPhone (or iChat AV Mobile, depending on which site you consider canon for scuttlebutt such as this). This possibly revolutionary device is thought to be the ultimate merger of media, data and telephony, offering full iPod music playback and synchronization with iCal, Address Book, Mail and .Mac. It’ll support Cingular, or T-Mobile, or all of the mobile carriers, and it will (or won’t) support 3G data networks… again, depending on the rumor source you trust.

But in response to all of the chatter about the media and synchronization capabilities, there is a true killer app that, if implemented, would make the iPhone the hands-down best smartphone in existence.

What I’m looking for (along with most of the smartphone community) is a smartphone with the emphasis on phone. As in: I primarily want to be able to make and receive calls.

Eleven months ago, I picked up an o2 Mini S on eBay. In the US, it’s sold as the Cingular 8125, T-Mobile MDA or i-Mate k-jam. This device was touted by the telephone rags as the best fusion of PDA and phone: a form factor small enough to carry comfortably, a full day’s worth of charge, a slide-out QWERTY keyboard and a touchscreen. For the first few months, I was generally happy with the device. I could sync it to my office PC and home Mac, the Bluetooth worked fine, and the text messaging capabilities were pretty darned solid. The music and video playback was smooth, and the camera, for a multifunction device, was better than average.

However, what I quickly started to notice is that the phone functions weren’t just sub-par, they sucked.

The problem isn’t endemic to just this one device, but to Windows Mobile devices in general: the telephony functions are constantly fighting for CPU cycles with the general PDA functions. The result manifests itself in many ways, such as not having the resources to make the phone ring, or not having the resources to register my pressing the “answer” button on an incoming call. This week alone, I’ve missed three calls simply because the phone didn’t respond when I pressed answer, and it isn’t just me. This problem impacts most of the touchscreen Windows Mobile 5 devices.

The biggest problem for me, however, is voice stability. As my work is primarily in speech technologies, I can easily tell how well a telephone device encodes an audio stream. HTC, the manufacturer of most non-Samsung/Motorola Windows Mobile devices, has already crippled these phones with a very poor microphone: a pinhole style that, even under the best conditions (as tested with a simple audio recording application), creates poor-quality recordings. This is further complicated by a core Windows Mobile issue.

It seems that Windows Mobile has made the audio encoder (codec) just another software process that has to fight with the other applications for CPU cycles. When the CPU is occupied with other tasks, the encoding is crippled to the point that simple speech recognition can be dramatically impacted. Again, this was simple enough to test: I called into one of my systems with no other apps running, then with email running, then with email and a video playing. Overclocking the CPU helps, but not consistently and not enough.

Though this problem impacts my WM5 device, some version of it impacts every smartphone out there. Count the number of Blackberry users who also carry standard cell phones because the phone experience on those devices isn’t very good. Treo users seem to have it best, but it’s still less than great.

It seems that if you want a successful smartphone, it needs to be just that: a smart phone. Apple has the marketing and the mindshare, but can they actually create a compelling device?