Illuminating (and Eliminating) Fraud using Dark Data

Stephen DeAngelis

March 7, 2013

“We live in a highly connected world where every digital interaction spawns chain reactions of unfathomable data creation,” writes Paul Doscher, CEO of LucidWorks. “The rapid explosion of text messaging, emails, video, digital recordings, smartphones, RFID tags and those ever-growing piles of paper – in what was supposed to be the paperless office – has created a veritable ocean of information.” [“Searching for Dark Data,” SiliconANGLE, 11 February 2013] Doscher looks into the abyss of accumulating data and sees more dark data than illuminating insights. He continues:

“Welcome to the world of Dark Data, the humongous mass of constantly accumulating information generated in the Information Age. Whereas Big Data refers to the vast collection of the bits and bytes that are being generated each nanosecond of each day, Dark Data is the enormous subset of unstructured, untagged information residing within it. Research firm IDC estimates that the total amount of digital data, aka Big Data, will reach 2.7 zettabytes by the end of this year, a 48 percent increase from 2011. (One zettabyte is equal to one billion terabytes.) Approximately 90 percent of this data will be unstructured – or Dark. Dark Data has thrown traditional business intelligence and reporting technologies for a loop. The software that countless executives have relied on to access information in the past simply cannot locate or make sense of the unstructured data that comprises the bulk of content today and tomorrow. These tools are struggling to tap the full potential of this new breed of data.”

Doscher isn’t criticizing the software tools of the past; he’s simply pointing out that new tools are needed to probe the depths of dark data. In the ocean of data that he describes, old software tools are like the fins and air tanks of a scuba diver. What he believes is needed is a deep-diving submersible like Alvin. Fortunately, he writes, “The good news is that there’s an emerging class of technologies that is ready to pick up where traditional tools left off and carry out the crucial task of extracting business value from this data.” He explains:

“The challenge isn’t simply how best to house the data; rather, it’s how companies should go about both searching the different types of integrated data – structured, semi-structured and unstructured – to discover patterns and insights and then analyzing these found data patterns in order to make better business decisions. The functionality that drives the search, discovery and analysis capabilities is rooted in a technology that has recently sprung like a Phoenix from the proverbial ashes, reinvigorated by the advent of Big Data and cloud computing: Enterprise Search.”

According to Wikipedia, Enterprise Search is exactly what it sounds like, namely, “the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.” Doscher is excited about the prospect of plumbing the depths of dark data because, he writes, “Dark Data is essentially an unedited record that is less subject to bias or inaccuracy than consciously connected data. It really is the raw record of a business’ history.” He continues:

“Without Enterprise Search, businesses can only scratch the surface of the knowledge hiding within the data. Adding Enterprise Search to the mix, however, brings the semi- and unstructured data to life. For instance, consider the following business cases where search is used to mitigate risks and impact bottom lines. An insurance company prices policies through sophisticated algorithms based largely on probabilities. But what if actuarial policies could be issued, let’s say for a trucking company, based on actual miles logged, violations, number and age of drivers, preferred routes, training courses, driver experience and other real-world data in a precise, dynamic way? That detailed information, some of which is structured but most of which is not neatly organized in a single database. Enterprise Search integrates all of the data and extracts insightful information in near real-time.”
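To make Doscher’s trucking example concrete, the sketch below shows, in miniature, what “integrating” structured and unstructured data for dynamic pricing might look like. The data, field names, risk weights, and pattern-matching approach are all invented for illustration; a real Enterprise Search platform would index and query the documents rather than scan them with regular expressions.

```python
import re

# Structured record from the insurer's policy database (hypothetical fields).
structured = {"fleet_id": "TRK-42", "miles_logged": 1_200_000, "drivers": 18}

# Unstructured text such as inspection reports or driver logs.
documents = [
    "Driver cited for speeding violation on I-80, March 3.",
    "Completed defensive-driving training course, no violations noted.",
    "Logbook violation recorded during roadside inspection.",
]

# A toy "search" pass: extract risk signals from free text with patterns.
violations = sum(len(re.findall(r"\bviolation\b", d.lower())) for d in documents)
trainings = sum("training" in d.lower() for d in documents)

# Blend both kinds of data into a single risk-adjusted rate (illustrative weights).
base_rate_per_mile = 0.05
risk_factor = 1 + 0.10 * violations - 0.05 * trainings
premium = structured["miles_logged"] * base_rate_per_mile * risk_factor
```

The point of the sketch is the blend: the per-mile rate comes from the structured database, while the risk adjustment is mined out of free text that would otherwise stay dark.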

Undoubtedly, insurers would like to get more personal when it comes to pricing and issuing policies, but some level of aggregation is necessary — after all, probabilities do still matter. There are fine lines that separate legitimate analytic searches and invasions of privacy. When you start probing for dark data, you need to make sure that those lines are not crossed. Take, for example, the search for fraudulent activity. The benefits of Big Data analytics for detecting fraud are undeniable. To make this point, Doscher returns to his insurance example.

“In order to better manage risk, imagine if the insurance company could put into Hadoop information about every claim ever made – billions and billions of unstructured documents – then run a query across all of that information to uncover trends that indicate fraudulent activity. By sharing those findings with their actuarial and analyst groups, the insurer would be prepared to watch for precursors of fraudulent activity and stop the fraud before it happens. Much of the data needed to make these processes a reality exists, but it is Dark and resides on paper and in disparate data sets.”
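The kind of query Doscher imagines running over Hadoop can be sketched in map-reduce style on a toy in-memory dataset. The records and the fraud signal (identical claim “fingerprints” repeated by one provider, suggesting duplicate billing) are hypothetical; at Hadoop scale the same map and reduce steps would be distributed across a cluster.

```python
from collections import defaultdict

# Toy claim records standing in for billions of documents in Hadoop (hypothetical).
claims = [
    {"provider": "P1", "patient": "A", "procedure": "MRI", "amount": 1200},
    {"provider": "P1", "patient": "A", "procedure": "MRI", "amount": 1200},
    {"provider": "P1", "patient": "A", "procedure": "MRI", "amount": 1200},
    {"provider": "P2", "patient": "B", "procedure": "X-ray", "amount": 150},
]

# Map step: emit a fingerprint for each claim.
def map_claim(claim):
    key = (claim["provider"], claim["patient"], claim["procedure"], claim["amount"])
    return key, 1

# Reduce step: count identical fingerprints; repeats suggest duplicate billing.
counts = defaultdict(int)
for claim in claims:
    key, one = map_claim(claim)
    counts[key] += one

flagged = [key for key, n in counts.items() if n >= 3]
```

Findings like `flagged` are what would be handed to the actuarial and analyst groups Doscher mentions, as precursors to watch for.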

Doscher isn’t the only analyst who believes that dark data can illuminate fraudulent activity. Last fall, researchers from the University of Virginia and Brigham Young University announced they had teamed up to “create the most robust and accurate fraud detection system to date using information from publicly available financial statements.” [“BYU researchers detect fraud with highest accuracy to date,” BYU news release, 17 September 2012] The news release explains:

“Using business intelligence software that learns and adapts as it processes data, a team of professors from the Marriott School of Management developed a model that correctly detects fraud with 90 percent accuracy. ‘We’ve improved on 30 years of research in terms of accuracy in capturing fraud patterns,’ said Jim Hansen, information systems professor and study co-author. ‘This improved detection is crucial given the broad societal costs of management fraud.’ Major fraud scandals at Enron, WorldCom and several other firms around the turn of the century were the catalyst for a 78 percent drop in the NASDAQ between 2000 and 2002. Despite changes to internal control procedures following those scandals, fraud continues to plague economies throughout the world. In 2008, fraud was discovered when Lehman Brothers filed for the largest bankruptcy in U.S. history, which dropped the stock market 22 percent and contributed to the most recent recession. Lead author Ahmed Abbasi of the University of Virginia, along with Hansen and BYU information systems colleagues Conan Albrecht and Anthony Vance have spent several years developing their fraud detection tool, ‘MetaFraud.’ The MetaFraud framework is comprised of several base-level artificial intelligence ‘learners’ that feed their results into a ‘meta’ or overarching business intelligence algorithm that learns and adapts over time.”

Perhaps the one area that could benefit most from fraud detection is the healthcare industry. Roger Foster, senior director of DRC’s high performance technologies group, asserts, “Due to the large dollar amounts and the number of companies and people involved in the healthcare system there is a huge potential for abusive behaviors at all levels.” [“Top 9 fraud and abuse areas big data tools can target,” Government Health IT, 14 May 2012]

Jo-Ellen Abou Nader, senior director for program integrity at Express Scripts, reports that “between 3% and 10% of every healthcare dollar spent in the U.S. is lost to fraud. That equals as much as $224 billion each year — a heavy burden for the nation’s healthcare system.” [“,” Healthcare Insights, 14 January 2013] Alexis Madrigal reports, “Express Scripts sits at an interesting spot within the nation’s health care system, right between pharmacies and health care plans. That means they see 1.4 billion prescriptions a year, each one of which … adds a little more data to their pile. They now have 100 people sorting through that information trying to detect fraud. They’ve got nurses and pharmacists and forensic accountants, along with a group of data nerds investigating thousands of cases of shady dealings a year.” [“How Big Data Can Catch Oxycontin Abusers and Bad Docs,” The Atlantic, 21 February 2013] Abou Nader told Madrigal that one case of fraud Express Scripts uncovered involved “a single doctor who doled out 22,000 pills of narcotics to 30 people. That’s $4.6 million of drugs. The doc was smart, too: he changed the strengths of the prescriptions he was writing so that simple reviews of his data might miss that he was giving [a] massive number of pills to patients.” But his shadowy acts were discovered when the illuminating light of Big Data analytics was put to use.

Foster asserts that “big-data tools can be used to review large healthcare claims and billing information to target the following”:

  • Payment risk associated with each provider
  • Over-utilization of services in very short-time windows
  • Patients simultaneously enrolled in multiple states
  • Geographic dispersion of patients and providers
  • Patients traveling large distances for controlled substances
  • Likelihood of certain types of billing errors
  • Billing for ‘unlikely’ services
  • Pre-established code pair violation
  • Up-coding claims to bill at higher rates
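The Express Scripts case above — varying prescription strengths so that per-script reviews look normal — illustrates why totals per prescriber matter. The sketch below aggregates pills and milligrams across strengths; the records, field names, and review threshold are hypothetical stand-ins for the kind of query such a team might run.

```python
from collections import defaultdict

# Hypothetical prescription records: strengths vary, so each script alone looks
# unremarkable, but the totals per prescriber tell a different story.
prescriptions = [
    {"prescriber": "D1", "drug": "oxycodone", "strength_mg": 10, "pills": 60},
    {"prescriber": "D1", "drug": "oxycodone", "strength_mg": 20, "pills": 90},
    {"prescriber": "D1", "drug": "oxycodone", "strength_mg": 40, "pills": 120},
    {"prescriber": "D2", "drug": "oxycodone", "strength_mg": 10, "pills": 30},
]

# Aggregate total pills and total milligrams per prescriber, across strengths.
totals = defaultdict(lambda: {"pills": 0, "mg": 0})
for rx in prescriptions:
    t = totals[rx["prescriber"]]
    t["pills"] += rx["pills"]
    t["mg"] += rx["pills"] * rx["strength_mg"]

# Flag prescribers whose totals exceed a review threshold (illustrative).
PILL_LIMIT = 200
flagged = [doc for doc, t in totals.items() if t["pills"] > PILL_LIMIT]
```

Aggregating in milligrams as well as pill counts closes the specific loophole the doctor exploited, since changing strengths no longer hides the overall volume.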

He also asserts that “big data can be used to reduce administrative inefficiencies in healthcare systems.” Every taxpayer and healthcare insurance participant should welcome the fact that tools are now becoming available to help detect fraud. It’s just another way that Big Data analytics benefit us all.