
Learn about data mining, which combines statistics and artificial intelligence to analyze large data sets to discover useful information.

Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information from large data sets. Given the evolution of data warehousing technology and the growth of big data, adoption of data mining techniques has accelerated rapidly over the last couple of decades, helping companies transform their raw data into useful knowledge. However, even though the technology continuously evolves to handle data at scale, leaders still face challenges with scalability and automation.

Data mining has improved organizational decision-making through insightful data analyses. The data mining techniques that underpin these analyses serve two main purposes: they can either describe the target dataset or predict outcomes through the use of machine learning algorithms. These methods are used to organize and filter data, surfacing the most interesting information, from fraud detection to user behaviors, bottlenecks, and even security breaches.

When combined with data analytics and visualization tools, like Apache Spark, delving into the world of data mining has never been easier and extracting relevant insights has never been faster. Advances within artificial intelligence only continue to expedite adoption across industries.  

Data mining process

The data mining process involves a number of steps, from data collection to visualization, to extract valuable information from large data sets. As mentioned above, data mining techniques are used to generate descriptions and predictions about a target data set. Data scientists describe data through their observations of patterns, associations, and correlations. They also classify and cluster data through methods such as classification and regression, and identify outliers for use cases like spam detection.

Data mining usually consists of four main steps: setting objectives, data gathering and preparation, applying data mining algorithms, and evaluating results.

1. Set the business objectives: This can be the hardest part of the data mining process, and many organizations spend too little time on this important step. Data scientists and business stakeholders need to work together to define the business problem, which helps inform the data questions and parameters for a given project. Analysts may also need to do additional research to understand the business context appropriately.

2. Data preparation: Once the scope of the problem is defined, it is easier for data scientists to identify which set of data will help answer the questions pertinent to the business. Once they collect the relevant data, it will be cleaned, removing any noise such as duplicates, missing values, and outliers. Depending on the dataset, an additional step may be taken to reduce the number of dimensions, as too many features can slow down any subsequent computation. Data scientists will look to retain the most important predictors to ensure optimal accuracy within any models.
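
As an illustration only, the sketch below shows what this preparation step might look like in Python with pandas. The customers.csv file, its column names (customer_id, income, age, last_purchase_days), and the outlier thresholds are hypothetical placeholders, not part of any particular project.

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
df = pd.read_csv("customers.csv")

# Remove exact duplicate records.
df = df.drop_duplicates()

# Drop rows missing the key identifier; impute a simple default elsewhere.
df = df.dropna(subset=["customer_id"])
df["income"] = df["income"].fillna(df["income"].median())

# Flag outliers in a numeric column using the interquartile range (IQR) rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Optionally reduce dimensionality by keeping only the most useful predictors.
df = df[["customer_id", "income", "age", "last_purchase_days"]]
```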

3. Model building and pattern mining: Depending on the type of analysis, data scientists may investigate any interesting data relationships, such as sequential patterns, association rules, or correlations. While high frequency patterns have broader applications, sometimes the deviations in the data can be more interesting, highlighting areas of potential fraud.

Deep learning algorithms may also be applied to classify or cluster a data set depending on the available data. If the input data is labelled (i.e. supervised learning), a classification model may be used to categorize data, or alternatively, a regression may be applied to predict the likelihood of a particular assignment. If the dataset isn’t labelled (i.e. unsupervised learning), the individual data points in the training set are compared with one another to discover underlying similarities, clustering them based on those characteristics.
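
To make the distinction concrete, here is a minimal scikit-learn sketch that contrasts the two settings on synthetic data; the specific estimators (logistic regression for the supervised case, k-means for the unsupervised case) are illustrative choices rather than the only options.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data standing in for a prepared dataset.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: labels are available, so train a classifier to categorize data.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised: ignore the labels and group points by underlying similarity.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignment:", km.labels_[:1])
```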

4. Evaluation of results and implementation of knowledge: Once the data is aggregated, the results need to be evaluated and interpreted. When finalizing results, they should be valid, novel, useful, and understandable. When these criteria are met, organizations can use this knowledge to implement new strategies, achieving their intended objectives.

Data mining techniques

Data mining works by using various algorithms and techniques to turn large volumes of data into useful information. Here are some of the most common ones:

Association rules: An association rule is a rule-based method for finding relationships between variables in a given dataset. These methods are frequently used for market basket analysis, allowing companies to better understand relationships between different products. Understanding consumption habits of customers enables businesses to develop better cross-selling strategies and recommendation engines.
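
As a rough illustration of the idea, the following self-contained Python sketch computes the support and confidence of a single rule over a toy set of transactions; the items and numbers are made up, and a real project would typically rely on a dedicated algorithm such as Apriori rather than this brute-force check.

```python
# Toy market-basket data; each set is one customer's transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the rule holds among the transactions where it applies."""
    return support(antecedent | consequent) / support(antecedent)

rule = ({"diapers"}, {"beer"})
print("support:", support(rule[0] | rule[1]))   # 0.6 for this toy data
print("confidence:", confidence(*rule))          # 0.75 for this toy data
```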

Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold), and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting their weights based on the loss function through the process of gradient descent. When the cost function is at or near zero, we can be confident in the model’s ability to yield the correct answer.
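
The sketch below illustrates this mechanism with a single node trained by gradient descent on one labelled example; the weights, learning rate, and squared-error loss are arbitrary choices for demonstration, not a recipe for a production network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One node: inputs, weights, a bias, and an output.
weights = [0.5, -0.3]
bias = 0.1
x = [1.0, 2.0]    # example input
target = 1.0      # desired output (labelled data)
lr = 0.1          # learning rate for gradient descent

for _ in range(100):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    out = sigmoid(z)
    # Gradient of the squared-error loss with respect to the pre-activation z.
    grad = 2 * (out - target) * out * (1 - out)
    # Update each weight and the bias in the direction that reduces the loss.
    weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
    bias -= lr * grad

print("output after training:",
      sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias))
```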

Decision tree: This data mining technique uses classification or regression methods to classify or predict potential outcomes based on a set of decisions. As the name suggests, it uses a tree-like visualization to represent the potential outcomes of these decisions.
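
A minimal example of this technique, using scikit-learn's decision tree classifier on a built-in dataset, might look like the following; the dataset and the depth limit are illustrative choices to keep the learned rules readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree on a built-in dataset so the decision rules stay small.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the cascading if/else decisions the tree learned.
print(export_text(tree, feature_names=list(iris.feature_names)))
```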

K-nearest neighbor (KNN): K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. This algorithm assumes that similar data points can be found near each other. As a result, it seeks to calculate the distance between data points, usually through Euclidean distance, and then assigns a category based on the most frequent category or average.
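
For illustration, here is a small pure-Python sketch of that idea: it computes Euclidean distances to a handful of toy labelled points and assigns the most frequent category among the k closest; the data and the value of k are made up.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k=3):
    """Classify a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labelled points: (features, class).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_predict(train, (1.1, 1.0)))   # expected: "A"
```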

Data mining applications

Data mining techniques are widely adopted among business intelligence and data analytics teams, helping them extract knowledge for their organization and industry. Some data mining use cases include:

Sales and marketing

Companies collect a massive amount of data about their customers and prospects. By observing consumer demographics and online user behavior, companies can use data to optimize their marketing campaigns, improving segmentation, cross-sell offers, and customer loyalty programs, yielding higher ROI on marketing efforts. Predictive analyses can also help teams to set expectations with their stakeholders, providing yield estimates from any increases or decreases in marketing investment.

Education

Educational institutions have started to collect data to understand their student populations as well as which environments are conducive to success. As courses continue to move to online platforms, they can use a variety of dimensions and metrics to observe and evaluate performance, such as keystroke activity, student profiles, classes, universities, time spent, and more.

Operational optimization

Process mining leverages data mining techniques to reduce costs across operational functions, enabling organizations to run more efficiently. This practice has helped to identify costly bottlenecks and improve decision-making among business leaders.

Fraud detection

While frequently occurring patterns in data can provide teams with valuable insight, observing data anomalies is also beneficial, assisting companies in detecting fraud. While this is a well-known use case within banking and other financial institutions, SaaS-based companies have also started to adopt these practices to eliminate fake user accounts from their datasets.

Data mining and IBM

Partner with IBM to get started on your latest data mining project. IBM Watson Discovery digs through your data in real-time to reveal hidden patterns, trends and relationships between different pieces of content. Use data mining techniques to gain insights into customer and user behavior, analyze trends in social media and e-commerce, find the root causes of problems and more. There is untapped business value in your hidden insights. Get started with IBM Watson Discovery today.

Sign up for a free Watson Discovery account on IBM Cloud, where you gain access to apps, AI and analytics and can build with 40+ Lite plan services.

To learn more about IBM’s data warehouse solutions, sign up for an IBMid and create your free IBM Cloud account today.

Data mining is a process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their customers to develop more effective marketing strategies, increase sales and decrease costs. Data mining depends on effective data collection, warehousing, and computer processing.

  • Data mining is the process of analyzing a large batch of information to discern trends and patterns.
  • Data mining can be used by corporations for everything from learning about what customers are interested in or want to buy to fraud detection and spam filtering.
  • Data mining programs break down patterns and connections in data based on what information users request or provide.
  • Social media companies use data mining techniques to commodify their users in order to generate profit.
  • This use of data mining has come under criticism lately as users are often unaware of the data mining happening with their personal information, especially when it is used to influence preferences.

Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It can be used in a variety of ways, such as database marketing, credit risk management, fraud detection, spam email filtering, or even to discern the sentiment or opinion of users.

The data mining process breaks down into five steps. First, organizations collect data and load it into their data warehouses. Next, they store and manage the data, either on in-house servers or in the cloud. Business analysts, management teams, and information technology professionals access the data and determine how they want to organize it. Then, application software sorts the data based on the user's requests, and finally, the end-user presents the data in an easy-to-share format, such as a graph or table.

Data mining programs analyze relationships and patterns in data based on what users request. For example, a company can use data mining software to create classes of information. To illustrate, imagine a restaurant wants to use data mining to determine when it should offer certain specials. It looks at the information it has collected and creates classes based on when customers visit and what they order.

In other cases, data miners find clusters of information based on logical relationships or look at associations and sequential patterns to draw conclusions about trends in consumer behavior.

Warehousing is an important aspect of data mining. Warehousing is the process of centralizing data into one database or program. With a data warehouse, an organization may spin off segments of the data for specific users to analyze and use. However, in other cases, analysts may start with the data they want and create a data warehouse based on those specifications.

Cloud data warehouse solutions use the storage and computing power of a cloud provider to hold data from a variety of data sources. This allows smaller companies to leverage digital solutions for storage, security, and analytics.

Data mining uses algorithms and various techniques to convert large collections of data into useful output. The most popular types of data mining techniques include:

  • Association rules, also referred to as market basket analysis, search for relationships between variables. This relationship in itself creates additional value within the data set as it strives to link pieces of data. For example, association rules would search a company's sales history to see which products are most commonly purchased together; with this information, stores can plan, promote, and forecast accordingly.
  • Classification uses predefined classes to assign to objects. These classes describe characteristics of items or represent what the data points have in common with each other. This data mining technique allows the underlying data to be more neatly categorized and summarized across similar features or product lines.
  • Clustering is similar to classification. However, clustering identifies similarities between objects, then groups those items based on what makes them different from other items. While classification may result in groups such as "shampoo", "conditioner", "soap", and "toothpaste", clustering may identify groups such as "hair care" and "dental health".
  • Decision trees are used to classify or predict an outcome based on a set list of criteria or decisions. A decision tree asks a series of cascading questions that sort the dataset based on the responses given. Sometimes depicted as a tree-like visual, a decision tree allows for specific direction and user input when drilling deeper into the data.
  • K-Nearest Neighbor (KNN) is an algorithm that classifies data based on its proximity to other data. The basis for KNN is rooted in the assumption that data points that are close to each other are more similar to each other than other bits of data. This non-parametric, supervised technique is used to predict features of a group based on individual data points.
  • Neural networks process data through the use of nodes. Each node is made up of inputs, weights, and an output. Data is mapped through supervised learning (similar to how the human brain is interconnected), and the network is fit by adjusting its weights until its outputs cross the right thresholds, which helps determine the model's accuracy.
  • Predictive analysis strives to leverage historical information to build graphical or mathematical models that forecast future outcomes. Overlapping with regression analysis, this data mining technique aims to project an unknown future figure based on the data currently on hand (see the sketch after this list).
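
To show how the predictive-analysis idea in the last bullet can look in practice, here is a minimal sketch that fits a linear regression to a hypothetical sales history and projects the next month's figure; the numbers are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales history (units sold per month).
months = np.arange(1, 13).reshape(-1, 1)
sales = np.array([110, 118, 123, 131, 140, 145, 152, 160, 166, 171, 180, 188])

# Fit a simple trend line to the historical data.
model = LinearRegression().fit(months, sales)

# Project the unknown figure for month 13 based on the fitted trend.
print("forecast for month 13:", model.predict([[13]])[0])
```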

To be most effective, data analysts generally follow a certain flow of tasks along the data mining process. Without this structure, an analyst may encounter an issue in the middle of their analysis that could have easily been prevented had they prepared for it earlier. The data mining process is usually broken into the following steps.

Before any data is touched, extracted, cleaned, or analyzed, it is important to understand the underlying entity and the project at hand. What are the goals the company is trying to achieve by mining data? What is their current business situation? What are the findings of a SWOT analysis? Before looking at any data, the mining process starts by understanding what will define success at the end of the process.

Once the business problem has been clearly defined, it's time to start thinking about data. This includes what sources are available, how the data will be securely stored, how information will be gathered, and what the final outcome or analysis may look like. This step also critically considers what limits there are to data, storage, security, and collection, and assesses how these constraints will impact the data mining process.

It's now time to get our hands on the information. Data is gathered, uploaded, extracted, or calculated. It is then cleaned, standardized, scrubbed for outliers, assessed for mistakes, and checked for reasonableness. During this stage of data mining, the data may also be checked for size, as an overly large collection of information may unnecessarily slow computations and analysis.

With our clean data set in hand, it's time to crunch the numbers. Data scientists use the types of data mining above to search for relationships, trends, associations, or sequential patterns. The data may also be fed into predictive models to assess how previous bits of information may translate into future outcomes.

The data-centered aspect of data mining concludes by assessing the findings of the data model(s). The outcomes from the analysis may be aggregated, interpreted, and presented to decision-makers who have largely been excluded from the data mining process to this point. In this step, organizations can choose to make decisions based on the findings.

The data mining process concludes with management taking steps in response to the findings of the analysis. The company may decide the information was not strong enough or the findings were not relevant to change course. Alternatively, the company may strategically pivot based on the findings. In either case, management reviews the ultimate impact on the business and sets up future data mining loops by identifying new business problems or opportunities.

Different data mining processing models will have different steps, though the general process is usually pretty similar. For example, the Knowledge Discovery in Databases (KDD) model has nine steps, the CRISP-DM model has six steps, and the SEMMA process model has five steps.

In today's age of information, it seems like almost every department, industry, sector, and company can make use of data mining. Data mining is a flexible process with many different applications, as long as there is a body of data to analyze.

The ultimate goal of a company is to make money, and data mining encourages smarter, more efficient use of capital to drive revenue growth. Consider the point-of-sale register at your favorite local coffee shop. For every sale, that coffeehouse collects the time a purchase was made, what products were sold together, and what baked goods are most popular. Using this information, the shop can strategically craft its product line.

Once the coffeehouse above knows its ideal line-up, it's time to implement the changes. However, to make its marketing efforts more effective, the store can use data mining to understand where its clients see ads, what demographics to target, where to place digital ads, and what marketing strategies most resonate with customers. This includes aligning marketing campaigns, promotional offers, cross-sell offers, and programs to findings of data mining.

For companies that produce their own goods, data mining plays an integral part in analyzing how much each raw material costs, what materials are being used most efficiently, how time is spent along the manufacturing process, and what bottlenecks negatively impact the process. Data mining helps ensure the flow of goods is uninterrupted and least costly.

The heart of data mining is finding patterns, trends, and correlations that link data points together. Therefore, a company can use data mining to identify outliers or correlations that should not exist. For example, a company may analyze its cash flow and find a recurring transaction to an unknown account. If this is unexpected, the company may wish to investigate whether funds are being mismanaged.

Human resources often has a wide range of data available for processing including data on retention, promotions, salary ranges, company benefits and utilization of those benefits, and employee satisfaction surveys. Data mining can correlate this data to get a better understanding of why employees leave and what entices recruits to join.

Customer satisfaction can be built (or destroyed) for a variety of reasons. Imagine a company that ships goods. A customer may become unhappy with shipping time, shipping quality, or communication on shipment expectations. That same customer may become frustrated with long telephone wait times or slow e-mail responses. Data mining gathers operational information about customer interactions and summarizes the findings to determine weak points as well as highlights of what the company is doing right.

Data mining ensures a company is collecting and analyzing reliable data. It is often a more rigid, structured process that formally identifies a problem, gathers data related to the problem, and strives to formulate a solution. Therefore, data mining helps a business become more profitable, efficient, or operationally stronger.

Data mining can look very different across applications, but the overall process can be used with almost any new or legacy application. Essentially any type of data can be gathered and analyzed, and almost every business problem that relies on quantifiable evidence can be tackled using data mining.

The end goal of data mining is to take raw bits of information and determine if there is cohesion or correlation among the data. This benefit of data mining allows a company to create value with the information they have on hand that would otherwise not be overly apparent. Though data models can be complex, they can also yield fascinating results, unearth hidden trends, and suggest unique strategies.

This complexity of data mining is one of the largest disadvantages of the process. Data analytics often requires technical skill sets and certain software tools. Some smaller companies may find this a barrier to entry that is too difficult to overcome.

Data mining doesn't always guarantee results. A company may perform statistical analysis, make conclusions based on strong data, implement changes, and not reap any benefits. Through inaccurate findings, market changes, model errors, or inappropriate data populations, data mining can only guide decisions and not ensure outcomes.

There is also a cost component to data mining. Data tools may require costly ongoing subscriptions, and some bits of data may be expensive to obtain. Security and privacy concerns can be addressed, though the additional IT infrastructure may be costly as well. Data mining may also be most effective when using huge data sets; however, these data sets must be stored and require heavy computational power to analyze.

Even large companies or government agencies have challenges with data mining. Consider the FDA's white paper on data mining that outlines the challenges of bad information, duplicate data, underreporting, or overreporting.

One of the most lucrative applications of data mining has been that of social media. Platforms like Facebook (owned by Meta), TikTok, Instagram, and Twitter gather reams of data about individual users to make inferences about their preferences in order to send targeted marketing ads. This data is also used to try to influence user behavior and change their preferences, whether it be for a consumer product or who they will vote for in an election.

Data mining on social media has become a big point of contention, with several investigative reports and exposés showing just how nefarious mining users' data can be. At the heart of the issue, users may agree to the terms and conditions of the sites without realizing how their personal information is being collected or to whom their information is being sold.

Data mining can be used for good, or it can be used illicitly. Here is an example of both.

eBay collects countless bits of information every day, spanning listings, sales, buyers, and sellers. eBay uses data mining to attribute relationships between products, assess desired price ranges, analyze prior purchase patterns, and form product categories. eBay outlines the recommendation process as follows; a simplified sketch of this kind of pipeline appears after the list:

  1. Raw item metadata and user historical data are aggregated.
  2. Scripts are run on a trained model to generate predictions for items and users.
  3. A KNN search is performed.
  4. The results are written to a database.
  5. The real-time recommendation takes the user ID, calls the database results, and displays them to the user.
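
The following sketch is not eBay's actual system; it is only a generic, simplified illustration of how steps 1 through 5 of such a pipeline could fit together, using toy item vectors, a scikit-learn nearest-neighbor search, and an in-memory dictionary standing in for the results database.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Step 1 (illustrative): toy item vectors standing in for aggregated item
# metadata, plus each user's history as the items they have interacted with.
item_vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
item_ids = ["i1", "i2", "i3", "i4"]
user_history = {"u1": [0], "u2": [2]}   # indices of items each user touched

# Steps 2-3: derive a simple user vector and run a KNN search over the items.
knn = NearestNeighbors(n_neighbors=2).fit(item_vectors)
results_db = {}                          # step 4: stand-in for the results database
for user, history in user_history.items():
    user_vec = item_vectors[history].mean(axis=0, keepdims=True)
    _, idx = knn.kneighbors(user_vec)
    # A real system would typically filter out items the user has already seen.
    results_db[user] = [item_ids[i] for i in idx[0]]

# Step 5: a real-time lookup just reads the precomputed results by user ID.
print(results_db["u1"])
```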

Another cautionary example of data mining is the Facebook-Cambridge Analytica data scandal. During the 2010s, the British consulting firm Cambridge Analytica collected personal data belonging to millions of Facebook users. This information was later analyzed to assist the 2016 presidential campaigns of Ted Cruz and Donald Trump. It is also suspected that Cambridge Analytica interfered with other notable events such as the Brexit referendum.

In light of the inappropriate data mining and misuse of user data, Facebook agreed to pay $100 million for misleading investors about the use of consumer data. The Securities and Exchange Commission claimed Facebook discovered the misuse in 2015 but did not correct its disclosures for more than two years.

Data mining is broken into two basic aspects: predictive data mining and descriptive data mining. Predictive data mining is a type of analysis that extracts data that may be helpful in determining an outcome. Descriptive data mining is a type of analysis that informs users about the key characteristics of a given data set and the outcomes it contains.

Data mining relies on big data and advanced computing processes including machine learning and other forms of artificial intelligence (AI). The goal is to find patterns that can lead to inferences or predictions from otherwise unstructured or large data sets.

Data mining also goes by the less-used term knowledge discovery in data, or KDD.

Data mining applications range from the financial sector, where firms look for patterns in the markets, to governments trying to identify potential security threats. Corporations, and especially online and social media companies, use data mining on their users to create profitable advertising and marketing campaigns that target specific sets of users.

Modern businesses have the ability to gather information on customers, products, manufacturing lines, employees, and storefronts. These random pieces of information may not tell a story on their own, but the use of data mining techniques, applications, and tools helps piece the information together to drive value. The ultimate goal of the data mining process is to compile data, analyze the results, and execute operational strategies based on those results.