MCA/BTech - Data Science - Understanding Data Preparation (Questions and Answers)
Q1: Why is there a need to prepare data?
Answer: It is estimated that over 2.5 exabytes of data are created and collected by people and organisations each day. The main reasons we need to prepare data are:
① 60% to 95% of the time in a data mining project is spent preparing the data. Some data preparation is needed for all mining tools.
② The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool.
③ The prediction error rate should be lower (or the same) after preparation than it was before.
④ Data need to be formatted for a given software tool.
⑤ Data need to be made adequate for a given method.
⑥ Data in the real world is dirty. It could be incomplete, noisy and/or inconsistent.
⊙ Incomplete Data: Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. e.g. occupation=“”
⊙ Noisy data: containing errors or outliers. e.g. Salary=“-10”, Age=“222”
⊙ Inconsistent Data: containing discrepancies in codes or names.
e.g. Earlier rating was “1,2,3”, now rating is “A, B, C”
That's why we need to process, explore and condition the data before we can model it.
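The three kinds of dirty data listed above can be checked for with a few simple rules. The following is only a minimal sketch in plain Python: the sample records, the valid rating codes and the plausibility thresholds (age between 0 and 120, non-negative salary) are assumptions made for illustration, not part of the original notes.

```python
# Hypothetical sample records; the field names mirror the examples above.
records = [
    {"occupation": "engineer", "salary": 52000, "age": 34, "rating": "2"},
    {"occupation": "", "salary": -10, "age": 222, "rating": "A"},
    {"occupation": "teacher", "salary": 41000, "age": 45, "rating": "3"},
]

VALID_RATINGS = {"1", "2", "3"}  # assumed current coding scheme


def find_problems(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: a required attribute has no value.
    if not record["occupation"]:
        problems.append("incomplete: occupation is empty")
    # Noisy: values outside a plausible range (thresholds are assumptions).
    if record["salary"] < 0:
        problems.append("noisy: negative salary")
    if not 0 <= record["age"] <= 120:
        problems.append("noisy: implausible age")
    # Inconsistent: a code that does not belong to the expected scheme.
    if record["rating"] not in VALID_RATINGS:
        problems.append("inconsistent: unexpected rating code")
    return problems


# Map each record's position to the problems found in it.
report = {i: find_problems(r) for i, r in enumerate(records)}
for index, problems in report.items():
    print(index, problems)
```

In this sketch only the second record is flagged, and it shows all three kinds of problem at once. In practice the rules would come from domain knowledge about each attribute.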
Q2: Why is data dirty?
Answer: Real-world data can be incomplete, noisy and/or inconsistent, and each problem has its own typical causes:
① Incomplete data may come from:
- A “Not applicable” data value at the time of collection
- Different considerations between the time when the data was collected and when it was analysed.
- Human/hardware/software problems
② Noisy data (incorrect values) may come from:
- Faulty data collection procedures.
- Human or computer error at data entry.
- Errors in data transmission
③ Inconsistent data may come from:
- Different data sources
- Functional dependency violation (e.g., modifying some linked data)
- There are times when the data provided is arbitrary and we may not obtain expected results.
④ Duplicate records also need data cleaning.
⑤ Lack of Data Privacy
- Due to a lack of data regulations and inadequate data security, personal data may leak, e.g. when the collected data contains credit card numbers and passwords.
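For the duplicate records mentioned in point ④, one common approach is to compare records on a normalised key and keep only the first occurrence. The sketch below assumes hypothetical customer records with `name` and `email` fields, and assumes that two records are duplicates when both fields match after trimming spaces and ignoring case. Real data may need a different matching rule.

```python
def normalise_key(record):
    """Build a comparison key that ignores case and surrounding spaces."""
    return (record["name"].strip().lower(), record["email"].strip().lower())


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = normalise_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical customer records; the second one repeats the first.
customers = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "asha rao ", "email": "ASHA@example.com"},
    {"name": "Vikram Shah", "email": "vikram@example.com"},
]
cleaned = drop_duplicates(customers)
print(len(cleaned))  # 2
```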
Q3: Why is data preprocessing important?
① No quality data means no quality results.
② Data models and data warehouses need the integration of consistent data.
③ Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
④ The preparation of the data is concerned with obtaining, cleaning, normalising, and transforming data into an optimised dataset, and making it suitable for analysis.
Q4: What are the major tasks in data preparation?
① Data discretisation: Part of data reduction but with particular importance, especially for numerical data
② Data cleaning: Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
③ Data integration: Integration of multiple databases, data cubes, or files
④ Data transformation: Normalisation and aggregation
⑤ Data reduction: Obtains a representation that is reduced in volume but produces the same or similar analytical results
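Three of the tasks above (cleaning, transformation and discretisation) can be illustrated on a single numeric column. This is a minimal sketch under stated assumptions: the `ages` values are made up, missing values are filled with the mean (one of several possible strategies), normalisation is min-max scaling to [0, 1], and discretisation uses equal-width bins with assumed labels.

```python
from statistics import mean

ages = [22, None, 35, 47, None, 58]  # hypothetical column with missing values

# Data cleaning: fill missing values with the mean of the observed values.
observed = [a for a in ages if a is not None]
fill_value = mean(observed)
cleaned = [a if a is not None else fill_value for a in ages]

# Data transformation: min-max normalisation to the range [0, 1].
low, high = min(cleaned), max(cleaned)
normalised = [(a - low) / (high - low) for a in cleaned]

# Data discretisation: equal-width binning into three labelled intervals.
labels = ["young", "middle", "senior"]  # assumed labels
width = (high - low) / len(labels)


def to_bin(value):
    """Map a value to its bin label; the maximum falls into the last bin."""
    index = min(int((value - low) / width), len(labels) - 1)
    return labels[index]


binned = [to_bin(a) for a in cleaned]
print(cleaned)
print(normalised)
print(binned)
```

Filling with the mean keeps the column average unchanged but reduces its variance, so other strategies (median, a model-based estimate, or dropping the row) may suit a given dataset better.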
Q5: How does data cleaning play a vital role in the analysis?
Answer: Data cleaning can help in analysis because:
① Cleaning data from multiple sources helps transform it into a format that data analysts or data scientists can work with.
② Data cleaning helps increase the accuracy of machine learning models.
③ It is a cumbersome process because, as the number of data sources increases, the time taken to clean the data grows rapidly with both the number of sources and the volume of data they generate.
④ Cleaning alone might take up to 80% of the time, making it a critical part of the analysis task.
Q6: Differentiate between Observational Data and Experimental Data.
Observational Data: It is data collected based on what is seen or heard by a person, e.g. diagnostic data, data collected during a cohort study, or sales on a given date. Most collected data is observational, and it is collected passively.
Experimental Data: The data is collected following the scientific method using a prescribed methodology, e.g. clinical drug trials or A/B experiment data.
Q7: Differentiate between structured vs unstructured data.
Structured Data: Structured data is a clearly defined and searchable type of data. It is stored in a predefined format and usually has a schema. Structured data is commonly stored in data warehouses.
Unstructured Data: Unstructured data refers to things like text, images and videos. It requires techniques to convert it into structured data for analysis. Unstructured data is commonly stored in data lakes.
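As a small illustration of converting unstructured data into structured data, free text can sometimes be parsed into rows with fixed fields. The sketch below uses Python's standard `re` module. The notes, the sentence pattern and the field names (`order_id`, `city`, `shipped_on`) are hypothetical, and text that does not fit the assumed pattern is simply skipped.

```python
import re

# Hypothetical free-text notes; the wording pattern is an assumption.
notes = [
    "Order 1042 shipped to Pune on 2023-05-14",
    "Order 1043 shipped to Delhi on 2023-05-15",
    "Customer called about a late delivery",
]

PATTERN = re.compile(r"Order (\d+) shipped to (\w+) on (\d{4}-\d{2}-\d{2})")


def to_row(text):
    """Return a structured row, or None when the text does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    order_id, city, date = match.groups()
    return {"order_id": int(order_id), "city": city, "shipped_on": date}


# Keep only the notes that could be turned into structured rows.
rows = [row for row in map(to_row, notes) if row is not None]
print(rows)
```

Images, audio and less regular text generally need heavier techniques than a regular expression, but the goal is the same: a fixed set of fields that analysis tools can query.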