Lecture 1 — Data & Statistics (20 Advanced Application Questions)

Q1 — Choosing Data Type for a KPI Redesign

A bank wants to track “customer relationship depth” for cross-selling. They propose coding it as: 0=None, 1=Single product, 2=Two products, 3=Three or more. The data team must pick the proper scale of measurement and explain which analyses are and are not appropriate.


Q2 — Cross-Section vs Time Series Decision

Your CFO asks whether to build a cross-section or time-series dataset to study drivers of late payments. You have one month to advise on next quarter's credit policy.


Q3 — Avoiding Unethical Graphs

A marketing slide shows a bar chart with a truncated y-axis from 90 to 100, making tiny increases look huge. You must critique and fix it.
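
The size of the distortion can be put in numbers by comparing drawn bar heights under the two baselines. This is a minimal sketch; the values 92 and 96 are hypothetical and do not come from the slide.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at baseline."""
    return (b - baseline) / (a - baseline)

# Hypothetical metric values before and after (assumed for illustration).
before, after = 92.0, 96.0

honest = bar_height_ratio(before, after, baseline=0.0)      # axis starts at 0
truncated = bar_height_ratio(before, after, baseline=90.0)  # axis starts at 90

print(f"true ratio: {honest:.3f}, drawn ratio on truncated axis: {truncated:.1f}")
```

With these assumed numbers a change of about 4% is drawn as a bar three times taller, which is the effect the critique should name.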


Q4 — Sampling: Observational vs Experimental

HR wonders if a new interview rubric improves hiring quality. Budget allows only historical data analysis for now.


Q5 — Variable Design: Ratio or Interval?

A retailer tracks app session length (minutes) and Net Promoter Score (−100 to 100). Classify each scale and give one valid transformation per variable.


Q6 — Data Errors and Controls

Finance found negative inventory values after an ERP migration. Propose two pre-processing controls and one post-migration audit.


Q7 — Choosing Summary Statistics

CEO wants a single “typical” salary number for PR. Pay is right-skewed with a few star earners.
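
A small sketch of why the choice matters under right skew. The salary figures below are hypothetical (in thousands), with two assumed "star earners" at the top.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; the last two are the star earners.
salaries = [48, 52, 55, 58, 60, 63, 67, 70, 400, 900]

def typical_pay(values):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(values), median(values)

avg, mid = typical_pay(salaries)
print(f"mean = {avg:.1f}k, median = {mid:.1f}k")
```

In this made-up sample the mean sits above what eight of the ten employees earn, while the median stays inside the main body of the distribution.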


Q8 — Dashboard KPI Integrity

A dashboard shows weekly mean revenue without the customer count. It spikes when a low-volume week includes one large deal. How would you stabilize it?


Q9 — Categorical vs Quantitative Encoding

Logistics tracks “delivery window” as Morning/Afternoon/Evening. Analyst encoded Morning=1, Afternoon=2, Evening=3 and ran linear regression.


Q10 — Population vs Sample Framing

A churn analysis uses all current customers (200k) and calls it “population,” concluding p-values are unnecessary.


Q11 — KPI Manipulation Risk

A plant is rated by “defect rate per inspected unit.” They start inspecting fewer units.


Q12 — Selecting Visualization for Stakeholders

Legal team wants to compare complaint categories across two years and highlight which categories grew the most in share.


Q13 — Data Source Trade-offs

You can purchase a costly consumer panel dataset or scrape social media mentions for free. The research question is “average monthly spend by life-stage.”


Q14 — Construct Validity Check

A “financial literacy score” sums correct answers (0–10). Marketing correlates it with “investment assets ($).” They claim assets cause higher literacy.


Q15 — Unit Consistency in Data Warehouse

Sales uploads prices sometimes in USD, sometimes in CNY, same column name.


Q16 — Choosing Central Tendency for SLA

IT commits to a “typical ticket resolution time” SLA. Resolution times are heavy-tailed.


Q17 — Designing a Pilot with Ethical Bounds

To test a pricing change, you consider randomizing prices across customers. Legal warns about fairness.


Q18 — From Descriptive to Predictive

Analytics team produced great dashboards but no actions. Propose one path from description → prediction → prescription.


Q19 — Handling Big Data Overfitting

A rich feature set yields an AUC of 0.95 on training but 0.68 on test.


Q20 — Communicating Statistical Inference to Execs

Your “battery life improvement” test shows p=0.03 for +5% mean increase. Exec asks: “So we’re 97% sure it’s better?”


Q1 — Binning Choice and Business Impact

Your app’s session length (minutes) is right-skewed. A PM proposes 5 equal-width bins for the homepage dashboard. Data team worries about misleading “most users in 0–5” messaging.


Q2 — Outlier Policy vs Root Cause

Defect times show a few 10× spikes. Ops asks to winsorize at P99 and move on. QA asks to trace root cause first.
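
The two requests are not mutually exclusive: values can be capped for reporting while the capped records are kept for investigation. The sketch below assumes a nearest-rank percentile (other percentile definitions give slightly different caps) and uses hypothetical data with two spikes.

```python
import math

def winsorize_upper(values, pct=99):
    """Cap values above the nearest-rank pct-th percentile at that percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-indexed nearest rank
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap

# Hypothetical defect times: 198 ordinary values (5..14) plus two roughly 10x spikes.
times = [float(5 + i % 10) for i in range(198)] + [100.0, 120.0]

capped, cap = winsorize_upper(times, pct=99)
flagged = [v for v in times if v > cap]  # retained for root-cause review

print(f"cap = {cap}, flagged for QA = {flagged}")
```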


Q3 — Choosing the “Typical” Basket

Grocery wants a “typical basket size” KPI for store ranking. Distribution is bimodal (quick trips vs stock-up).


Q4 — Boxplot vs Violin for Execs

For payroll equity, you must show pay level differences across regions to a non-technical board.


Q5 — Missing Data Mechanism Test

Credit limit has 7% missing; exploratory plots show missingness higher among low-income segments.


Q6 — Simpson’s Paradox Alarm

Company-wide, variant A converts worse than B (A < B); within each channel, A converts better than B (A > B). Marketing asks which variant to ship.
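
A reversal like this can arise when the variants receive very different traffic mixes across channels. The counts below are hypothetical, chosen only to reproduce the pattern.

```python
# Hypothetical (conversions, visitors) per channel and variant.
data = {
    "email":   {"A": (18, 20),  "B": (160, 200)},
    "display": {"A": (40, 200), "B": (2, 20)},
}

def rate(conversions, visitors):
    return conversions / visitors

def pooled_rate(variant):
    """Company-wide conversion rate: pool counts across channels, then divide."""
    conv = sum(data[ch][variant][0] for ch in data)
    n = sum(data[ch][variant][1] for ch in data)
    return rate(conv, n)

per_channel = {ch: {v: rate(*data[ch][v]) for v in ("A", "B")} for ch in data}

for ch, rates in per_channel.items():
    print(f"{ch}: A = {rates['A']:.2f}, B = {rates['B']:.2f}")
print(f"pooled: A = {pooled_rate('A'):.2f}, B = {pooled_rate('B'):.2f}")
```

Here A wins inside both channels, yet B wins when pooled, because most of B's traffic sits in the high-converting channel.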


Q7 — Pareto and Long Tail Actions

25% of SKUs contribute 88% of GMV; the tail has thousands of items with sporadic sales.
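
A concentration figure of this kind can be computed by sorting SKUs by GMV and taking the share held by the top fraction. The eight GMV values below are hypothetical, picked so the top 25% holds 88%.

```python
def top_share(gmv_by_sku, top_fraction):
    """Share of total GMV contributed by the top `top_fraction` of SKUs."""
    ordered = sorted(gmv_by_sku, reverse=True)
    k = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical GMV per SKU (assumed values, 8 SKUs).
gmv = [440, 440, 20, 20, 20, 20, 20, 20]

share = top_share(gmv, 0.25)
print(f"top 25% of SKUs hold {share:.0%} of GMV")
```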


Q8 — Detecting Seasonality in EDA

Daily orders show weekly cycles and holiday spikes. Manager wants a simple seasonal index.
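
One simple form of seasonal index is the ratio of each weekday's mean to the overall mean. The sketch below assumes that definition and two hypothetical weeks of data; holiday dates would normally be excluded or handled separately before computing it.

```python
from collections import defaultdict
from statistics import mean

def weekday_index(daily_orders):
    """daily_orders: list of (weekday, orders). Returns weekday -> mean / overall mean."""
    by_day = defaultdict(list)
    for day, orders in daily_orders:
        by_day[day].append(orders)
    overall = mean(orders for _, orders in daily_orders)
    return {day: mean(values) / overall for day, values in by_day.items()}

# Hypothetical orders for two weeks; weekday 0 = Monday ... 6 = Sunday.
week1 = [100] * 5 + [150, 50]
week2 = [110] * 5 + [165, 55]
daily = [(d, o) for week in (week1, week2) for d, o in enumerate(week)]

index = weekday_index(daily)
print({day: round(value, 2) for day, value in index.items()})
```

An index above 1 marks a weekday that runs above the average day, and below 1 one that runs under it.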


Q9 — Correlation Trap

Scatter of ad spend vs revenue shows r=0.78. After de-seasonalization, r drops to 0.22.


Q10 — Feature Leakage During EDA

Churn label is defined using “no purchase in next 90 days.” Analyst plots features including “days since next purchase.”


Q11 — Choosing a Robust Scale for Plot

Delivery times have extreme outliers. The histogram hides the body of the distribution. Which axis scale or transform would you use?


Q12 — Categorical EDA With Many Levels

There are 600 SKU categories. Bar charts become unreadable.


Q13 — Multivariate Outliers

A three-variable dataset (price, promo depth, demand) shows no univariate outliers, but odd combinations of the three exist.


Q14 — Small-n Visualization

Clinic pilot with 18 patients; outcome is binary. Which plot would you use?


Q15 — Ranking Stores Fairly

Raw conversion rates rank small stores higher due to variance. How to rank?
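
One option among several (shrinkage toward the overall rate is another) is to rank by the lower bound of a Wilson score interval, which penalizes small samples. The store names and counts below are hypothetical.

```python
import math

def wilson_lower(conversions, visitors, z=1.96):
    """Lower bound of the Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if visitors == 0:
        return 0.0
    p = conversions / visitors
    denom = 1 + z * z / visitors
    centre = p + z * z / (2 * visitors)
    margin = z * math.sqrt(p * (1 - p) / visitors + z * z / (4 * visitors * visitors))
    return (centre - margin) / denom

# Hypothetical stores: (conversions, visitors).
stores = {"small_store": (3, 10), "large_store": (250, 1000)}

raw = {name: c / n for name, (c, n) in stores.items()}
adjusted = {name: wilson_lower(c, n) for name, (c, n) in stores.items()}
ranking = sorted(adjusted, key=adjusted.get, reverse=True)

print("raw:", raw)
print("ranking by Wilson lower bound:", ranking)
```

With these assumed counts the small store leads on raw rate (0.30 vs 0.25) but drops below the large store once sample size is accounted for.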


Q16 — EDA for A/B Pre-Check

Before launching an experiment, you must check balance between treatment/control historical covariates.


Q17 — Time Granularity Choice

Fraud signals arrive with second-level timestamps; the business reviews monthly. At which granularity should the EDA be done?


Q18 — Reference Class Forecasting

CFO asks for expected cost of a new data center. Prior projects show optimistic bias.


Q19 — Label Noise Check

Manual labels for complaint types disagree 12% between raters.
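
Raw disagreement does not account for agreement that would occur by chance; Cohen's kappa is one common correction for two raters. The sketch below uses hypothetical labels for 25 complaints with 3 disagreements (12%), and the category names are assumptions.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels: 25 complaints, raters disagree on 3 of them.
rater_a = ["billing"] * 10 + ["service"] * 10 + ["other"] * 5
rater_b = (["billing"] * 9 + ["service"]
           + ["service"] * 9 + ["other"]
           + ["other"] * 4 + ["billing"])

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.4f}")
```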


Q20 — Communicating EDA Limits

Exec wants “definitive drivers” from EDA slides tomorrow.