Lecture 1 — Data & Statistics (20 Advanced Application Questions)
Q1 — Choosing Data Type for a KPI Redesign
A bank wants to track “customer relationship depth” for cross-selling. They propose to code it as: 0=None, 1=Single product, 2=Two products, 3=Three or more. The data team must pick the proper scale of measurement and explain what analyses are (not) appropriate.
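📖 Answer
Ordinal scale. Allowed: medians, percentiles, nonparametric tests, bar charts. Not appropriate: operations requiring equal intervals or meaningful ratios (e.g., arithmetic mean comparisons may be misleading; ratios are invalid).
📝 Explanation
The variable is an ordered ranking (0<1<2<3), but the spacing between adjacent levels is not guaranteed to be equal, so it sits on an ordinal scale. Order and position can be compared (median, quantiles), but the codes should not be treated as interval data for mean differences or for a linear reading of regression coefficients, and ratio claims such as "3 is three times 1" are invalid.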
Q2 — Cross-Section vs Time Series Decision
Your CFO asks whether to build a cross-section or time-series dataset to study drivers of late payments. You have one month to advise credit policy next quarter.
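📖 Answer
Start with a cross-section (current customers, their attributes, and latest delinquency status) to identify policy drivers quickly; complement it with a short panel slice if feasible (recent 6–12 months) to capture dynamics.
📝 Explanation
Tight deadline → a cross-section shows fastest which characteristics are associated with late payment right now. If the recent periods of account-aging history can be added to form a short panel, the direction of causality and lagged effects can be identified more robustly. A pure time series is better suited to forecasting trends in aggregate rates (e.g., the overall delinquency rate) than to customer-level insight.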
Q3 — Avoiding Unethical Graphs
A marketing slide shows a bar chart with a truncated y-axis from 90 to 100, making tiny increases look huge. You must critique and fix it.
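📖 Answer
Critique: misleading scaling and visual exaggeration. Fix: start y-axis at 0, show data labels/CI, keep consistent intervals, add clear title/units.
📝 Explanation
Ethical guidelines stress avoiding misleading displays. A truncated axis exaggerates differences; start at zero, label units and uncertainty (e.g., confidence intervals), and keep scales and colors consistent.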
Q4 — Sampling: Observational vs Experimental
HR wonders if a new interview rubric improves hiring quality. Budget allows only historical data analysis for now.
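📖 Answer
This will be an observational study using past hires; be cautious about confounding. Recommend a phased A/B experimental rollout later (randomized rubric assignment) for causal evidence.
📝 Explanation
Historical data alone cannot control for confounders → correlation is not causation. Do an observational assessment first (as a baseline), then design a randomized controlled experiment to obtain causal identification.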
Q5 — Variable Design: Ratio or Interval?
A retailer tracks app session length (minutes) and Net Promoter Score (−100 to 100). Classify each scale and one valid transformation per variable.
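📖 Answer
Session length: Ratio (true zero, meaningful ratios). Valid: log transform for skew. NPS: Interval-like index (arbitrary zero). Valid: standardization (z-score) for comparisons.
📝 Explanation
Session length has a true zero, so multiples can be compared. The zero of NPS does not mean "no satisfaction"; it behaves more like an interval measure, suited to centering/standardization but not to ratio comparisons.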
Q6 — Data Errors and Controls
Finance found negative inventory values after an ERP migration. Propose two pre-processing controls and one post-migration audit.
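📖 Answer
Pre-controls: (1) schema & unit mapping checklist; (2) constraint rules (non-negative, integer where needed). Post-audit: stratified sample reconciliation vs source docs.
📝 Explanation
Before migration, harmonize fields and units and set integrity constraints; after migration, reconcile a stratified sample against source documents to locate where rule/mapping errors concentrate.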
Q7 — Choosing Summary Statistics
CEO wants a single “typical” salary number for PR. Pay is right-skewed with a few star earners.
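📖 Answer
Use the median (robust to outliers) and accompany it with the IQR; avoid using the mean alone.
📝 Explanation
In a right-skewed distribution the mean is pulled up by extreme values; median + IQR reflects the "typical level" more faithfully.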
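A minimal sketch of the comparison, using only the standard library. The payroll figures and the function name `typical_pay` are hypothetical and chosen only to show how one star earner moves the mean but not the median.

```python
import statistics

def typical_pay(salaries):
    """Return mean, median, and IQR (inclusive quartiles) for a list of salaries."""
    q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(salaries),
        "median": q2,
        "iqr": q3 - q1,
    }

# Hypothetical right-skewed payroll: nine regular salaries and one star earner.
salaries = [48_000, 50_000, 52_000, 55_000, 58_000,
            60_000, 62_000, 65_000, 70_000, 900_000]
summary = typical_pay(salaries)
```

With this made-up payroll the mean lands at more than twice the median, which is why the median is the safer single number to quote.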
Q8 — Dashboard KPI Integrity
A dashboard shows weekly revenue mean without customer count. It spikes when low-volume weeks include one large deal. How to stabilize?
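📖 Answer
Add weighted KPIs (ARPU = revenue/customers), show the distribution (median, IQR), and include volume context (n).
📝 Explanation
Without a denominator and sample size the figure misleads. Present per-customer metrics, robust statistics, and sample size together to avoid the "big-deal illusion".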
Q9 — Categorical vs Quantitative Encoding
Logistics tracks “delivery window” as Morning/Afternoon/Evening. Analyst encoded Morning=1, Afternoon=2, Evening=3 and ran linear regression.
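📖 Answer
That imposes false ordinality and equal spacing. Use one-hot/dummy variables, or treat the variable as ordinal only if the domain justifies monotone spacing.
📝 Explanation
Assigning arbitrary numbers to a categorical variable introduces a spurious order and an equal-spacing assumption. Default to dummy variables; if the business clearly holds that "later is worse" and adjacent gaps are roughly equal, an ordinal model can be considered.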
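One way to sketch the dummy-variable encoding in plain Python. The helper name `dummy_encode` and the choice of "Morning" as the reference level are assumptions made for illustration; a real pipeline would more likely use a library encoder.

```python
def dummy_encode(values, baseline):
    """One-hot encode a nominal variable, dropping `baseline` as the reference level."""
    levels = sorted(set(values) - {baseline})
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical delivery windows for four orders.
windows = ["Morning", "Evening", "Afternoon", "Morning"]
encoded = dummy_encode(windows, baseline="Morning")
```

Each non-baseline window gets its own 0/1 column, so a regression estimates a separate effect per window instead of forcing Evening to be "three times" Morning.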
Q10 — Population vs Sample Framing
A churn analysis uses all current customers (200k) and calls it “population,” concluding p-values are unnecessary.
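📖 Answer
It is the population at this point in time, but inference targets future cohorts, so uncertainty remains. Keep out-of-sample validation and intervals.
📝 Explanation
Describing the current population ≠ inferring about the future. Prediction still requires measuring generalization uncertainty (cross-validation/hold-out set, confidence or prediction intervals).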
Q11 — KPI Manipulation Risk
A plant is rated by “defect rate per inspected unit.” They start inspecting fewer units.
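📖 Answer
Metric gaming. Redesign the KPI as defects per produced unit and add audit sampling.
📝 Explanation
A metric should be aligned with the goal and leave little room for strategic manipulation; process audits can serve as an external constraint.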
Q12 — Selecting Visualization for Stakeholders
Legal team wants to compare complaint categories across two years and highlight which categories grew the most in share.
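📖 Answer
Use stacked 100% bar charts (share), plus delta labels or a dumbbell chart of shares.
📝 Explanation
The focus is the change in share composition → 100% stacked bars allow direct comparison; for exact differences, a dumbbell chart shows each category's share in both years and the gap between them.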
Q13 — Data Source Trade-offs
You can purchase a costly consumer panel dataset or scrape social media mentions for free. The research question is “average monthly spend by life-stage.”
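📖 Answer
Buy the panel (structured, spending ground truth). Social mentions lack reliable denominators and a sampling frame; use them for qualitative signals only.
📝 Explanation
The research goal is to quantify a mean, which requires representativeness and transaction-based measurement; social media data has sampling bias and an unobservable denominator, so it is better used as exploratory supporting evidence.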
Q14 — Construct Validity Check
A “financial literacy score” sums correct answers (0–10). Marketing correlates it with “investment assets ($).” They claim assets cause higher literacy.
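📖 Answer
Beware reverse causality and confounding (education, age). Treat the result as an association; propose an instrumental variable or a longitudinal design.
📝 Explanation
Correlation ≠ causation, and assets and literacy may influence each other; instrumental variables or a longitudinal study improve causal identification.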
Q15 — Unit Consistency in Data Warehouse
Sales uploads prices sometimes in USD, sometimes in CNY, same column name.
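📖 Answer
Enforce unit metadata at ingestion, add a currency field + FX normalization layer, and block loads with mixed units per batch.
📝 Explanation
Units are key data-quality metadata; validation at the source and unified conversion are both needed to avoid bias in downstream analysis.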
Q16 — Choosing Central Tendency for SLA
IT commits to a “typical ticket resolution time” SLA. Resolution times are heavy-tailed.
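📖 Answer
Use a median SLA (e.g., “50% within 4h”) + percentile targets (e.g., P90 within 24h).
📝 Explanation
Under a heavy-tailed distribution the mean is dragged out by extreme cases; quantile-based commitments are closer to what users actually experience and to what the business can achieve.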
Q17 — Designing a Pilot with Ethical Bounds
To test a pricing change, you consider randomizing prices across customers. Legal warns about fairness.
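📖 Answer
Use geo-level or time-window randomization with pre-declared bounds and oversight; monitor harm metrics; add opt-out where applicable.
📝 Explanation
Experiment within an ethical framework: stratified/geographic randomization, a cap on price variation, continuous monitoring of adverse effects, and an opt-out mechanism.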
Q18 — From Descriptive to Predictive
Analytics team produced great dashboards but no actions. Propose one path from description → prediction → prescription.
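📖 Answer
Define a decision (e.g., churn retention offer), build a predictive model (churn risk), and implement a policy (offer rules) with A/B testing.
📝 Explanation
Description → prediction → prescription: first define an actionable decision, then use a model to rank targets, and finally use an experiment to verify the policy's benefits and side effects.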
Q19 — Handling Big Data Overfitting
A rich feature set yields an AUC of 0.95 on training but 0.68 on test.
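📖 Answer
Overfitting. Apply regularization, feature pruning, nested CV, and keep a hold-out for final validation.
📝 Explanation
Training performance that is too good combined with poor generalization → penalize complexity, reduce redundant features, and use a more rigorous validation scheme to control optimism bias.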
Q20 — Communicating Statistical Inference to Execs
Your “battery life improvement” test shows p=0.03 for +5% mean increase. Exec asks: “So we’re 97% sure it’s better?”
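📖 Answer
No. p=0.03 is not the probability that the hypothesis is true. It is the probability of seeing data at least this extreme if there were no improvement. Report effect size + CI and practical impact.
📝 Explanation
This corrects a common misconception: a p-value is not the probability that a hypothesis is true. It is better to present the effect size/confidence interval and the business meaning (e.g., +5% battery life ≈ 30 more minutes).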
Q1 — Binning Choice and Business Impact
Your app’s session length (minutes) is right-skewed. A PM proposes 5 equal-width bins for the homepage dashboard. Data team worries about misleading “most users in 0–5” messaging.
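📖 Answer
Use quantile (equal-count) binning or report median + P90; accompany with a density plot/histogram on a log scale.
📝 Explanation
With right skew, equal-width bins pack a large share of short sessions into the first bin and overstate "most sessions are ultra-short". Equal-count bins or a log axis are more robust; communicate "typical experience and upper tail" with median + P90.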
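The contrast between the two binning schemes can be sketched as follows. The session lengths are hypothetical, and the helper names `equal_width_counts` and `quantile_edges` are illustrative rather than part of any dashboard tool.

```python
import statistics

def equal_width_counts(values, n_bins):
    """Count observations in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # put the max in the last bin
        counts[idx] += 1
    return counts

def quantile_edges(values, n_bins):
    """Interior cut points that split the data into n_bins equal-count groups."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")

# Hypothetical right-skewed session lengths in minutes.
sessions = [1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 12, 15, 20, 30, 45, 60, 90, 120, 240]
width_counts = equal_width_counts(sessions, 5)
edges = quantile_edges(sessions, 5)
median_minutes = statistics.median(sessions)
```

On this made-up sample, 16 of 20 sessions fall into the first equal-width bin, while the quantile cut points spread the same sessions across five groups of comparable size.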
Q2 — Outlier Policy vs Root Cause
Defect times show a few 10× spikes. Ops asks to winsorize at P99 and move on. QA asks to trace root cause first.
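📖 Answer
Root cause first (trace batch/tool/operator), then choose context-aware handling: tag as special-cause and analyze separately; winsorize only for robust summary reporting.
📝 Explanation
Outliers may represent correctable special causes, and trimming them directly would mask quality problems. Distinguish common cause from special cause, prioritize operational improvement, and only then apply robust handling in statistical summaries.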
Q3 — Choosing the “Typical” Basket
Grocery wants a “typical basket size” KPI for store ranking. Distribution is bimodal (quick trips vs stock-up).
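📖 Answer
Report two medians or mixture-aware stats (e.g., medians by mission type) rather than a single overall mean.
📝 Explanation
The two modes reflect two shopper missions; a single measure of central tendency misleads. Summaries by segment are more actionable (replenishment, staffing).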
Q4 — Boxplot vs Violin for Execs
For payroll equity, you must show pay level differences across regions to a non-technical board.
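📖 Answer
Use boxplots with notches (median CI) and annotated IQR; keep violins optional in an appendix.
📝 Explanation
Boxplots show the median, quartiles, and outliers directly; they are robust and carry a low cognitive cost. Mention that the notches allow a rough comparison of whether medians differ significantly.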
Q5 — Missing Data Mechanism Test
Credit limit has 7% missing; exploratory plots show missingness higher among low-income segments.
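📖 Answer
Likely MAR (missing at random conditional on income), although MNAR cannot be ruled out from the observed data alone. Visualize missingness indicators vs covariates; impute with stratified/model-based methods.
📝 Explanation
Missingness related to observed variables → missing at random (conditional). First explore with missingness plots/heatmaps and a logit model of missingness, then impute by stratum or by model, keeping a missingness indicator variable for downstream models.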
Q6 — Simpson’s Paradox Alarm
Company-wide, conversion A<B; within each channel, A>B. Marketing asks which variant to ship.
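📖 Answer
Ship A, but stratify the rollout by channel mix; report weighted overall and by-channel effects.
📝 Explanation
This is Simpson's paradox: the aggregate weights (channel shares) confound the comparison. Decide by channel stratum and disclose the weighting basis transparently.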
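A small worked example, with invented conversion counts for two hypothetical channels, shows how A can win inside every channel yet lose in the pooled total, and how weighting both variants by the same channel mix removes the reversal.

```python
# Hypothetical (conversions, visitors) per channel and variant.
data = {
    "email": {"A": (90, 100), "B": (850, 1000)},
    "ads": {"A": (300, 1000), "B": (25, 100)},
}

def channel_rate(data, channel, variant):
    conv, n = data[channel][variant]
    return conv / n

def pooled_rate(data, variant):
    """Naive company-wide rate: total conversions over total visitors."""
    conv = sum(ch[variant][0] for ch in data.values())
    n = sum(ch[variant][1] for ch in data.values())
    return conv / n

def standardized_rate(data, variant):
    """Weight each channel's rate by that channel's share of all traffic (A + B)."""
    total = sum(ch["A"][1] + ch["B"][1] for ch in data.values())
    return sum(
        (ch["A"][1] + ch["B"][1]) / total * (ch[variant][0] / ch[variant][1])
        for ch in data.values()
    )
```

Here A was mostly exposed to the low-converting channel, so its pooled rate looks worse; once both variants are evaluated on the same channel mix, A comes out ahead, in line with the per-channel results.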
Q7 — Pareto and Long Tail Actions
SKU analysis shows that 25% of SKUs contribute 88% of GMV; the tail has thousands of items with sporadic sales.
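📖 Answer
Apply ABC segmentation (A≈88% GMV), tighten replenishment on A, test assortment pruning on C with guardrails (seasonal/strategic SKUs).
📝 Explanation
The 80/20 (Pareto) pattern guides inventory and display priorities; a long tail does not mean cutting everything, and exemptions based on seasonality/brand strategy are needed.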
Q8 — Detecting Seasonality in EDA
Daily orders show weekly cycles and holiday spikes. Manager wants a simple seasonal index.
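📖 Answer
Use multiplicative seasonal indices (weekday factors), computed via ratio-to-moving-average; flag holidays as special events.
📝 Explanation
First remove the trend with a moving average, then take ratios by day of week to obtain the seasonal indices; holidays are anomalous events and are labelled separately so they do not contaminate the regular indices.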
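The ratio-to-moving-average steps can be sketched as below. This is a simplified illustration: it assumes day 0 is a Monday, uses a centered 7-day window, and only skips flagged holidays as targets (their values still enter neighboring windows). The function name and the order counts are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

def weekday_indices(orders, holidays=frozenset()):
    """Multiplicative weekday indices from daily orders.

    orders   : list of daily order counts, index 0 assumed to be a Monday
    holidays : day indices flagged as special events and excluded from the ratios
    """
    ratios = defaultdict(list)
    for t in range(3, len(orders) - 3):
        if t in holidays:
            continue
        moving_avg = fmean(orders[t - 3 : t + 4])  # centered 7-day window
        ratios[t % 7].append(orders[t] / moving_avg)
    raw = {day: fmean(vals) for day, vals in ratios.items()}
    scale = fmean(list(raw.values()))  # normalize so the indices average to 1
    return {day: raw[day] / scale for day in sorted(raw)}

# Hypothetical four weeks: 100 orders on weekdays, 150 on weekends, no trend.
orders = [100, 100, 100, 100, 100, 150, 150] * 4
indices = weekday_indices(orders)
```

An index above 1 marks a weekday that runs above the local average; dividing daily orders by the matching index gives a de-seasonalized series.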
Q9 — Correlation Trap
Scatter of ad spend vs revenue shows r=0.78. After de-seasonalization, r drops to 0.22.
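📖 Answer
Seasonality confounding. Detrend/de-seasonalize time series before correlating them; use partial correlation controlling for time.
📝 Explanation
Shared seasonality produces spurious correlation. De-seasonalizing/detrending, or using partial correlation, reflects the marginal effect of advertising more faithfully.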
Q10 — Feature Leakage During EDA
Churn label is defined using “no purchase in next 90 days.” Analyst plots features including “days until next purchase.”
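📖 Answer
That’s leakage. Restrict EDA to information available at decision time; separate t−window features from t+ outcomes.
📝 Explanation
Any feature that uses future information inflates predictive power and misleads insight. Splitting by time is a basic discipline of EDA.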
Q11 — Choosing a Robust Scale for Plot
Delivery times have extreme outliers. Histogram hides body. What axis/transform?
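📖 Answer
Plot on a log scale or use a trimmed axis with clear disclosure; consider an ECDF for the full distribution.
📝 Explanation
A log axis compresses the upper tail so the body of the distribution is visible; alternatively, mark the axis as trimmed and attach the cumulative distribution to avoid misleading readers.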
Q12 — Categorical EDA With Many Levels
There are 600 SKU categories. Bar charts become unreadable.
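📖 Answer
Lump rare levels into “Other,” show top-N with a cumulative share line; provide a searchable table in an appendix.
📝 Explanation
Sort by frequency/amount first, keep the top N and merge rare levels; pair the chart with cumulative share to convey coverage, and put the detail in a table.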
Q13 — Multivariate Outliers
3-var dataset (price, promo depth, demand) shows no univariate outliers, but odd triplets exist.
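📖 Answer
Use a scatter-matrix + robust Mahalanobis distance; tag multivariate outliers for review rather than dropping them blindly.
📝 Explanation
Multivariate anomalies have to be identified through the covariance structure; robust Mahalanobis distance can find "combination anomalies".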
Q14 — Small-n Visualization
Clinic pilot with 18 patients; outcome is binary. What plot?
📖 Answer
Dot plots with exact binomial CIs by subgroup; avoid noisy histograms.
📝 Explanation
Small samples call for discrete points and exact intervals, avoiding the false continuity of a histogram.
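A sketch of an exact (Clopper-Pearson) binomial interval using only the standard library, solving for the bounds by bisection. The function names are illustrative, and any subgroup counts passed in (for example 7 responders out of 10) would be hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve_p(k, n, target):
    """Find p with binom_cdf(k, n, p) == target by bisection (the CDF falls as p rises)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2)
    return lower, upper
```

Each subgroup's interval can then be drawn as an error bar around its dot; with n this small the bars will be wide, which is the honest picture.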
Q15 — Ranking Stores Fairly
Raw conversion rates rank small stores higher due to variance. How to rank?
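📖 Answer
Use empirical-Bayes shrinkage or Wilson intervals; rank by the lower bound of the interval to reward confidence.
📝 Explanation
Small samples, high volatility → shrinkage estimates are fairer; ranking by the lower bound of the interval keeps "lucky winners" off the top of the table.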
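The Wilson lower bound has a closed form and can be sketched directly. The store figures in the usage lines are hypothetical; z=1.96 corresponds to an approximate 95% interval.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

# Hypothetical stores: a small store at 60% raw conversion, a large one at 50%.
small_store = wilson_lower_bound(6, 10)
large_store = wilson_lower_bound(500, 1000)
```

Despite its higher raw rate, the small store's lower bound falls well below the large store's, so the large store ranks first.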
Q16 — EDA for A/B Pre-Check
Before launching an experiment, you must check balance between treatment/control historical covariates.
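📖 Answer
Use standardized mean difference (SMD) plots across covariates; ensure |SMD|<0.1.
📝 Explanation
Looking only at p-values is sensitive to sample size; SMD is a balance measure that does not depend on sample size, with 0.1 as a commonly used threshold.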
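A minimal sketch of the SMD for one continuous covariate, using the common pooled-standard-deviation form (square root of the average of the two group variances). The covariate values are hypothetical, and the function assumes each group has at least two observations and non-zero spread.

```python
import math
from statistics import fmean, variance

def smd(treated, control):
    """Standardized mean difference with pooled SD = sqrt((var_t + var_c) / 2)."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (fmean(treated) - fmean(control)) / pooled_sd

# Hypothetical pre-experiment covariates for two small groups.
age_smd = smd([30, 35, 40, 45, 50], [32, 36, 41, 46, 50])
tenure_smd = smd([12, 24, 36, 48, 60], [13, 24, 36, 48, 59])
```

In this made-up example age exceeds the 0.1 threshold in absolute value and would be flagged for re-randomization or adjustment, while tenure is balanced.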
Q17 — Time Granularity Choice
Fraud signals arrive with second-level timestamps; the business reviews monthly. At which granularity should EDA be done?
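📖 Answer
Explore at event/second level to catch patterns, then roll up to daily/weekly for communication; keep drill-down links.
📝 Explanation
First look for patterns at the granularity at which the signals occur, then aggregate to match the management cadence, keeping a traceable path back to the detail.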
Q18 — Reference Class Forecasting
CFO asks for expected cost of a new data center. Prior projects show optimistic bias.
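📖 Answer
Build a reference class distribution (similar past builds), report P50/P80 cost from that empirical distribution; adjust the current estimate toward the outside view.
📝 Explanation
Reference class forecasting uses historical outcomes to correct the planning fallacy and gives risk-informed budget points (e.g., P80 as a conservative budget).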
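One way to turn a reference class into P50/P80 budget points, assuming past outcomes are recorded as the ratio of actual cost to initial estimate. The ratios, the estimate of 100 (in any currency unit), and the function name are all hypothetical.

```python
from statistics import quantiles

def reference_class_budget(overrun_ratios, estimate):
    """Scale the inside-view estimate by empirical P50 and P80 overrun ratios.

    overrun_ratios : actual cost / initial estimate for comparable past projects
    estimate       : current inside-view cost estimate
    """
    deciles = quantiles(overrun_ratios, n=10, method="inclusive")
    p50, p80 = deciles[4], deciles[7]
    return {"P50": estimate * p50, "P80": estimate * p80}

# Hypothetical overrun ratios from eleven comparable past builds.
past_ratios = [0.95, 1.00, 1.05, 1.10, 1.15, 1.20, 1.30, 1.40, 1.50, 1.70, 2.00]
budget = reference_class_budget(past_ratios, estimate=100.0)
```

With these invented ratios the median past project ran about 20% over and the 80th percentile about 50% over, so the outside view suggests budgeting well above the raw estimate.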
Q19 — Label Noise Check
Manual labels for complaint types disagree 12% between raters.
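📖 Answer
Compute Cohen’s κ; if agreement is only moderate, refine the taxonomy & guidelines, run adjudication, then relabel a stratified sample.
📝 Explanation
First quantify agreement, then raise κ through clearer label definitions and an example library; run review and adjudication on the key levels.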
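Cohen's κ for two raters compares observed agreement with the agreement expected by chance from each rater's label frequencies. A short sketch follows; the complaint labels are hypothetical, and the function assumes chance agreement is below 1 (otherwise κ is undefined).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal shares per label.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels for ten complaints; the raters disagree on one item.
rater_a = ["billing"] * 5 + ["service"] * 5
rater_b = ["billing"] * 4 + ["service"] * 6
kappa = cohens_kappa(rater_a, rater_b)
```

Raw agreement here is 90%, but κ is lower because half of that agreement would be expected by chance with two roughly balanced labels.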
Q20 — Communicating EDA Limits
Exec wants “definitive drivers” from EDA slides tomorrow.
📖 Answer
Clarify that EDA is for pattern discovery & hypothesis generation; propose a plan for confirmatory analysis/experiments.
📝 Explanation
EDA does not provide final causal conclusions; follow-up validation (modeling/experiments) is needed to support decisions.