Quiz 1.2 (Optional)

Lecture 1 — Data & Statistics (20 Advanced Application Questions)

Q1 — Choosing Data Type for a KPI Redesign

A bank wants to track “customer relationship depth” for cross-selling. They propose to code it as: 0=None, 1=Single product, 2=Two products, 3=Three or more. The data team must pick the proper scale of measurement and explain what analyses are (not) appropriate.

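📖 Click to view answer

Ordinal scale. Allowed: medians, percentiles, nonparametric tests, bar charts. Not appropriate: operations requiring equal intervals or meaningful ratios (e.g., arithmetic mean comparisons may be misleading; ratios are invalid).

📝 Click to view explanation

The variable is an ordered ranking (0<1<2<3), but the gaps between adjacent levels are not guaranteed to be equal, so it is on an ordinal scale. You can compare order and position (median, quantiles), but it should not be treated as interval data for mean differences or for a linear reading of regression coefficients, and ratio statements such as “3 is three times 1” are invalid.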

Q2 — Cross-Section vs Time Series Decision

Your CFO asks whether to build a cross-section or time-series dataset to study drivers of late payments. You have one month to advise credit policy next quarter.

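📖 Click to view answer

Start with cross-section (current customers, their attributes, and latest delinquency status) for quick policy drivers; complement with a short panel slice if feasible (recent 6–12 months) to capture dynamics.

📝 Click to view explanation

Tight deadline → a cross-section gives the fastest read on which characteristics are associated with late payment right now. If the recent evolution of account aging can be added to form a short panel, the direction of causality and lagged effects can be identified more robustly. A pure time series is better suited to forecasting trends in aggregate rates (e.g., the overall delinquency rate) than to insight into individual-level characteristics.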

Q3 — Avoiding Unethical Graphs

A marketing slide shows a bar chart with a truncated y-axis from 90 to 100, making tiny increases look huge. You must critique and fix it.

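📖 Click to view answer

Critique: misleading scaling and visual exaggeration. Fix: start y-axis at 0, show data labels/CI, keep consistent intervals, add clear title/units.

📝 Click to view explanation

Ethical guidelines stress avoiding misleading presentation. A truncated axis exaggerates differences; start from zero, label units and uncertainty (e.g., confidence intervals), and keep scales and colors consistent.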

Q4 — Sampling: Observational vs Experimental

HR wonders if a new interview rubric improves hiring quality. Budget allows only historical data analysis for now.

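📖 Click to view answer

This will be an observational study using past hires; be cautious about confounding. Recommend a phased A/B experimental rollout later (randomized rubric assignment) for causal evidence.

📝 Click to view explanation

Historical data alone cannot control for confounders → correlation is not causation. Start with an observational assessment (baseline), then design a randomized controlled experiment to identify the causal effect.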

Q5 — Variable Design: Ratio or Interval?

A retailer tracks app session length (minutes) and Net Promoter Score (−100 to 100). Classify each scale and one valid transformation per variable.

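📖 Click to view answer

Session length: Ratio (true zero, meaningful ratios). Valid: log transform for skew. NPS: Interval-like index (arbitrary zero). Valid: standardization (z-score) for comparisons.

📝 Click to view explanation

Duration has a true zero, so multiples can be compared. A zero NPS does not mean “no satisfaction”; it behaves more like an interval measure, suited to centering/standardization but not to ratio comparisons.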

Q6 — Data Errors and Controls

Finance found negative inventory values after an ERP migration. Propose two pre-processing controls and one post-migration audit.

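📖 Click to view answer

Pre-controls: (1) schema & unit mapping checklist; (2) constraint rules (non-negative, integer where needed). Post-audit: stratified sample reconciliation vs source docs.

📝 Click to view explanation

Before migration, align fields and units and set integrity constraints; after migration, reconcile a stratified sample against source documents to locate where rule/mapping errors cluster.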

Q7 — Choosing Summary Statistics

CEO wants a single “typical” salary number for PR. Pay is right-skewed with a few star earners.

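📖 Click to view answer

Use median (robust to outliers) and accompany with IQR; avoid using the mean alone.

📝 Click to view explanation

In a right-skewed distribution the mean is pulled up by extreme values; median + IQR gives a truer picture of the “typical level.”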
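As a rough illustration of why the median and IQR hold up better than the mean here, the sketch below uses a made-up salary list (in thousands) with two star earners. The figures are hypothetical, and the "inclusive" quartile method is just one of several common conventions.

```python
import statistics

# Hypothetical right-skewed salaries (in thousands); the last two are "star earners".
salaries = [42, 45, 48, 50, 52, 55, 58, 60, 250, 400]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

print(f"mean={mean_salary:.1f}, median={median_salary:.1f}, IQR={iqr:.1f}")
```

With these numbers the mean lands at roughly twice the median, which is the distortion the answer warns about.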

Q8 — Dashboard KPI Integrity

A dashboard shows weekly revenue mean without customer count. It spikes when low-volume weeks include one large deal. How to stabilize?

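📖 Click to view answer

Add weighted KPIs (ARPU = revenue/customers), show distribution (median, IQR), and include volume context (n).

📝 Click to view explanation

Without a denominator and a sample size, the metric misleads. Present per-customer metrics, robust statistics, and sample size together to avoid the “big-deal illusion.”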

Q9 — Categorical vs Quantitative Encoding

Logistics tracks “delivery window” as Morning/Afternoon/Evening. Analyst encoded Morning=1, Afternoon=2, Evening=3 and ran linear regression.

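📖 Click to view answer

That imposes false ordinality. Use one-hot/dummy variables or treat as ordinal only if domain justifies monotone spacing.

📝 Click to view explanation

Assigning arbitrary numbers to a categorical variable introduces a spurious order and an equal-spacing assumption. Default to dummy variables; if the business clearly holds that “later is worse” and adjacent gaps are roughly equal, an ordinal model can be considered.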
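A minimal sketch of dummy coding for the delivery window, written in plain Python so the mechanics are visible. The sample records and the choice of Morning as the reference level are assumptions made for illustration.

```python
# Delivery windows are nominal labels; these records are made up for illustration.
windows = ["Morning", "Evening", "Afternoon", "Morning"]

# Morning is the (assumed) reference level, so it is represented by all zeros.
levels = ["Afternoon", "Evening"]


def dummy_encode(value, levels):
    """Return one 0/1 indicator per non-reference level."""
    return [1 if value == level else 0 for level in levels]


encoded = [dummy_encode(w, levels) for w in windows]
print(encoded)
```

Each non-reference window gets its own coefficient in a regression, so no spacing between Morning, Afternoon, and Evening is implied.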

Q10 — Population vs Sample Framing

A churn analysis uses all current customers (200k) and calls it “population,” concluding p-values are unnecessary.

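📖 Click to view answer

It’s the population at this time, but inference targets future cohorts; uncertainty remains. Keep out-of-sample validation and intervals.

📝 Click to view explanation

Describing the current population ≠ inferring the future. Prediction still requires quantifying generalization uncertainty (cross-validation/hold-out set, confidence intervals or prediction intervals).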

Q11 — KPI Manipulation Risk

A plant is rated by “defect rate per inspected unit.” They start inspecting fewer units.

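📖 Click to view answer

Metric gaming. Redesign KPI to defects per produced unit and add audit sampling.

📝 Click to view explanation

A metric should align with the goal and leave little room for strategic manipulation; process audits can serve as an external constraint.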

Q12 — Selecting Visualization for Stakeholders

Legal team wants to compare complaint categories across two years and highlight which categories grew the most in share.

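📖 Click to view answer

Use stacked 100% bar charts (share), plus delta labels or a dumbbell chart of shares.

📝 Click to view explanation

The focus is on change in composition share → 100% stacked bars allow a direct comparison; for precise differences, use a dumbbell chart showing each year’s share and the delta.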

Q13 — Data Source Trade-offs

You can purchase a costly consumer panel dataset or scrape social media mentions for free. The research question is “average monthly spend by life-stage.”

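📖 Click to view answer

Buy the panel (structured, spending ground truth). Social mentions lack reliable denominators and sampling frame; use for qualitative signals only.

📝 Click to view explanation

The goal is to quantify a mean, which requires representativeness and transaction-based measurement; social media data suffers from sampling bias and an unobservable denominator, so it is better used as exploratory supporting evidence.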

Q14 — Construct Validity Check

A “financial literacy score” sums correct answers (0–10). Marketing correlates it with “investment assets ($).” They claim assets cause higher literacy.

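📖 Click to view answer

Beware reverse causality and confounding (education, age). Treat as association; propose instrumental variable or longitudinal design.

📝 Click to view explanation

Correlation ≠ causation, and assets and literacy may influence each other; instrumental variables or a longitudinal study improve causal identification.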

Q15 — Unit Consistency in Data Warehouse

Sales uploads prices sometimes in USD, sometimes in CNY, same column name.

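📖 Click to view answer

Enforce unit metadata at ingestion, add currency field + FX normalization layer, and block loads with mixed units per batch.

📝 Click to view explanation

Units are key data-quality metadata; guard both ends, with validation at the source and unified conversion, to avoid bias in downstream analysis.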
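One way the ingestion check and FX normalization layer could look is sketched below. The row layout (`price` and `currency` keys), the function name `normalize_batch`, and the FX rates are all hypothetical; a real pipeline would load dated rates from a trusted source.

```python
# Hypothetical FX rates to USD, hard-coded only for illustration.
FX_TO_USD = {"USD": 1.0, "CNY": 0.14}


def normalize_batch(rows):
    """Validate one upload batch and return prices converted to USD.

    Each row is assumed to be a dict with 'price' and 'currency' keys.
    """
    currencies = {row.get("currency") for row in rows}
    # Block empty batches and batches that mix currencies or omit the field.
    if len(currencies) != 1:
        raise ValueError(f"mixed or missing currencies: {sorted(map(str, currencies))}")
    currency = currencies.pop()
    if currency not in FX_TO_USD:
        raise ValueError(f"unknown currency: {currency!r}")
    rate = FX_TO_USD[currency]
    return [
        {"price_usd": round(row["price"] * rate, 2), "source_currency": currency}
        for row in rows
    ]


clean = normalize_batch([{"price": 100, "currency": "CNY"}])
print(clean)
```

Keeping the source currency on each output row preserves the audit trail after conversion.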

Q16 — Choosing Central Tendency for SLA

IT commits to a “typical ticket resolution time” SLA. Resolution times are heavy-tailed.

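📖 Click to view answer

Use median SLA (e.g., “50% within 4h”) + percentile targets (e.g., P90 within 24h).

📝 Click to view explanation

Under a heavy-tailed distribution the mean is stretched by extreme values; percentile-based targets are closer to what users actually experience and to what the business can deliver.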

Q17 — Designing a Pilot with Ethical Bounds

To test a pricing change, you consider randomizing prices across customers. Legal warns about fairness.

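📖 Click to view answer

Use geo-level or time-window randomization with pre-declared bounds and oversight; monitor harm metrics; add opt-out where applicable.

📝 Click to view explanation

Run the experiment within an ethical framework: stratified/geo randomization, a cap on price variation, continuous monitoring of adverse impact, and an opt-out mechanism.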

Q18 — From Descriptive to Predictive

Analytics team produced great dashboards but no actions. Propose one path from description → prediction → prescription.

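📖 Click to view answer

Define a decision (e.g., churn retention offer), build a predictive model (churn risk), and implement a policy (offer rules) with A/B testing.

📝 Click to view explanation

Description → prediction → prescription: first define an actionable decision, then use a model to rank targets, and finally use experiments to verify the policy’s benefits and side effects.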

Q19 — Handling Big Data Overfitting

A rich feature set yields an AUC of 0.95 on training but 0.68 on test.

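📖 Click to view answer

Overfitting. Apply regularization, feature pruning, nested CV, and keep a hold-out for final validation.

📝 Click to view explanation

Training performance that is too good combined with poor generalization → penalize complexity, reduce redundant features, and use a more rigorous validation scheme to control optimistic bias.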

Q20 — Communicating Statistical Inference to Execs

Your “battery life improvement” test shows p=0.03 for +5% mean increase. Exec asks: “So we’re 97% sure it’s better?”

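📖 Click to view answer

No. p=0.03 is not the probability that the hypothesis is true. It is the probability of seeing data at least this extreme if there were no improvement. Report effect size + CI and practical impact.

📝 Click to view explanation

Correct the common misconception: a p-value is not the probability that a hypothesis is true. Present the effect size/confidence interval and the business meaning instead (e.g., +5% battery life ≈ 30 more minutes).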
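To show what "effect size + CI" might look like in practice, here is a sketch that computes a normal-approximation confidence interval for a difference in means from summary statistics. The battery-life figures (hours), the sample sizes, and the function name `mean_diff_ci` are hypothetical and are not calibrated to reproduce the p=0.03 in the question.

```python
import math
from statistics import NormalDist


def mean_diff_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Difference in means (b - a) with a normal-approximation CI."""
    diff = mean_b - mean_a
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical summary stats: control 10.0 h, new design 10.5 h, sd 1.5 h, n=100 each.
diff, lo, hi = mean_diff_ci(10.0, 1.5, 100, 10.5, 1.5, 100)
print(f"gain = {diff:.2f} h, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An interval in hours gives executives a range of plausible gains to weigh against cost, which a bare p-value does not. For small samples a t-based interval would be the more careful choice.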

Q1 — Binning Choice and Business Impact

Your app’s session length (minutes) is right-skewed. A PM proposes 5 equal-width bins for the homepage dashboard. Data team worries about misleading “most users in 0–5” messaging.

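📖 Click to view answer

Use quantile (equal-count) binning or report median + P90; accompany with a density plot or histogram on a log scale.

📝 Click to view explanation

With right skew, equal-width binning crams a large number of short sessions into the first bin, exaggerating the “most sessions are ultra-short” message. Equal-frequency binning or a log axis is more robust; use median + P90 to communicate “typical experience and upper tail.”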
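The contrast between equal-width and equal-count binning can be sketched with a small, made-up list of session lengths. The data and helper names are hypothetical; ties in the data mean the quantile bins come out only approximately equal.

```python
import statistics

# Hypothetical right-skewed session lengths in minutes.
sessions = [1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 15, 25, 40, 90]


def equal_width_counts(values, n_bins):
    """Count values per bin when the range is split into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


def quantile_counts(values, n_bins):
    """Count values per bin when cut points are sample quantiles."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    counts = [0] * n_bins
    for v in values:
        idx = sum(v > c for c in cuts)  # number of cut points strictly below v
        counts[idx] += 1
    return counts


print(equal_width_counts(sessions, 5))
print(quantile_counts(sessions, 5))
```

Equal-width bins put 12 of the 15 sessions in the first bin, while quantile bins spread them far more evenly.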

Q2 — Outlier Policy vs Root Cause

Defect times show a few 10× spikes. Ops asks to winsorize at P99 and move on. QA asks to trace root cause first.

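📖 Click to view answer

Root-cause first (trace batch/tool/operator), then choose context-aware handling: tag as special-cause, analyze separately; only winsorize for robust summary reporting.

📝 Click to view explanation

Outliers may represent correctable special causes, and trimming them directly masks quality problems. Distinguish common causes from special causes, prioritize operational fixes, and only then apply robust handling in statistical summaries.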

Q3 — Choosing the “Typical” Basket

Grocery wants a “typical basket size” KPI for store ranking. Distribution is bimodal (quick trips vs stock-up).

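📖 Click to view answer

Report two medians or mixture-aware stats (e.g., medians by mission type) rather than a single overall mean.

📝 Click to view explanation

The two modes reflect two kinds of shopping mission; a single measure of central tendency misleads. Summarizing by segment is more actionable (replenishment, staffing).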

Q4 — Boxplot vs Violin for Execs

For payroll equity, you must show pay level differences across regions to a non-technical board.

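📖 Click to view answer

Use boxplots with notches (median CI) and annotated IQR; keep violins optional in appendix.

📝 Click to view explanation

Boxplots show the median, quartiles, and outliers directly; they are robust and have a low cognitive cost. Mention that notches allow a rough comparison of whether medians differ significantly.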

Q5 — Missing Data Mechanism Test

Credit limit has 7% missing; exploratory plots show missingness higher among low-income segments.

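📖 Click to view answer

Likely MAR (missing at random conditional on income). Visualize missingness indicators vs covariates; impute with stratified/model-based methods.

📝 Click to view explanation

Missingness is related to observed variables → missing at random, conditional on those variables. First explore with missingness plots/heatmaps and a logit model of missingness, then impute by stratum or by model, keeping a missingness indicator variable for downstream models.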

Q6 — Simpson’s Paradox Alarm

Company-wide, conversion A<B; within each channel, A>B. Marketing asks which variant to ship.

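📖 Click to view answer

Ship A, but stratify rollout by channel mix; report weighted overall and by-channel effects.

📝 Click to view explanation

This is Simpson’s paradox: the aggregate weights (channel mix) confound the comparison. Decide by channel stratum and disclose the weighting basis transparently.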
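A small numeric sketch of how the reversal can arise. The conversion counts and channel names are invented for illustration; the last step applies one common weighting (each channel's total traffic) to both variants.

```python
# Hypothetical (conversions, visitors) per channel and variant.
data = {
    "email": {"A": (90, 100), "B": (800, 1000)},
    "ads": {"A": (200, 1000), "B": (10, 100)},
}

# Conversion rate of each variant within each channel.
by_channel = {
    ch: {v: conv / n for v, (conv, n) in variants.items()}
    for ch, variants in data.items()
}

# Pooled rate per variant, ignoring channel.
overall = {}
for v in ("A", "B"):
    conv = sum(data[ch][v][0] for ch in data)
    n = sum(data[ch][v][1] for ch in data)
    overall[v] = conv / n

# Standardized rate: weight both variants by the same channel mix.
channel_traffic = {ch: sum(n for _, n in variants.values()) for ch, variants in data.items()}
total_traffic = sum(channel_traffic.values())
standardized = {
    v: sum(by_channel[ch][v] * channel_traffic[ch] / total_traffic for ch in data)
    for v in ("A", "B")
}

print(by_channel, overall, standardized)
```

A wins in both channels yet loses in the pooled numbers because most of A's traffic came from the low-converting channel; once both variants share the same channel weights, A comes out ahead again.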

Q7 — Pareto and Long Tail Actions

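SKUs show that 25% contribute 88% of GMV; the tail has thousands of items with sporadic sales.

📖 Click to view answer

Apply ABC segmentation (A≈88% GMV), tighten replenishment on A, test assortment pruning on C with guardrails (seasonal/strategic SKUs).

📝 Click to view explanation

The 80/20 (Pareto) pattern guides inventory and display priorities; a long tail does not mean cutting everything, so set exemptions based on seasonality and brand strategy.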
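ABC segmentation by cumulative GMV share might be sketched as follows. The 80% and 95% cut-offs are a common convention assumed here, not values from the quiz (which puts the A class at about 88% of GMV), and the SKU figures are made up.

```python
def abc_segments(gmv_by_sku, a_cut=0.80, b_cut=0.95):
    """Label SKUs A/B/C by cumulative share of GMV, largest sellers first."""
    total = sum(gmv_by_sku.values())
    ranked = sorted(gmv_by_sku.items(), key=lambda kv: kv[1], reverse=True)
    segments, cumulative = {}, 0.0
    for sku, gmv in ranked:
        cumulative += gmv
        share = cumulative / total
        segments[sku] = "A" if share <= a_cut else "B" if share <= b_cut else "C"
    return segments


# Hypothetical GMV per SKU.
gmv = {"s1": 500, "s2": 250, "s3": 120, "s4": 60, "s5": 40, "s6": 20, "s7": 10}
segments = abc_segments(gmv)
print(segments)
```

Seasonal or strategic SKUs would need an explicit exemption list applied on top of these labels before any pruning test.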

Q8 — Detecting Seasonality in EDA

Daily orders show weekly cycles and holiday spikes. Manager wants a simple seasonal index.

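📖 Click to view answer

Use multiplicative seasonal indices (weekday factors), compute via ratio-to-moving-average; flag holidays as special events.

📝 Click to view explanation

First remove the trend with a moving average, then take ratios by day of week to obtain the seasonal indices; holidays are abnormal events and are tagged separately so they do not contaminate the regular indices.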
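The ratio-to-moving-average steps above can be sketched for a weekly cycle as below. The sketch assumes the series starts on the first day of the week, that holidays have already been flagged and removed, and that the order counts are hypothetical.

```python
from statistics import mean


def weekday_indices(daily_orders, period=7):
    """Multiplicative day-of-week indices via ratio to a centered moving average."""
    half = period // 2
    ratios = {d: [] for d in range(period)}
    for t in range(half, len(daily_orders) - half):
        window = daily_orders[t - half : t + half + 1]  # centered 7-day window
        ratios[t % period].append(daily_orders[t] / mean(window))
    raw = [mean(ratios[d]) for d in range(period)]
    scale = period / sum(raw)  # normalize so the indices average to 1
    return [r * scale for r in raw]


# Three hypothetical weeks with an identical weekly pattern (day 0 = first weekday).
orders = [100, 90, 95, 105, 120, 150, 140] * 3
indices = weekday_indices(orders)
print([round(i, 3) for i in indices])
```

An index above 1 marks a day that runs above the local average; dividing actuals by the index gives a de-seasonalized series.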

Q9 — Correlation Trap

Scatter of ad spend vs revenue shows r=0.78. After de-seasonalization, r drops to 0.22.

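📖 Click to view answer

Seasonality confounding. Always detrend/de-seasonalize before correlating; use partial correlation controlling for time.

📝 Click to view explanation

Shared seasonality produces spurious correlation. De-seasonalizing/detrending or using partial correlation gives a truer picture of advertising’s marginal effect.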

Q10 — Feature Leakage During EDA

Churn label is defined using “no purchase in next 90 days.” Analyst plots features including “days since next purchase.”

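📖 Click to view answer

That’s leakage. Restrict EDA to information available at decision time; separate t−window features from t+ outcomes.

📝 Click to view explanation

Any feature that uses future information inflates predictive power and misleads insight. Temporal splitting is a basic discipline of EDA.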

Q11 — Choosing a Robust Scale for Plot

Delivery times have extreme outliers. Histogram hides body. What axis/transform?

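📖 Click to view answer

Plot on log scale or use trimmed axis with clear disclosure; consider ECDF for full distribution.

📝 Click to view explanation

A log axis compresses the upper tail so the body becomes clear; alternatively, label the axis as trimmed and attach the cumulative distribution to avoid misleading readers.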

Q12 — Categorical EDA With Many Levels

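There are 600 SKU categories. Bar charts become unreadable.

📖 Click to view answer

Lump rare levels into “Other,” show top-N with cumulative share line; provide searchable table in appendix.

📝 Click to view explanation

Sort by frequency/amount first, limit to the top N, and merge rare levels; pair with cumulative share to convey coverage, and put the details in a table.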

Q13 — Multivariate Outliers

3-var dataset (price, promo depth, demand) shows no univariate outliers, but odd triplets exist.

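📖 Click to view answer

Use scatter-matrix + robust Mahalanobis distance; tag multivariate outliers for review rather than drop blindly.

📝 Click to view explanation

Multivariate anomalies can only be identified through the covariance structure; robust Mahalanobis distance can detect “combination anomalies.”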

Q14 — Small-n Visualization

Clinic pilot with 18 patients; outcome is binary. What plot?

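📖 Click to view answer

Dot plots with exact binomial CIs by subgroup; avoid noisy histograms.

📝 Click to view explanation

Small samples call for discrete points and exact intervals, avoiding the false continuity of a histogram.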

Q15 — Ranking Stores Fairly

Raw conversion rates rank small stores higher due to variance. How to rank?

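📖 Click to view answer

Use empirical-Bayes shrinkage or Wilson intervals; rank by lower-bound of interval to reward confidence.

📝 Click to view explanation

Small samples, high volatility → shrinkage estimates are fairer; rank by the lower bound of the interval to keep “lucky winners” from topping the list.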
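Ranking by the Wilson lower bound could look like the sketch below. The store counts are hypothetical, and the 95% level is an assumed default.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(successes, n, level=0.95):
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom


# Hypothetical stores: (conversions, visitors).
stores = {"small": (9, 10), "large": (800, 1000)}
ranking = sorted(stores, key=lambda s: wilson_lower_bound(*stores[s]), reverse=True)
print(ranking)
```

The small store has the higher raw rate (90% vs 80%), but its lower bound is only about 0.60 against roughly 0.77 for the large store, so the large store ranks first.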

Q16 — EDA for A/B Pre-Check

Before launching an experiment, you must check balance between treatment/control historical covariates.

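📖 Click to view answer

Use standardized mean differences (SMD) plots across covariates; ensure |SMD|<0.1.

📝 Click to view explanation

P-values alone are affected by sample size; SMD is a balance measure that does not depend on sample size, with a commonly used threshold of 0.1.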
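A minimal sketch of the SMD balance check, using the pooled-standard-deviation form for continuous covariates. The covariate values and group assignments are invented for illustration.

```python
import math
from statistics import mean, variance


def smd(treated, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical pre-experiment covariates: (treatment values, control values).
covariates = {
    "age": ([30, 35, 40, 45, 50], [31, 36, 41, 46, 51]),
    "tenure_months": ([12, 24, 36, 48, 60], [12, 25, 36, 47, 60]),
}

smds = {name: smd(t, c) for name, (t, c) in covariates.items()}
imbalanced = [name for name, value in smds.items() if abs(value) >= 0.1]
print(smds, imbalanced)
```

Covariates that land in `imbalanced` would be candidates for re-randomization or for adjustment in the analysis.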

Q17 — Time Granularity Choice

Fraud signals at second-level timestamps; business reviews monthly. At which granularity to EDA?

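📖 Click to view answer

Explore at event/second level to catch patterns, then roll-up to daily/weekly for communication; keep drill-down links.

📝 Click to view explanation

First look for patterns at the granularity where the signals occur, then aggregate to match the management cadence, keeping a traceable drill-down path.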

Q18 — Reference Class Forecasting

CFO asks for expected cost of a new data center. Prior projects show optimistic bias.

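📖 Click to view answer

Build reference class distribution (similar past builds), report P50/P80 cost from that empirical distribution; adjust current estimate toward outside view.

📝 Click to view explanation

The reference-class method uses historical outcomes to correct the planning fallacy and yields a risk-informed budget point (e.g., P80 as a conservative budget).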

Q19 — Label Noise Check

Manual labels for complaint types disagree 12% between raters.

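📖 Click to view answer

Compute Cohen’s κ; if moderate, refine taxonomy & guidelines, run adjudication, then relabel a stratified sample.

📝 Click to view explanation

First quantify agreement, then raise κ through clearer label definitions and an example library; run adjudication review on the key levels.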
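Cohen's κ for two raters can be computed directly from the label lists, as in this sketch. The ten complaint labels are hypothetical, and the function assumes the raters do not agree purely by construction (expected agreement below 1).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    # Chance agreement: product of each rater's marginal label proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters on the same ten complaints.
rater_a = ["billing"] * 5 + ["service"] * 5
rater_b = ["billing"] * 4 + ["service"] * 6
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

Raw agreement here is 90%, but κ is 0.8 once chance agreement of 50% is removed, which is why the answer asks for κ rather than the disagreement rate alone.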

Q20 — Communicating EDA Limits

Exec wants “definitive drivers” from EDA slides tomorrow.

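📖 Click to view answer

Clarify EDA is for pattern discovery & hypothesis generation; propose a plan for confirmatory analysis/experiments.

📝 Click to view explanation

EDA does not deliver final causal conclusions; follow-up validation (modeling/experiments) is needed to support decisions.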
