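[IT168 Tech] While continuing to explore the Wimbledon data set, the author wanted to work out whether a player had performed as well as their seeding implied they should.
To do that, he needed the difference between the round a player actually reached and the round they were expected to reach. In the data set, the round is an ordered factor variable.
These are all the possible values: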
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
To turn a pair of strings into values of this factor, we do the following:
round = factor("Finals", levels = rounds, ordered = TRUE)
expected = factor("Winner", levels = rounds, ordered = TRUE)
> round
[1] Finals
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
> expected
[1] Winner
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
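Because the factor is ordered, R already knows that one round comes before another, so the two values can be compared directly. The snippet below is only a small sketch of that behaviour. It rebuilds the same levels as above, and the variable name `reached` is an assumption of this example (it is used instead of `round` to avoid masking base R's `round()` function).

```r
# Same levels as in the article, from the earliest stage to the latest
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# `reached` is a hypothetical name standing in for the article's `round`
reached  <- factor("Finals", levels = rounds, ordered = TRUE)
expected <- factor("Winner", levels = rounds, ordered = TRUE)

# Comparison operators follow the order of the levels
fell_short <- reached < expected

# An ordered factor can also be compared against a level given as a string
at_least_semis <- reached >= "Semi-Finals"

print(fell_short)
print(at_least_semis)
```

Comparisons like these say which round is later, but not by how many rounds, which is what the subtraction in the next step provides.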
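In this case the difference between the actual and the expected round should be -1: the player was expected to win the tournament but lost in the final. We can compute that difference by calling the unclass function on each variable, which exposes the underlying integer codes: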
> unclass(round) - unclass(expected)
[1] -1
attr(,"levels")
[1] "Did not enter" "Round of 128" "Round of 64" "Round of 32" "Round of 16" "Quarter-Finals"
[7] "Semi-Finals" "Finals" "Winner"
The result still carries some remnants of the factor (the levels attribute), so to get rid of them we can cast it to a numeric value:
> as.numeric(unclass(round) - unclass(expected))
[1] -1
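As an alternative to the article's approach, the integer codes can be extracted with `as.integer` before subtracting, which avoids carrying the levels attribute along in the first place. This is a minimal sketch rather than the author's own method. The names `reached` and `round_diff` are assumptions of this example.

```r
# Same levels as in the article, from the earliest stage to the latest
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# `reached` is a hypothetical name standing in for the article's `round`
reached  <- factor("Finals", levels = rounds, ordered = TRUE)
expected <- factor("Winner", levels = rounds, ordered = TRUE)

# as.integer() on a factor returns the position of each value in the levels,
# without the levels attribute that unclass() leaves attached
round_diff <- as.integer(reached) - as.integer(expected)

print(round_diff)
```

A negative value means the player went out earlier than expected, zero means they matched expectations, and a positive value means they went further.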
And that is it. We can now apply this calculation to all of the seeded players to see how they fared against expectations.
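The article does not show that final step, so the following is only a sketch of how it might look. The data frame `seeds`, its column names, the helper `to_round`, and every player and result in it are invented for illustration and do not come from the Wimbledon data set.

```r
# Same levels as in the article, from the earliest stage to the latest
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# Hypothetical data: players and results are made up for this example
seeds <- data.frame(
  player   = c("Player A", "Player B", "Player C"),
  round    = c("Finals", "Round of 32", "Semi-Finals"),
  expected = c("Winner", "Quarter-Finals", "Quarter-Finals"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert strings to the ordered factor used above
to_round <- function(x) factor(x, levels = rounds, ordered = TRUE)

# factor() and as.integer() are vectorised, so one expression covers every row
seeds$difference <- as.integer(to_round(seeds$round)) -
  as.integer(to_round(seeds$expected))

print(seeds)
```

With this made-up data, Player A and Player B fall short of their expected round while Player C exceeds it.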
Original article: Calculating the Difference Between Ordered Factor Variables