Calculating the Difference Between Ordered Factor Variables

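  [IT168 Tech] While continuing to explore the Wimbledon data set, the author wanted to work out whether a player had performed as well as their seeding suggested they should.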

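  To do that, the author needed the difference between the round a player actually reached and the round they were expected to reach. In the data set, the round is an ordered factor variable.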

  These are all the possible values:

  rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

  To turn a pair of strings into ordered factors, we would do the following:

> round = factor("Finals", levels = rounds, ordered = TRUE)
> expected = factor("Winner", levels = rounds, ordered = TRUE)
> round
[1] Finals
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
> expected
[1] Winner
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
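
  Because the factors are ordered, R can already compare them directly, which is a quick way to check whether a player fell short before computing any numeric difference. The following is a minimal sketch; the variable names actual, fell_short and met_seeding are illustrative assumptions (actual is used instead of round only to avoid masking base R's round() function).

```r
# Same levels vector as in the article
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# `actual` is an illustrative name for the article's `round` variable
actual   <- factor("Finals", levels = rounds, ordered = TRUE)
expected <- factor("Winner", levels = rounds, ordered = TRUE)

# Ordered factors support comparison operators, following the level order
fell_short  <- actual < expected    # "Finals" comes before "Winner"
met_seeding <- actual >= expected   # did the player reach the expected round?

print(fell_short)
print(met_seeding)
```

  A comparison only says which direction the result went; it does not say by how many rounds.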

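  In this case the difference between the actual and expected round should be -1: the player was expected to win the tournament but lost in the final. We can calculate that difference by calling the unclass function on each variable: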


> unclass(round) - unclass(expected)
[1] -1
attr(,"levels")
[1] "Did not enter"  "Round of 128"   "Round of 64"    "Round of 32"    "Round of 16"    "Quarter-Finals"
[7] "Semi-Finals"    "Finals"         "Winner"

  The result still carries some remnants of the factor (the levels attribute), so to get rid of them we can cast it to a numeric value:

> as.numeric(unclass(round) - unclass(expected))
[1] -1
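
  As a possible alternative to the unclass approach, as.integer() applied to each factor returns the bare level codes, so the subtraction never picks up the levels attribute in the first place. This is a sketch rather than the author's method; the names actual and round_diff are assumptions made for illustration.

```r
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

actual   <- factor("Finals", levels = rounds, ordered = TRUE)
expected <- factor("Winner", levels = rounds, ordered = TRUE)

# as.integer() on a factor yields its position in `rounds` with no attributes
round_diff <- as.integer(actual) - as.integer(expected)

print(round_diff)              # -1: one round short of expectation
print(attributes(round_diff))  # NULL: nothing left over from the factor
```

  Note that as.integer() is applied to each factor before subtracting, not to the difference.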

  With that in place, we can apply this calculation to all of the seeds to see how they performed against expectations.
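
  Applying the calculation to every seed could look something like the sketch below. The data frame, its column names and its rows are hypothetical placeholders made up for illustration, not values from the Wimbledon data set, and to_round is an assumed helper name.

```r
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# Hypothetical rows for illustration only; not real tournament results
seeds <- data.frame(
  player   = c("Player A", "Player B", "Player C"),
  round    = c("Finals", "Round of 16", "Semi-Finals"),
  expected = c("Winner", "Quarter-Finals", "Quarter-Finals"),
  stringsAsFactors = FALSE
)

# Assumed helper: convert round names to an ordered factor on the shared levels
to_round <- function(x) factor(x, levels = rounds, ordered = TRUE)

# Vectorised difference: negative means the player fell short of expectation
seeds$difference <- as.integer(to_round(seeds$round)) -
  as.integer(to_round(seeds$expected))

print(seeds)
```

  Building both columns from the same levels vector matters here: if the two factors had different level orders, their integer codes would not be comparable.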

Original article: Calculating the Difference Between Ordered Factor Variables
