山海之间

Learning R, Part 11: Data Processing

2020-06-10 · 7 min read
dplyr, R language, tutorial

In this lesson we learn how to process data with some commonly used functions from the dplyr and tidyr packages.

The worldcup dataset

Suppose the worldcup data has already been loaded into the workspace. Here is what it looks like.

> head(worldcup)
               Team   Position Time Shots Passes Tackles Saves
Abdoun      Algeria Midfielder   16     0      6       0     0
Abe           Japan Midfielder  351     0    101      14     0
Abidal       France   Defender  180     0     91       6     0
Abou Diaby   France Midfielder  270     1    111       5     0
Aboubakar  Cameroon    Forward   46     2     16       0     0
Abreu       Uruguay    Forward   72     0     15       0     0

Now suppose we want only four of its columns: Time, Passes, Tackles, and Saves. We can use the select() function from dplyr.

> wc_1 <- worldcup %>% select(Time, Passes, Tackles, Saves)
> head(wc_1)
           Time Passes Tackles Saves
Abdoun       16      6       0     0
Abe         351    101      14     0
Abidal      180     91       6     0
Abou Diaby  270    111       5     0
Aboubakar    46     16       0     0
Abreu        72     15       0     0

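In the code above, %>% is the pipe operator that dplyr provides (it comes from the magrittr package). It works much like | in Linux: it passes the result of the previous step to the next step as its first argument.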

So worldcup %>% select(Time, Passes, Tackles, Saves) is equivalent to select(worldcup, Time, Passes, Tackles, Saves).
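As a quick check of this equivalence, here is a minimal sketch on a small made-up data frame. The name toy and its values are invented for illustration and are not taken from worldcup.

```r
library(dplyr)

# A tiny made-up data frame standing in for worldcup (hypothetical values)
toy <- data.frame(Time   = c(16, 351, 180),
                  Passes = c(6, 101, 91),
                  Shots  = c(0, 0, 0))

piped  <- toy %>% select(Time, Passes)   # pipe form
nested <- select(toy, Time, Passes)      # ordinary function-call form

# Compare the two results; both keep only Time and Passes
identical(piped, nested)
```

The pipe form reads left to right in the order the steps happen, which is why it scales better once several steps are chained.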

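When a task takes several steps, the pipe operator is very handy: it avoids creating a string of intermediate variables, so those intermediate results do not have to be kept in the workspace. All of the operations below are built on it.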

Next, building on the code above, we take the mean of each of the four selected columns.

> wc_2 <- worldcup %>%
+ select(Time, Passes, Tackles, Saves) %>%
+ summarise(Time = mean(Time), Passes = mean(Passes), Tackles = mean(Tackles), Saves = mean(Saves))
> wc_2
      Time   Passes  Tackles     Saves
1 208.8639 84.52101 4.191597 0.6672269

The summarise() function collapses the data into summary values: a mean, a sum, or any other aggregate.
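To illustrate the "other aggregates" part, the sketch below computes several different summaries in one summarise() call on a made-up data frame (toy and its values are hypothetical). The second call uses across(), which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up data (hypothetical values, not the real worldcup data)
toy <- data.frame(Time = c(16, 351, 180), Passes = c(6, 101, 91))

# Several different summaries in a single call
toy_summary <- toy %>%
  summarise(n_players   = n(),          # number of rows
            total_time  = sum(Time),
            mean_passes = mean(Passes),
            max_passes  = max(Passes))

# across() applies the same function to several columns at once
# (assumes dplyr 1.0.0 or later)
toy_means <- toy %>%
  summarise(across(c(Time, Passes), mean))

toy_summary
toy_means
```

With across(), the long summarise() call used for the four worldcup columns could be written without repeating mean() once per column.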

Now we want to reshape the result into the following form.

      var           mean
     Time    208.8638655
   Passes     84.5210084
  Tackles      4.1915966
    Saves      0.6672269

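This is where the gather() function from tidyr comes in. It turns a wide table into a long one: the column names are collected into one column, and the values that sat under them into another.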

> wc_3 <- worldcup %>% 
+     select(Time, Passes, Tackles, Saves) %>%
+     summarize(Time = mean(Time),
+               Passes = mean(Passes),
+               Tackles = mean(Tackles),
+               Saves = mean(Saves)) %>%
+     gather(var, mean)
> wc_3
      var        mean
1    Time 208.8638655
2  Passes  84.5210084
3 Tackles   4.1915966
4   Saves   0.6672269

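The first argument of gather() is actually the data passed in from the previous step. The second argument, var, names the new column that holds the original column headers, and the third argument, mean, names the new column that holds the original values.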
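In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below runs both on a one-row table of made-up values (wide and its numbers are hypothetical) so the argument mapping is visible: the key name corresponds to names_to and the value name to values_to.

```r
library(dplyr)
library(tidyr)

# One-row table of made-up means (hypothetical values)
wide <- data.frame(Time = 200, Passes = 80, Saves = 0.5)

# gather(): the first name is the key column, the second the value column
long_gather <- wide %>% gather(var, mean)

# pivot_longer() expresses the same reshape with named arguments
# (assumes tidyr 1.0.0 or later)
long_pivot <- wide %>%
  pivot_longer(cols = everything(), names_to = "var", values_to = "mean")

long_gather
long_pivot
```

The two results hold the same rows; pivot_longer() returns a tibble rather than a plain data frame.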

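Finally, the mean column has too many decimal places, and we only want one. The mutate() function can add new columns to the data or replace existing ones. So we create a mean column rounded to one decimal place, which replaces the original mean column.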

> wc_4 <- worldcup %>% 
+     select(Time, Passes, Tackles, Saves) %>%
+     summarize(Time = mean(Time),
+               Passes = mean(Passes),
+               Tackles = mean(Tackles),
+               Saves = mean(Saves)) %>%
+     gather(var, mean) %>%
+     mutate(mean = round(mean, 1))
> wc_4
      var  mean
1    Time 208.9
2  Passes  84.5
3 Tackles   4.2
4   Saves   0.7
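The difference between adding and replacing a column with mutate() comes down to the name on the left-hand side. A minimal sketch, using a small data frame typed in by hand for illustration (the name stats and the column name mean_1dp are hypothetical):

```r
library(dplyr)

# Small hand-typed table for illustration
stats <- data.frame(var = c("Time", "Passes"), mean = c(208.8639, 84.5210))

# A new name on the left adds a column; the original mean column is kept
added <- stats %>% mutate(mean_1dp = round(mean, 1))

# Reusing an existing name overwrites that column in the result
replaced <- stats %>% mutate(mean = round(mean, 1))

names(added)
names(replaced)
```

In both cases stats itself is left unchanged; mutate() returns a modified copy.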

The Titanic dataset

We use the titanic dataset to learn a few more operations. The first step is again to use select() to pick some of its columns.

> titanic_1 <- titanic %>% 
+     select(Survived, Pclass, Age, Sex)
> head(titanic_1)
  Survived Pclass Age    Sex
1        0      3  22   male
2        1      1  38 female
3        1      3  26 female
4        1      1  35 female
5        0      3  35   male
6        0      3  NA   male

Row 6 of the Age column contains an NA. To drop every row whose Age is NA, we can use the filter() function.

> titanic_2 <- titanic %>% 
+     select(Survived, Pclass, Age, Sex) %>%
+     filter(!is.na(Age))
> head(titanic_2)
  Survived Pclass Age    Sex
1        0      3  22   male
2        1      1  38 female
3        1      3  26 female
4        1      1  35 female
5        0      3  35   male
6        0      1  54   male

filter() keeps the rows that satisfy the condition, so is.na needs a ! in front of it to negate the test.
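The effect of the ! is easy to see on a few made-up rows (toy and its values are hypothetical). The sketch also shows tidyr's drop_na() as an alternative way to do the same cleanup.

```r
library(dplyr)
library(tidyr)

# Made-up passengers (hypothetical values), one with a missing Age
toy <- data.frame(Survived = c(0, 1, 0), Age = c(22, NA, 54))

kept    <- toy %>% filter(!is.na(Age))   # rows where Age is present
dropped <- toy %>% filter(is.na(Age))    # without "!", only the NA rows remain

# drop_na() from tidyr removes rows with NA in the named column
kept_alt <- toy %>% drop_na(Age)

kept
dropped
kept_alt
```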

Next, we sort the Age values into three categories: Under 15, 15 to 50, and Over 50. To do this we create a new agecat column with mutate(), and do the binning with the cut() function.

> titanic_3 <- titanic %>% 
+     select(Survived, Pclass, Age, Sex) %>%
+     filter(!is.na(Age)) %>%
+     mutate(agecat = cut(Age, breaks = c(0, 14.99, 50, 150),
+                         include.lowest = TRUE,
+                         labels = c("Under 15", "15 to 50", "Over 50")))
> head(titanic_3)
  Survived Pclass Age    Sex   agecat
1        0      3  22   male 15 to 50
2        1      1  38 female 15 to 50
3        1      3  26 female 15 to 50
4        1      1  35 female 15 to 50
5        0      3  35   male 15 to 50
6        0      1  54   male  Over 50
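To see how these break points treat values on or near the boundaries, the sketch below applies the same cut() call to a handful of made-up ages (the vector ages is hypothetical). By default each interval is closed on the right, and include.lowest = TRUE additionally closes the first interval on the left.

```r
# Made-up ages chosen to sit on or near the break points
ages <- c(0, 14, 14.99, 15, 50, 50.5, 80)

agecat <- cut(ages,
              breaks = c(0, 14.99, 50, 150),
              include.lowest = TRUE,   # without this, an age of exactly 0 becomes NA
              labels = c("Under 15", "15 to 50", "Over 50"))

# Intervals are [0, 14.99], (14.99, 50], (50, 150]
data.frame(ages, agecat)
```

Note that an age of exactly 50 lands in 15 to 50, while 50.5 lands in Over 50.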

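Finally, we group the data by Pclass, agecat, and Sex, then aggregate within each group to get the total number of passengers and the number of survivors, and compute the survival rate.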

> titanic_4 <- titanic %>% 
+     select(Survived, Pclass, Age, Sex) %>%
+     filter(!is.na(Age)) %>%
+     mutate(agecat = cut(Age, breaks = c(0, 14.99, 50, 150), 
+                         include.lowest = TRUE,
+                         labels = c("Under 15", "15 to 50",
+                                    "Over 50"))) %>%
+     group_by(Pclass, agecat, Sex) %>%
+     summarise(N = n(),
+               survivors = sum(Survived == 1),
+               perc_survived = 100 * survivors / N) %>% ungroup()
> titanic_4
# A tibble: 18 x 6
   Pclass agecat   Sex        N survivors perc_survived
    <int> <fct>    <chr>  <int>     <int>         <dbl>
 1      1 Under 15 female     2         1         50   
 2      1 Under 15 male       3         3        100   
 3      1 15 to 50 female    70        68         97.1 
 4      1 15 to 50 male      72        32         44.4 
 5      1 Over 50  female    13        13        100   
 6      1 Over 50  male      26         5         19.2 
 7      2 Under 15 female    10        10        100   
 8      2 Under 15 male       9         9        100   
 9      2 15 to 50 female    61        56         91.8 
10      2 15 to 50 male      78         5          6.41
11      2 Over 50  female     3         2         66.7 
12      2 Over 50  male      12         1          8.33
13      3 Under 15 female    27        13         48.1 
14      3 Under 15 male      27         9         33.3 
15      3 15 to 50 female    74        33         44.6 
16      3 15 to 50 male     217        29         13.4 
17      3 Over 50  female     1         1        100   
18      3 Over 50  male       9         0          0   
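The same group-then-summarise pattern can be checked by hand on a few made-up rows (toy and its values are hypothetical). This sketch uses the .groups = "drop" argument, which assumes dplyr 1.0.0 or later, in place of the trailing ungroup().

```r
library(dplyr)

# Made-up passengers (hypothetical values)
toy <- data.frame(Pclass   = c(1, 1, 1, 3, 3),
                  Sex      = c("female", "female", "male", "male", "male"),
                  Survived = c(1, 0, 1, 0, 0))

by_group <- toy %>%
  group_by(Pclass, Sex) %>%
  summarise(N = n(),                               # passengers per group
            survivors = sum(Survived == 1),        # survivors per group
            perc_survived = 100 * survivors / N,   # can reuse columns defined above
            .groups = "drop")                      # return an ungrouped result

by_group
```

Inside one summarise() call, later expressions can refer to columns created earlier in the same call, which is how perc_survived is built from survivors and N.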

Summary

In this lesson we learned how to use several common functions from dplyr and tidyr:

  • select()
  • group_by()
  • summarise()
  • mutate()
  • filter()
  • gather()

This article was first published on the WeChat official account 柠檬培养师 (ID: yantinger90). You are welcome to follow it!
