Advanced R (2nd ed.)

· ☕ 5 min read · ✍️ Hoontaek Lee
🏷️
  • #Book Review
  • #Research
  • #R
  • #2020
  • Log

    • 2020-04-12: 23, 24.6

      I wrote some R code at the institute to QC meteorological data, and it gave me quite a bit of trouble.

      The computer has 8GB of memory.

      The input csv files were large, up to about 2GB, but that did not look like a problem for memory.

      Even so, I kept getting out-of-memory errors, although at a glance the job should never have needed more than 8GB.

      I split the loaded files by hand and pushed on with the QC, but in the end I missed the deadline my friend had given me.

      My guess is that functions copy variables internally while they run, so memory usage ends up higher than I expected.

      From now on I can probably get by with slightly smaller csv files,

      but this seemed like a good chance to learn about code performance and memory management, so I looked into Hadley Wickham's book.

      What I wanted was in Chapter 23, which covers code performance (= run time) and memory management.

    5. Techniques

    23. Measuring performance

    23.2 Profiling

    23.2.1. Visualising profiles

    Profiling means analysing the run time of your code line by line.
    In R you can either

    1. use the Profile menu in RStudio, or
    2. call functions such as utils::Rprof() or profvis::profvis().
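For completeness, here is a minimal sketch of the base-R route with utils::Rprof() and utils::summaryRprof(). The function busy_work() and the 10 ms sampling interval are assumptions made up for this illustration; the work is deliberately CPU-bound so that the sampler has something to record.

```r
# Hypothetical workload: repeatedly sort random numbers to burn CPU time.
busy_work <- function(reps = 100) {
  total <- 0
  for (i in seq_len(reps)) {
    total <- total + sum(sort(runif(1e5)))
  }
  total
}

prof_file <- tempfile(fileext = ".out")
utils::Rprof(prof_file, interval = 0.01)  # start sampling the call stack every 10 ms
result <- busy_work()
utils::Rprof(NULL)                        # stop profiling

prof <- utils::summaryRprof(prof_file)    # aggregate the samples per function
head(prof$by.self)                        # time spent in each function itself
unlink(prof_file)
```

The summary is a plain list of data frames, which is handy in scripts but harder to read than the profvis widget.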

    profvis is easier to read, so I will only write up that one.

    First, write some sample functions to profile:

    library(profvis)
    
    f <- function() {
      pause(0.1)
      g()
      h()
    }
    g <- function() {
      pause(0.1)
      h()
    }
    h <- function() {
      pause(0.1)
    }
    

    ํ”„๋กœํŒŒ์ผ ์ง„ํ–‰.
    ์ฝ”๋“œ๋Š” ๊ฐ„๋‹จํ•˜๋‹ค.

    1
    
    profvis(f())
    

    The result above is really an interactive HTML widget.
    It works fine in .Rmd or .html,
    but not in .md.
    For now I have replaced it with a screenshot taken from the .html output. How could I get it to work in .md as well?

    The bottom panel of the Flame Graph tab shows each piece of code and its execution time as a stack. It is good for checking the profile of simple code by eye.
    When the code is complex, the Data tab is (reportedly) more convenient. It shows the calls in a file-explorer-like tree.

    23.2.2. Memory profiling

    Profiling can analyse memory usage as well as running time.
    The profile output contains entries labelled <GC> (i.e. the garbage collector), which removes objects that are no longer in use. Let's see how it works.

    Creating some sample garbage objects and profiling them:

    profvis::profvis({
      x <- integer()
      for (i in 1:1e4) {
        x <- c(x, i)
      }
    })
    


    In the top panel, bars appear in the memory column.
    Bars extending to the right show memory that was allocated;
    bars extending to the left show unneeded memory that <GC> freed.

    Found it.

    The text says

    Here the problem arises because of copy-on-modify (Section 2.3): each iteration of the loop creates another copy of x. You'll learn strategies to resolve this type of problem in Section 24.6.

    and
    the "copy-on-modify" and "creates another copy of x" parts look like exactly what I had guessed.
    It points to Section 2.3 for the detailed cause and to Section 24.6 for the fix.

    23.2.4. Exercises

    1. Profile the following function with torture = TRUE. What is surprising? Read the source code of rm() to figure out what's going on.

    profvis::profvis({
      f <- function(n = 1e5) {
        x <- rep(1, n)
        rm(x)
      }
    },
    torture = TRUE)
    

    This throws an error (unimplemented type 'integer' in 'coerceToInteger'). Judging from the solutions, it does not seem to have been resolved yet.

    23.3. Microbenchmarking

    A microbenchmark is a performance measurement of a very small piece of code (e.g. a single function).
    One example would be comparing how long read.csv(), readr::read_csv() and data.table::fread() each take to read a single csv file.
    The bench package is used for this.

    x <- runif(100)
    (lb <- bench::mark(
      sqrt(x),
      x ^ 0.5
    ))
    
    ## # A tibble: 2 x 6
    ##   expression      min   median `itr/sec` mem_alloc `gc/sec`
    ##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    ## 1 sqrt(x)       900ns    1.1us   465768.      848B        0
    ## 2 x^0.5         5.8us    6.1us   114983.      848B        0
    

    The result is stored as a tibble.
    Each row holds one of the expressions being compared.
    Here I compared two ways of computing the square root of x: the sqrt() function (row 1) and the ^ operator (row 2).
    sqrt() came out about 6 times faster.

    Plotting the result:

    plot(lb)
    
    ## Loading required namespace: tidyr
    

    The plot shows a right-skewed distribution like this (so compare the minimum rather than the mean).
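The skew can also be checked numerically. This is a rough sketch that assumes the bench package is installed; it reads the raw per-iteration timings from the `time` list column of the bench::mark() result. The names `timings` and `timing_summary` are introduced here for illustration only.

```r
if (requireNamespace("bench", quietly = TRUE)) {
  x <- runif(100)
  lb <- bench::mark(sqrt(x), x ^ 0.5)

  # `time` is a list column: one vector of raw timings per expression
  timings <- as.numeric(lb$time[[1]])   # seconds, for sqrt(x)

  # With a long right tail, the mean is usually pulled above the median
  timing_summary <- c(
    min    = min(timings),
    median = median(timings),
    mean   = mean(timings)
  )
  print(timing_summary)
}
```

A few slow iterations (garbage collection, other processes) are enough to inflate the mean, which is why the minimum is the steadier number to compare.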

    23.3.3. Exercises

    1. Here are two other ways to compute the square root of a vector. Which do you think will be fastest? Which will be slowest? Use microbenchmarking to test your answers.
    x ^ (1 / 2)
    exp(log(x) / 2)
    

    The exercise asks me to compare two ways of computing a square root. Taking the log and then undoing it with exp() looks slower at first glance.
    Let's use bench::mark() and include sqrt() in the comparison.

    x <- runif(100)
    (lb <- bench::mark(
      sqrt(x),
      x ^ (1 / 2),
      exp(log(x) / 2)
    ))
    
    ## # A tibble: 3 x 6
    ##   expression         min   median `itr/sec` mem_alloc `gc/sec`
    ##   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    ## 1 sqrt(x)            1us    1.1us   759671.      848B        0
    ## 2 x^(1/2)          6.1us    6.4us   141474.      848B        0
    ## 3 exp(log(x)/2)   13.6us   13.9us    59601.      848B        0
    

    ์ƒ๊ฐ๋Œ€๋กœ ๋‚˜์™”๋‹ค. ๋ˆ„๊ฐ€ ๋ฃจํŠธ ๊ตฌํ•  ๋•Œ ๊ตณ์ด log-exp๋ฅผ ์‚ฌ์šฉํ• ๊นŒ…
    mem_alloc ์—ด์—์„œ ์‚ฌ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ์–‘์€ ๋ชจ๋‘ ๊ฐ™์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
    ์•„๋ฌดํŠผ ์ด๋Ÿฐ ์‹์œผ๋กœ ๊ฐ™์€ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ์—ฌ๋Ÿฌ ๋ฐฉ๋ฒ•์„ ๋น„๊ตํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค.
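In the same spirit, the csv-reader comparison mentioned at the start of this section could be sketched roughly as follows. To keep it runnable with base R alone, it uses system.time() as a coarse stand-in for bench::mark(), and it only adds readr and data.table when they happen to be installed. The temporary file and its two random columns are made up for the illustration.

```r
# Write a small throwaway csv file to read back
csv_path <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(a = runif(1e4), b = runif(1e4)),
  csv_path,
  row.names = FALSE
)

# One reader function per package; optional packages are added only if present
readers <- list(read.csv = function() utils::read.csv(csv_path))
if (requireNamespace("readr", quietly = TRUE)) {
  readers$read_csv <- function() suppressMessages(readr::read_csv(csv_path))
}
if (requireNamespace("data.table", quietly = TRUE)) {
  readers$fread <- function() data.table::fread(csv_path)
}

# Elapsed seconds for a single read with each reader
elapsed <- vapply(
  readers,
  function(read) system.time(read())[["elapsed"]],
  numeric(1)
)
elapsed
unlink(csv_path)
```

A single system.time() call is noisy on a file this small, so for a real comparison repeated measurements on a realistically sized file would be more trustworthy.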

    Conclusion

    From the profvis() output
    I can see which parts of my code eat up the most time
    and which parts generate the most garbage.

    If profvis::profvis() profiles a script as a whole,
    bench::mark() compares run time and memory usage for individual steps.

    Worth trying whenever code performance becomes a concern.

    24. Improving performance

    24.6 Avoiding copies

    The section only explains when copies happen; it does not give a fix.
    The message seems to be: avoid the situations that cause copies in the first place.
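One common way to avoid the copies in the growing-vector loop from the memory profiling section is to preallocate the result. The sketch below is an illustration under that assumption, not something taken from the book's text; grow() and prealloc() are names made up here.

```r
n <- 1e4

# Growing the vector: every c() call allocates a new, longer vector
grow <- function(n) {
  x <- integer()
  for (i in seq_len(n)) x <- c(x, i)
  x
}

# Preallocating: allocate the full length once, then fill the slots
prealloc <- function(n) {
  x <- integer(n)
  for (i in seq_len(n)) x[i] <- i
  x
}

identical(grow(n), prealloc(n))  # both build the same vector
system.time(grow(n))
system.time(prealloc(n))
```

Inside prealloc() the assignment x[i] <- i can usually modify x in place, because no other name refers to the vector and its type does not change.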

    When you use c(), append(), cbind(), rbind(), paste() or (in some cases) x[i] <- y,
    R allocates memory for the newly created object.
    It seems similar to using a new variable such as temp when you want to swap the values of two variables in C.

    x <- 1
    y <- 2
    temp <- x   # keep the old value of x
    x <- y
    y <- temp   # restore the old x from temp, not from the overwritten x
    rm(temp)
    

    The example below shows how much copying slows code down.

    random_string <- function() {
      paste(sample(letters, 50, replace = TRUE), collapse = "")
    }
    strings10 <- replicate(10, random_string())
    strings100 <- replicate(100, random_string())
    
    collapse <- function(xs) {
      out <- ""
      for (x in xs) {
        out <- paste0(out, x)
      }
      out
    }
    
    bench::mark(
      loop10  = collapse(strings10),
      loop100 = collapse(strings100),
      vec10   = paste(strings10, collapse = ""),
      vec100  = paste(strings100, collapse = ""),
      check = FALSE
    )[c("expression", "min", "median", "itr/sec", "n_gc")]
    
    ## # A tibble: 4 x 4
    ##   expression      min   median `itr/sec`
    ##   <bch:expr> <bch:tm> <bch:tm>     <dbl>
    ## 1 loop10       28.7us     32us    24857.
    ## 2 loop100      1.11ms   1.29ms      614.
    ## 3 vec10           7us    7.5us   112153.
    ## 4 vec100         49us   49.9us    17895.
    

    random_string() generates as many strings of length 50 as needed,
    and collapse() or paste() joins them into a single string.
    This is where the weakness of the for loop shows.
    With paste(), 100 strings took about 7 times longer than 10 strings,
    whereas with the for loop they took about 40 times longer.
    This is why the advice is to prefer vectorized methods over for loops when performance matters.
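A back-of-the-envelope model suggests why the loop scales so much worse. The sketch assumes that the only cost is the number of characters written: each paste0(out, x) call builds a new string containing everything accumulated so far, while the vectorized paste() writes each character once. Both helper functions are hypothetical and ignore all fixed overhead, so the ratios are an upper-bound intuition rather than a prediction of the measured timings.

```r
# Loop: iteration i writes len * i characters, so the total is len * n * (n + 1) / 2
chars_written_loop <- function(n, len = 50) len * sum(seq_len(n))

# Vectorized: every character is written once
chars_written_vec <- function(n, len = 50) len * n

chars_written_loop(100) / chars_written_loop(10)  # 252500 / 2750, roughly 92
chars_written_vec(100) / chars_written_vec(10)    # 5000 / 500 = 10
```

The loop's work grows quadratically with the number of strings and the vectorized version's grows linearly. The measured ratios are smaller than the model's because per-call overhead dominates at these small sizes.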

    Conclusion

    Know when copies happen, and write code that avoids them where possible.
