๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๋จธ์‹ ๋Ÿฌ๋‹, ๋”ฅ๋Ÿฌ๋‹/Reinforcement Learning

๊ฐ•ํ™”ํ•™์Šต ๋ณต์Šต ์ž๋ฃŒ 3: Exploit & Exploration

by ํ–‰๋ฑ 2022. 3. 12.

 

Dummy Q-learning algorithm
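For context, here is a minimal Python sketch of what that loop might look like, assuming a tabular, classic Gym-style environment; env, num_states, num_actions, num_episodes, and select_action are hypothetical placeholders for illustration, not names from the lecture.

# Dummy Q-learning loop (sketch) -- all names below are placeholders
import numpy as np

Q = np.zeros([num_states, num_actions])      # Q-table, initialized to zeros

for episode in range(num_episodes):
    s = env.reset()                          # classic Gym API: reset returns the state
    done = False
    while not done:
        a = select_action(Q, s)              # <-- the "Select an action a" step
        s_next, r, done, info = env.step(a)  # act, observe reward and next state
        Q[s, a] = r + np.max(Q[s_next, :])   # dummy update: no learning rate, no discount
        s = s_next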

 

๊ทธ๋ ‡๋‹ค๋ฉด, ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ "Select an action a"์—์„œ ์–ด๋–ค action์„ ์„ ํƒํ•ด์•ผํ• ๊นŒ?

 

๋‹ต์€ < Exploit & Exploration > ์ด๋‹ค.

Exploit์€ ์ด๋ฏธ ์•„๋Š” ๊ฐ’์„ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์ด๊ณ 

Exploration์€ randomํ•˜๊ฒŒ ๋ชจํ—˜ํ•˜๋Š” ๊ฒƒ์ด๋ผ ํ•  ์ˆ˜ ์žˆ๋‹ค.

์•„๋Š” ๊ฐ’์„ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์€ ์ข‹์ง€๋งŒ,

๋ณด๋‹ค ํšจ์œจ์ ์ธ ํ•ด๋‹ต์„ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ชจ๋ฅด๋Š” ๊ธธ๋กœ ๊ฐ€ ๋ณผ ํ•„์š”์„ฑ์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

์ฒซ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ E-greedy ๋ฐฉ๋ฒ•์ด๋‹ค.

์ž‘์€ ๊ฐ’ e๋ฅผ ์„ค์ •ํ•˜๊ณ , e์˜ ํ™•๋ฅ ๋กœ Exploration ํ•˜๋ฉฐ ๋‚˜๋จธ์ง€๋Š” Exploit ํ•œ๋‹ค.

# 1-1) E-greedy: with probability e explore, otherwise exploit
# (Q is the Q-table and s the current state, as in the sketch above)
import numpy as np

e = 0.1
if np.random.rand() < e:
    a = np.random.randint(num_actions)   # explore: pick a random action
else:
    a = np.argmax(Q[s, :])               # exploit: pick the best-known action
 

๋‹ค๋งŒ ๊ฐˆ ์ˆ˜๋ก e๊ฐ’์„ ์ž‘๊ฒŒํ•˜์—ฌ Explorationํ•˜๋Š” ํšŸ์ˆ˜๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค.

์ด๋ฅผ Decaying E-greedy ๋ฐฉ๋ฒ•์ด๋ผ๊ณ  ํ•œ๋‹ค.

# 1-2) Decaying E-greedy: shrink e as training progresses
for i in range(1000):
    e = 0.1 / (i + 1)                        # decay the exploration rate
    if np.random.rand() < e:
        a = np.random.randint(num_actions)   # explore
    else:
        a = np.argmax(Q[s, :])               # exploit
 

๋‘ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ Add random noise ์ด๋‹ค.

๊น€์„ฑํ›ˆ ๊ต์ˆ˜๋‹˜์˜ '์ ์‹ฌ ๋ฉ”๋‰ด ๊ณ ๋ฅด๊ธฐ' ๋น„์œ ๋ฅผ ๊ธฐ์–ตํ•˜์ž.

# 2-1) Add random noise to the Q-values before taking the argmax
a = np.argmax(Q[s, :] + np.random.randn(num_actions))
 

์ด ๋ฐฉ๋ฒ•์—๋„ decaying์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

# 2-2) Add decaying random noise: shrink the noise over time
for i in range(1000):
    a = np.argmax(Q[s, :] + np.random.randn(num_actions) / (i + 1))
 

E-greedy๊ฐ€ ์ ์€ ํ™•๋ฅ ๋กœ ์™„์ „ํžˆ randomํ•œ action์„ ๊ณ ๋ฅธ๋‹ค๋ฉด,

Add random noise๋Š”

randomํ•œ ๊ฐ’์ด Q(s, a)์— ๋”ํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ์™„์ „ํžˆ random์ด๋ผ๊ธฐ ๋ณด๋‹ค๋Š”

๋น„๊ต์  ๊ฐ’์ด ๋†’์€ 2๋ฒˆ์งธ, 3๋ฒˆ์งธ ๋†’์€ ๊ฐ’์ด ์ž˜ ์„ ์ •๋œ๋‹ค.

 

 

Reference: notes on Sung Kim's lecture series 모두를 위한 RL강좌 (Reinforcement Learning for Everyone)

์ž‘์„ฑ์ผ: 2018. 10. 4.

๋Œ“๊ธ€