
All categories (72)

Reinforcement Learning Review Notes 3: Exploit & Exploration
So, at the "Select an action a" step of this algorithm, which action should be selected? The answer is Exploit & Exploration. Exploit means using the values that are already known, and Exploration means venturing out at random. Using known values is good, but finding a more efficient solution requires trying paths that are still unknown. The first method is E-greedy. Set a small value e, explore with probability e, and exploit the rest of the time.
    # 1-1) E-greedy
    e = 0.1
    if rand < e:
        a = random
    else:
        a = argmax(Q(s, a))
However, the value of e is made smaller as time goes on, so that Exploratio..
2022. 3. 12.
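The E-greedy rule in the excerpt above can be sketched in runnable Python. This is only a minimal illustration: the helper names `epsilon_greedy` and `decayed_epsilon`, the list-based Q-values, and the `start / (episode + 1)` decay schedule are all assumptions made here, since the excerpt is cut off before it says how e should shrink.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of the current state."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the largest known Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(episode, start=0.1):
    """Hypothetical schedule: shrink e as the episode count grows."""
    return start / (episode + 1)

# Made-up Q-values for one state with four actions.
q_row = [0.0, 0.5, 0.2, 0.1]
greedy_action = epsilon_greedy(q_row, epsilon=0.0)    # e = 0 never explores
random_action = epsilon_greedy(q_row, epsilon=1.0)    # e = 1 always explores
```

With `epsilon=0.0` the function always exploits, and with `epsilon=1.0` it always explores, so the two extremes make the trade-off easy to see.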
Reinforcement Learning Review Notes 2: Dummy Q-learning algorithm
To derive the basic Q-learning formula, let's look at one "belief".
1. First, I am at s, and
2. taking action a moves me to s' and gives me reward r.
Here, let's believe that Q exists at s'. What "believing that Q exists at s'" means is probably this: assume we already know the reward we would receive for taking an action at s' (the state reached by taking a at s). (Not one specific action, but the reward for any action at all.) Now, expressing Q(s, a) in terms of Q(s', a'):
    Q(s, a) = r + max(a') Q(s', a')
r is the reward obtained immediately by taking a at s, and max(a') Q(s', a') is the maximum rewa.. obtained in the steps after that
2022. 3. 12.
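The update Q(s, a) = r + max(a') Q(s', a') from the excerpt above can be tried out on a tiny example. The sketch below assumes a made-up corridor world with four states in a row, a goal at the right end, and purely random action selection. The names `step`, `train`, `N_STATES`, and `ACTIONS` are hypothetical and do not come from the original post.

```python
import random

N_STATES = 4          # states 0..3; state 3 is the goal (hypothetical toy world)
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right

def step(s, a_idx):
    """Deterministic transition: returns (next_state, reward, done)."""
    s_next = min(max(s + ACTIONS[a_idx], 0), N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def train(episodes=50, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Select an action a (purely at random here, to keep the sketch short).
            a = rng.randrange(len(ACTIONS))
            s_next, r, done = step(s, a)
            # Dummy Q-learning update: Q(s, a) = r + max(a') Q(s', a')
            Q[s][a] = r + max(Q[s_next])
            s = s_next
    return Q

Q = train()
```

The reward of 1 earned at the goal is copied backward one state per episode, so after a few episodes every "move right" entry on the way to the goal holds 1.0.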
Reinforcement Learning Review Notes 1: Concept of RL
Reinforcement learning basically works like this: a mouse (the Actor, or Agent) takes an action, receives the corresponding reward, observes the changed state, and takes an action again. Think of the Q function as something that returns the corresponding reward when given a state and an action. (Assume such a "big brother Q" exists.) What the Agent wants to know is the action that produces the maximum reward. This can be written in the following two mathematical notations.
max(a') Q(s, a'): the maximum reward value (Q value) obtainable at state s by varying a'
argmax(a') Q(s, a'): (in the same situation as above) the argument a' that gives the maximum Q value
Reference: Sung Kim 모두를 위한 R..
2022. 3. 12.
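The difference between max(a') Q(s, a') and argmax(a') Q(s, a') is easy to see on a small lookup table. In the sketch below, the Q-table contents, the state and action names, and the helper names `max_q` and `argmax_q` are all invented for illustration.

```python
# Hypothetical Q-table: Q[s][a] for two states and three actions (made-up values).
Q = {
    "s0": {"left": 0.1, "right": 0.7, "stay": 0.3},
    "s1": {"left": 0.4, "right": 0.2, "stay": 0.0},
}

def max_q(Q, s):
    """max(a') Q(s, a'): the largest Q value available at state s."""
    return max(Q[s].values())

def argmax_q(Q, s):
    """argmax(a') Q(s, a'): the action a' that attains that largest Q value."""
    return max(Q[s], key=Q[s].get)

best_value = max_q(Q, "s0")      # a number: the best Q value at s0
best_action = argmax_q(Q, "s0")  # an action: the one that achieves it
```

Here `max_q` returns a value while `argmax_q` returns the action that achieves it, which is the action the Agent actually wants to know.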
Getting started with django
1. pipenv shell (step inside the bubble, i.e. the virtual environment)
2. python3 manage.py runserver
3. Stop the server with Ctrl+C, then run python3 manage.py migrate
4. python3 manage.py createsuperuser
5. Start the server again with python3 manage.py runserver; you can then log in at 127.0.0.1:8000/localhost => the admin panel becomes available
2022. 1. 4.