huggingface RL URL

  • Deep RL is a type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results.
  • In other words: the agent learns how to act in a given environment from the actions it takes and the results it sees.

What is Reinforcement Learning?

๊ธฐ๋ณธ์ ์ธ ์•„์ด๋””์–ด๋Š” AI๊ฐ€ action์— ๋Œ€ํ•ด์„œ reactionํ•˜๊ณ  ๋ฆฌ์›Œ๋“œ๋ฅผ ๋ฐ›์œผ๋ฉด์„œ ํ™˜๊ฒฝ์„ ๋ฐฐ์›Œ๋‚˜๊ฐ€๋Š” ๊ฒƒ์ด๋‹ค.

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

In other words, reinforcement learning is a framework for solving control tasks (decision problems): we build an AI model that learns by interacting with its environment through trial and error, with the rewards it receives (positive or negative) as its only feedback.

The Reinforcement Learning Framework

![[Pasted image 20231009125404.png]]

RL์ด ํ•™์Šต์„ ํ•˜๋Š” ๊ณผ์ •์„ ์œ„์— ๋„์‹๊ณผ ๊ฐ™๋‹ค.

  1. The environment sends the agent a state St.
  2. Given St, the agent takes an action At.
  3. The environment transitions to a new state St+1.
  4. The environment also sends the agent a reward Rt+1 as feedback on At.

These four steps are how RL learns, and they break down into the state, the action (I think of it as a "response"), the reward, and the new state.

The goal of reinforcement learning is maximizing cumulative reward, i.e. making the accumulated reward as large as possible.
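
To make steps 1–4 and the cumulative reward concrete, here is a minimal sketch of the loop using the gymnasium API; the environment name and the random placeholder policy are my own assumptions for illustration, not part of the course text:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")           # assumed example environment
state, info = env.reset(seed=0)         # step 1: the environment hands out the first state St

cumulative_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # step 2: the agent picks an action At (random placeholder)
    # steps 3-4: the environment returns the next state St+1 and a reward for At
    state, reward, terminated, truncated, info = env.step(action)
    cumulative_reward += reward         # the quantity RL tries to maximize
    done = terminated or truncated

env.close()
print(cumulative_reward)
```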

The reward hypothesis: the central idea of Reinforcement Learning

Why is the goal of the agent to maximize the expected return?

The reason is that RL rests on the reward hypothesis: any goal can be described as the maximization of the expected cumulative reward.

Markov Property

RL์„ ๊ณต๋ถ€ํ•˜๋ฉด์„œ ๊ฐ€์žฅ ๋งŽ์ด ๋“œ๋Š” ๋ง์ด MDP(Markov Decision Property)๋ผ๋Š” ๊ฒƒ์ด๋‹ค.

์ด ๋ง์ด ์˜๋ฏธํ•˜๋Š” ๋ถ€๋ถ„์€ ์šฐ๋ฆฌ์˜ agent(Model)์€ ๋‹ค์Œ ํ–‰๋™์„ ์ทจํ•˜๊ธฐ ์œ„ํ•ด ํ˜„์žฌ์˜ ์ƒํƒœ St๋งŒ์ด ์ค‘์š”ํ•˜๋‹ค๊ณ  ์ด์•ผ๊ธฐ ํ•˜๋Š” ๊ฒƒ์ด๋ฉฐ ์ด์ „์˜ ์ƒํƒœ(St-1 โ€ฆ)๋‚˜ ๋ฐ˜์‘(At-1 โ€ฆ)์ด ์ค‘์š”ํ•˜์ง€ ์•Š๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

Observation/State Space

The observation/state space is the information the agent gets about the current state of its environment.

However, an observation and a state are strictly different.

State: a complete description of the current situation (there is no information hidden beyond the state); the environment can be observed perfectly.

Observation: a partial description of the state; it can tell you about only part of the environment. e.g. in Super Mario the whole map is huge, but you can only see the part you are currently playing; what the agent receives there is an observation.
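
In code, the difference shows up in the environment's observation space. A small sketch with gymnasium; Super Mario isn't a built-in environment, so Blackjack (which hides the dealer's second card) stands in as the partially observed example here:

```python
import gymnasium as gym

# CartPole's observation is the full physical state — nothing is hidden ("state").
full = gym.make("CartPole-v1")
print(full.observation_space)     # Box(4,): cart position/velocity, pole angle/velocity

# Blackjack's observation omits the dealer's hole card — the agent sees only part
# of the true situation ("observation").
partial = gym.make("Blackjack-v1")
print(partial.observation_space)  # Tuple(Discrete(32), Discrete(11), Discrete(2))
```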

Action Space

The action space is the set of all possible actions the agent can take in the environment. These actions can be drawn from either a discrete or a continuous space, as the sketch after the list below shows.

  • Discrete space: the set of possible actions is finite.
    • e.g. in Super Mario you can press left, right, up, or down as often as you like, but there are only these 4 possible actions, so they form a discrete space.
  • Continuous space: the set of possible actions is infinite.
    • e.g. in a self-driving car, the steering wheel can be turned to any angle from 0 to 360 degrees.

Rewards and the discounting

The cumulative sum of rewards can be defined by the formula below.

… to be written later
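
For reference while this part of the note is unfinished, the standard discounted return from the course is:

$$
R(\tau) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}
$$

where the discount rate gamma is in [0, 1) (typically around 0.95–0.99) and makes near-term rewards count more than distant ones. A one-line sketch of the same computation:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward of one episode, rewards = [r1, r2, ...]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99**2 = 2.9701
```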

https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit1/unit1.ipynb#scrollTo=PAEVwK-aahfx