I see the following equation in "Reinforcement Learning: An Introduction", but I don't quite follow the step I have highlighted in blue below. How exactly is this step derived?
Answers:
This is the answer for everybody who wonders about the clean, structured math behind it (i.e. if you belong to the group of people who know what a random variable is, and that you must show or assume that a random variable has a density, then this is the answer for you ;-)):
First of all, we need the Markov decision process to have only a finite number of $L^1$ rewards.
Theorem 1: Let $X \in L^1(\Omega)$ and let $Y$ be another random variable such that $X, Y$ have a common density. Then
$$E[X \mid Y = y] = \int_{\mathbb{R}} x\, p(x \mid y)\, dx.$$
Proof: essentially proven here by Stefan Hansen.
Theorem 2: Let $X \in L^1(\Omega)$ and let $Y, Z$ be further random variables such that $X, Y, Z$ have a common density. Then
$$E[X \mid Y = y] = \int_{\mathcal{Z}} p(z \mid y)\, E[X \mid Y = y, Z = z]\, dz,$$
where $\mathcal{Z}$ is the range of $Z$.
Proof:
$$\begin{aligned}
E[X \mid Y = y] &= \int_{\mathbb{R}} x\, p(x \mid y)\, dx && \text{(Thm. 1)}\\
&= \int_{\mathbb{R}} x\, \frac{p(x, y)}{p(y)}\, dx\\
&= \int_{\mathbb{R}} x\, \frac{\int_{\mathcal{Z}} p(x, y, z)\, dz}{p(y)}\, dx\\
&= \int_{\mathcal{Z}} \int_{\mathbb{R}} x\, \frac{p(x, y, z)}{p(y)}\, dx\, dz\\
&= \int_{\mathcal{Z}} \int_{\mathbb{R}} x\, p(x \mid y, z)\, p(z \mid y)\, dx\, dz\\
&= \int_{\mathcal{Z}} p(z \mid y) \int_{\mathbb{R}} x\, p(x \mid y, z)\, dx\, dz\\
&= \int_{\mathcal{Z}} p(z \mid y)\, E[X \mid Y = y, Z = z]\, dz && \text{(Thm. 1)}
\end{aligned}$$
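In the discrete case the integrals become sums, and Theorem 2 can be sanity-checked numerically. Below is a minimal sketch in Python/NumPy with an arbitrary, made-up joint pmf over small finite supports; all numbers and names are illustrative only, not part of the original proof.

```python
# A minimal numerical sanity check of Theorem 2 in the discrete case
# (integrals become sums). The joint pmf is random and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Joint pmf p(x, y, z) over small finite supports for X, Y, Z.
x_vals = np.array([0.0, 1.0, 2.0])
p_xyz = rng.random((3, 4, 5))
p_xyz /= p_xyz.sum()

y = 1  # condition on Y = y

# Left-hand side: E[X | Y = y] = sum_x x p(x | y)
p_xy = p_xyz.sum(axis=2)              # p(x, y)
p_y = p_xy.sum(axis=0)                # p(y)
lhs = (x_vals * p_xy[:, y]).sum() / p_y[y]

# Right-hand side: sum_z p(z | y) E[X | Y = y, Z = z]
p_yz = p_xyz.sum(axis=0)              # p(y, z)
rhs = 0.0
for z in range(p_xyz.shape[2]):
    p_z_given_y = p_yz[y, z] / p_y[y]
    e_x_given_yz = (x_vals * p_xyz[:, y, z]).sum() / p_yz[y, z]
    rhs += p_z_given_y * e_x_given_yz

print(lhs, rhs)                       # the two numbers agree
assert np.isclose(lhs, rhs)
```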
Put $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$ and $G_t^{(K)} = \sum_{k=0}^{K} \gamma^k R_{t+k}$.
Either the state space is finite (then $\int_{\mathcal{S}} = \sum_{\mathcal{S}}$) or it is infinite and the integral is meant in the measure-theoretic sense; in both cases one can show that
$$E[G_t \mid S_t = s_t] = \lim_{K \to \infty} E\big[G_t^{(K)} \mid S_t = s_t\big] = E[R_t \mid S_t = s_t] + \gamma \int_{\mathcal{S}} p(s_{t+1} \mid s_t)\, E[G_{t+1} \mid S_{t+1} = s_{t+1}]\, ds_{t+1}$$
and then the rest is usual density manipulation.
REMARK: Even in very simple tasks the state space can be infinite! One example would be the 'balancing a pole' task. The state is essentially the angle of the pole (a value in $[0, 2\pi)$, an uncountably infinite set!).
REMARK: People might comment, 'duh, this proof can be shortened much more if you just use the density of $G_t$ directly', but the point is precisely that one has to justify that such a density exists and that the manipulations above are valid.
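For readers who want to see the recursion $E[G_t \mid S_t = s] = E[R_t \mid S_t = s] + \gamma \sum_{s'} p(s' \mid s)\, E[G_{t+1} \mid S_{t+1} = s']$ hold numerically in the finite case, here is a small sketch on my own toy Markov reward process (the policy already marginalized out, and using this answer's reward convention $G_t = \sum_k \gamma^k R_{t+k}$). It computes the value by truncating the series and then checks the recursion; all numbers are invented.

```python
# A small numerical sketch of the recursion
#   E[G_t|S_t=s] = E[R_t|S_t=s] + γ Σ_{s'} p(s'|s) E[G_{t+1}|S_{t+1}=s']
# for a finite-state Markov reward process. The numbers are arbitrary;
# only the identity being checked comes from the answer above.
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.2, 0.1],      # p(s'|s), rows sum to 1
              [0.1, 0.6, 0.3],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, -2.0, 0.5])      # E[R_t | S_t = s]

# v(s) = E[G_t|S_t=s] with G_t = Σ_k γ^k R_{t+k}, truncated at K terms
K = 500
v = np.zeros(3)
Pk_r = r.copy()                     # E[R_{t+k} | S_t = s] for k = 0
for k in range(K):
    v += gamma**k * Pk_r
    Pk_r = P @ Pk_r                 # advance to E[R_{t+k+1} | S_t = s]

# The recursion from the answer: v = r + γ P v
print(v, r + gamma * P @ v)
assert np.allclose(v, r + gamma * P @ v)
```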
Let $G_t$ be the total sum of discounted rewards after time $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$
The utility value of starting in state $s$ at time $t$ is the expected sum of discounted rewards $R$ when executing policy $\pi$ from state $s$ onwards:
$$\begin{aligned}
U^\pi(S_t = s) &= E_\pi[G_t \mid S_t = s]\\
&= E_\pi[(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots) \mid S_t = s]\\
&= E_\pi[(R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots)) \mid S_t = s]\\
&= E_\pi[(R_{t+1} + \gamma G_{t+1}) \mid S_t = s]\\
&= E_\pi[R_{t+1} \mid S_t = s] + \gamma E_\pi[G_{t+1} \mid S_t = s]\\
&= E_\pi[R_{t+1} \mid S_t = s] + \gamma E_\pi[E_\pi(G_{t+1} \mid S_{t+1} = s') \mid S_t = s]\\
&= E_\pi[R_{t+1} \mid S_t = s] + \gamma E_\pi[U^\pi(S_{t+1} = s') \mid S_t = s]\\
&= E_\pi[R_{t+1} + \gamma U^\pi(S_{t+1} = s') \mid S_t = s]
\end{aligned}$$
Assuming that the process satisfies the Markov property:
The probability $Pr$ of ending up in state $s'$ having started from state $s$ and taken action $a$ is
$$Pr(s' \mid s, a) = Pr(S_{t+1} = s' \mid S_t = s, A_t = a).$$
The reward $R$ of ending up in state $s'$ having started from state $s$ and taken action $a$ is
$$R(s, a, s') = E[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'].$$
Therefore we can re-write the above utility equation as
$$U^\pi(S_t = s) = \sum_a \pi(a \mid s) \sum_{s'} Pr(s' \mid s, a)\,\big[R(s, a, s') + \gamma U^\pi(S_{t+1} = s')\big],$$
where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$.
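As a sanity check of this final equation, here is a minimal sketch of iterative policy evaluation on an invented 2-state, 2-action MDP (the tables for $\pi$, $Pr$ and $R$ are arbitrary, not from any book example); it simply applies the re-written utility equation as a fixed-point update until convergence.

```python
# A minimal sketch of iterative policy evaluation using the equation
# U(s) = Σ_a π(a|s) Σ_s' Pr(s'|s,a) [R(s,a,s') + γ U(s')].
# The 2-state, 2-action MDP below is made up purely for illustration.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.95
pi = np.array([[0.5, 0.5],                     # π(a|s)
               [0.9, 0.1]])
Pr = np.array([[[0.8, 0.2], [0.1, 0.9]],       # Pr(s'|s,a), indexed [s, a, s']
               [[0.5, 0.5], [0.0, 1.0]]])
R = np.ones((n_states, n_actions, n_states))   # R(s,a,s'), here +1 from state 0
R[1, :, :] = -1.0                              # and -1 from state 1

U = np.zeros(n_states)
for _ in range(1000):                          # fixed-point iteration
    U_new = np.array([
        sum(pi[s, a] * Pr[s, a, s1] * (R[s, a, s1] + gamma * U[s1])
            for a in range(n_actions) for s1 in range(n_states))
        for s in range(n_states)
    ])
    delta = np.max(np.abs(U_new - U))
    U = U_new
    if delta < 1e-10:
        break

print(U)   # U now (approximately) satisfies the equation above
```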
Here is my proof. It is based on the manipulation of conditional distributions, which makes it easier to follow. Hope this one helps you.
$$\begin{aligned}
v_\pi(s) &= E[G_t \mid S_t = s]\\
&= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s]\\
&= \sum_{s'} \sum_r \sum_{g_{t+1}} \sum_a p(s', r, g_{t+1}, a \mid s)\, (r + \gamma g_{t+1})\\
&= \sum_a p(a \mid s) \sum_{s'} \sum_r \sum_{g_{t+1}} p(s', r, g_{t+1} \mid a, s)\, (r + \gamma g_{t+1})\\
&= \sum_a p(a \mid s) \sum_{s'} \sum_r \sum_{g_{t+1}} p(s', r \mid a, s)\, p(g_{t+1} \mid s', r, a, s)\, (r + \gamma g_{t+1})
\end{aligned}$$
Note that $p(g_{t+1} \mid s', r, a, s) = p(g_{t+1} \mid s')$ by the assumption of the MDP. Therefore
$$\begin{aligned}
v_\pi(s) &= \sum_a p(a \mid s) \sum_{s'} \sum_r p(s', r \mid a, s) \sum_{g_{t+1}} p(g_{t+1} \mid s')\, (r + \gamma g_{t+1})\\
&= \sum_a p(a \mid s) \sum_{s'} \sum_r p(s', r \mid a, s) \Big( r + \gamma \sum_{g_{t+1}} p(g_{t+1} \mid s')\, g_{t+1} \Big)\\
&= \sum_a p(a \mid s) \sum_{s'} \sum_r p(s', r \mid a, s)\, \big( r + \gamma v_\pi(s') \big)
\end{aligned}$$
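To convince yourself that the chain above is consistent, you can compare the final formula against a plain Monte Carlo estimate of $E[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$. The sketch below does this for an invented 2-state, 2-action MDP with deterministic rewards $r(s, a)$ (a special case of $p(s', r \mid s, a)$); everything about the example is made up for illustration.

```python
# Monte Carlo sanity check of v_π(s) = E[R_{t+1} + γ G_{t+1} | S_t = s].
# The MDP, policy and (deterministic) rewards r(s,a) are all invented.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.3, 0.7]]])   # next-state probs, indexed [s, a, s']
r_sa = np.array([[1.0, 0.0],
                 [-1.0, 2.0]])             # reward R_{t+1} received after (S_t, A_t)
pi = np.array([[0.4, 0.6],
               [0.7, 0.3]])                # π(a|s)

# Exact v_π from the final formula, written as a linear system v = r_π + γ P_π v
r_pi = (pi * r_sa).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def sampled_return(s, horizon=100):
    """One rollout of R_{t+1} + γ R_{t+2} + ... starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])
        g += discount * r_sa[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])
    return g

for s in range(2):
    estimate = np.mean([sampled_return(s) for _ in range(5000)])
    print(s, v[s], estimate)   # exact value vs. Monte Carlo estimate (should be close)
```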
What's with the following approach?
$$\begin{aligned}
v_\pi(s) &= E_\pi[G_t \mid S_t = s]\\
&= E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]\\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \cdot E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]\\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma v_\pi(s') \big].
\end{aligned}$$
The sums are introduced in order to retrieve $a$, $s'$ and $r$ from the conditional expectation.
I am not sure how rigorous my argument is mathematically, though. I am open to improvements.
This is just a comment/addition to the accepted answer.
I was confused at the line where the law of total expectation is being applied. I don't think the basic form of the law of total expectation helps here; a variant of it is in fact what is needed.
If $X, Y, Z$ are random variables and all the expectations exist, then the following identity holds:
$$E[X \mid Y] = E\big[E[X \mid Y, Z] \mid Y\big].$$
In this case, $X = G_{t+1}$, $Y = S_t$ and $Z = S_{t+1}$. Then
$$E[G_{t+1} \mid S_t = s] = E\big[E[G_{t+1} \mid S_t = s, S_{t+1} = s'] \mid S_t = s\big].$$
From there, one could follow the rest of the proof from the answer.
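As a tiny concrete instance of the identity $E[X \mid Y] = E[E[X \mid Y, Z] \mid Y]$ (my own toy example, not from the book): let $Y$, $Z$, $W$ be independent fair coins taking values $0$ and $1$, and let $X = Y + Z + W$. Then directly
$$E[X \mid Y = y] = y + E[Z] + E[W] = y + 1,$$
while going through $Z$ first gives
$$E\big[E[X \mid Y, Z] \mid Y = y\big] = E\big[Y + Z + \tfrac12 \,\big|\, Y = y\big] = y + \tfrac12 + \tfrac12 = y + 1,$$
so the two routes agree, exactly as in the step above with $X = G_{t+1}$, $Y = S_t$, $Z = S_{t+1}$.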
$E_\pi(\cdot)$ usually denotes the expectation assuming the agent follows policy $\pi$. It looks like $r$, lower case, stands for a realized value of the random reward $R_{t+1}$.
Thus, the expectation accounts for the policy probability as well as the transition and reward functions, here expressed together as $p(s', r \mid s, a)$.
Even though the correct answer has already been given and some time has passed, I thought the following step-by-step guide might be useful:
By linearity of the expected value we can split $E[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$ into $E[R_{t+1} \mid S_t = s] + \gamma E[G_{t+1} \mid S_t = s]$.
I will outline the steps only for the first part, as the second part follows by the same steps combined with the Law of Total Expectation.
$$\begin{aligned}
E[R_{t+1} \mid S_t = s] &= \sum_r r\, P[R_{t+1} = r \mid S_t = s]\\
&= \sum_a \sum_r r\, P[R_{t+1} = r, A_t = a \mid S_t = s]\\
&\overset{\text{(III)}}{=} \sum_a \sum_r r\, P[R_{t+1} = r \mid A_t = a, S_t = s]\, P[A_t = a \mid S_t = s]\\
&= \sum_{s'} \sum_a \sum_r r\, P[S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s]\, P[A_t = a \mid S_t = s]\\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
\end{aligned}$$
where (III) follows from:
$$P[A, B \mid C] = \frac{P[A, B, C]}{P[C]} = \frac{P[A, B, C]}{P[C]} \cdot \frac{P[B, C]}{P[B, C]} = \frac{P[A, B, C]}{P[B, C]} \cdot \frac{P[B, C]}{P[C]} = P[A \mid B, C]\, P[B \mid C]$$
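If you prefer an empirical check of this first-part identity, the following sketch samples one-step transitions from an invented model $p(s', r \mid s, a)$ and policy $\pi(a \mid s)$ and compares the sample mean of $R_{t+1}$ with the nested-sum formula derived above. All tables are random, made-up numbers.

```python
# Empirical check of E[R_{t+1}|S_t=s] = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) r.
# The tables for p(s', r | s, a) and π(a|s) are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
r_vals = np.array([-1.0, 0.0, 2.0])             # the possible reward values

# p(s', r | s, a) for 2 states, 2 actions, 3 reward values, indexed [s, a, s', r]
p = rng.random((2, 2, 2, 3))
p /= p.sum(axis=(2, 3), keepdims=True)          # normalize over (s', r)
pi = np.array([[0.3, 0.7], [0.8, 0.2]])         # π(a|s)

s = 0
# The nested-sum formula from the last line of the derivation
formula = sum(pi[s, a] * p[s, a, s1, i] * r_vals[i]
              for a in range(2) for s1 in range(2) for i in range(3))

# Empirical mean of R_{t+1} over sampled one-step transitions from state s
samples = []
for _ in range(50_000):
    a = rng.choice(2, p=pi[s])
    flat = rng.choice(6, p=p[s, a].ravel())     # sample the pair (s', r) jointly
    samples.append(r_vals[flat % 3])            # keep only the reward part
print(formula, np.mean(samples))                # should agree up to Monte Carlo noise
```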
I know there is already an accepted answer, but I wish to provide a probably more concrete derivation. I would also like to mention that although @Jie Shi's trick somewhat makes sense, it makes me feel very uncomfortable :(. We need to consider the time dimension to make this work. And it is important to note that the expectation is actually taken over the entire infinite horizon, rather than just over $s$ and $s'$.
At this stage, I believe most of us should already have in mind how the above leads to the final expression--we just need to apply the sum-product rule ($\sum_a \sum_b \sum_c abc \equiv \sum_a a \sum_b b \sum_c c$) painstakingly.
Part 1
$$\sum_{a_0} \pi(a_0 \mid s_0) \sum_{a_1, \dots, a_T} \sum_{s_1, \dots, s_T} \sum_{r_1, \dots, r_T} \left( \prod_{t=0}^{T-1} \pi(a_{t+1} \mid s_{t+1})\, p(s_{t+1}, r_{t+1} \mid s_t, a_t) \times r_1 \right)$$
Well, this is rather trivial: all the probabilities disappear (actually, they sum to 1) except those related to $r_1$.
Part 2
Guess what, this part is even more trivial--it only involves rearranging the sequence of summations.
$$\begin{aligned}
&\sum_{a_0} \pi(a_0 \mid s_0) \sum_{a_1, \dots, a_T} \sum_{s_1, \dots, s_T} \sum_{r_1, \dots, r_T} \left( \prod_{t=0}^{T-1} \pi(a_{t+1} \mid s_{t+1})\, p(s_{t+1}, r_{t+1} \mid s_t, a_t) \right)\\
&\quad = \sum_{a_0} \pi(a_0 \mid s_0) \sum_{s_1, r_1} p(s_1, r_1 \mid s_0, a_0) \left( \sum_{a_1} \pi(a_1 \mid s_1) \sum_{a_2, \dots, a_T} \sum_{s_2, \dots, s_T} \sum_{r_2, \dots, r_T} \left( \prod_{t=0}^{T-2} \pi(a_{t+2} \mid s_{t+2})\, p(s_{t+2}, r_{t+2} \mid s_{t+1}, a_{t+1}) \right) \right)
\end{aligned}$$
And Eureka!! We recover a recursive pattern inside the big parentheses. Let us combine it with $\gamma \sum_{t=0}^{T-2} \gamma^t r_{t+2}$,
and part 2 becomes
$$\sum_{a_0} \pi(a_0 \mid s_0) \sum_{s_1, r_1} p(s_1, r_1 \mid s_0, a_0) \times \gamma v_\pi(s_1)$$
Part 1 + Part 2
$$v_\pi(s_0) = \sum_{a_0} \pi(a_0 \mid s_0) \sum_{s_1, r_1} p(s_1, r_1 \mid s_0, a_0) \times \big( r_1 + \gamma v_\pi(s_1) \big)$$
And now we can tuck in the time dimension and recover the general recursive formula:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \times \big( r + \gamma v_\pi(s') \big)$$
Final confession: I laughed when I saw people above mention the use of the law of total expectation. So here I am.
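For anyone who wants to see the trajectory-level argument carried out concretely: the sketch below enumerates every length-$T$ trajectory of an invented 2-state, 2-action MDP exactly as in the nested sums above (with deterministic rewards $r(s, a)$ to keep the enumeration small), and checks that the result matches the recursive formula applied $T$ times. It is only a toy cross-check, not part of the original answer.

```python
# Brute-force enumeration of all trajectories vs. the recursive formula,
# for a finite horizon T. Everything below is an invented toy example.
import itertools
import numpy as np

gamma, T = 0.9, 4
pi = np.array([[0.5, 0.5], [0.2, 0.8]])            # π(a|s)
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.1, 0.9], [0.5, 0.5]]])           # p(s'|s,a)
r = np.array([[1.0, 0.0], [2.0, -1.0]])            # reward for taking a in s

s0 = 0

# Exhaustive enumeration over (a_0, s_1, a_1, s_2, ..., a_{T-1}, s_T)
total = 0.0
for traj in itertools.product(range(2), repeat=2 * T):
    prob, ret, s = 1.0, 0.0, s0
    for t in range(T):
        a, s_next = traj[2 * t], traj[2 * t + 1]
        prob *= pi[s, a] * P[s, a, s_next]
        ret += gamma**t * r[s, a]
        s = s_next
    total += prob * ret

# Finite-horizon DP using the recursion v_k(s) = Σ_a π Σ_s' p [r + γ v_{k-1}(s')]
v = np.zeros(2)
for _ in range(T):
    v = np.array([sum(pi[s, a] * P[s, a, s1] * (r[s, a] + gamma * v[s1])
                      for a in range(2) for s1 in range(2)) for s in range(2)])

print(total, v[s0])
assert np.isclose(total, v[s0])
```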
There are already a great many answers to this question, but most involve few words describing what is going on in the manipulations. I'm going to answer it using way more words, I think. To start,
$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$
is defined in equation 3.11 of Sutton and Barto, with a constant discount factor $0 \leq \gamma \leq 1$; we can have $T = \infty$ or $\gamma = 1$, but not both.
$$\begin{aligned}
v_\pi(s) &\doteq E_\pi[G_t \mid S_t = s]\\
&= E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]\\
&= E_\pi[R_{t+1} \mid S_t = s] + \gamma E_\pi[G_{t+1} \mid S_t = s]
\end{aligned}$$
That last line follows from the linearity of expectation values. $R_{t+1}$ and $G_{t+1}$ are random variables, so the two terms can be handled separately.
Work on the first term. In words, I need to compute the expectation value of $R_{t+1}$ given that we know the current state is $s$:
$$E_\pi[R_{t+1} \mid S_t = s] = \sum_{r \in \mathcal{R}} r\, p(r \mid s).$$
In other words, the probability of the appearance of reward $r$ is conditioned on the state $s$. This distribution is a marginal of a distribution that also contains the variables $a$ and $s'$ (the action taken at time $t$ and the state at time $t+1$), so by the law of multiplication
$$p(r \mid s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s', a, r \mid s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi(a \mid s)\, p(s', r \mid a, s).$$
Where I have used $\pi(a \mid s) \doteq p(a \mid s)$, the book's notation for the policy. Substituting this back in gives
$$E_\pi[R_{t+1} \mid S_t = s] = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} r\, \pi(a \mid s)\, p(s', r \mid a, s),$$
as required. On to the second term, where I assume that $G_{t+1}$ is a random variable taking values in some finite space $\Gamma$, so that
$$E_\pi[G_{t+1} \mid S_t = s] = \sum_{g \in \Gamma} g\, p(g \mid s). \qquad (*)$$
Once again, I "un-marginalize" the probability distribution by writing (law of multiplication again)
$$\begin{aligned}
p(g \mid s) &= \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s', r, a, g \mid s)\\
&= \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g \mid s', r, a, s)\, p(s', r, a \mid s)\\
&= \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g \mid s', r, a, s)\, p(s', r \mid a, s)\, \pi(a \mid s)\\
&= \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g \mid s')\, p(s', r \mid a, s)\, \pi(a \mid s) \qquad (**)
\end{aligned}$$
The last line in there follows from the Markovian property. Remember that $G_{t+1}$ is the sum of all the future (discounted) rewards that the agent receives after state $s'$, so its distribution depends on the past only through $s'$. Plugging $(**)$ into $(*)$ and restoring the factor of $\gamma$:
$$\begin{aligned}
\gamma E_\pi[G_{t+1} \mid S_t = s] &= \gamma \sum_{g \in \Gamma} \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} g\, p(g \mid s')\, p(s', r \mid a, s)\, \pi(a \mid s)\\
&= \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} E_\pi[G_{t+1} \mid S_{t+1} = s']\, p(s', r \mid a, s)\, \pi(a \mid s)\\
&= \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} v_\pi(s')\, p(s', r \mid a, s)\, \pi(a \mid s)
\end{aligned}$$
as required, once again. Combining the two terms completes the proof:
$$v_\pi(s) \doteq E_\pi[G_t \mid S_t = s] = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid a, s)\, \big[ r + \gamma v_\pi(s') \big].$$
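The Markov reduction $p(g \mid s', r, a, s) = p(g \mid s')$ used in $(**)$ can also be checked by brute force. The sketch below does so on an invented 3-step episodic toy MDP with deterministic rewards $r(s, a)$ (so conditioning on $R_{t+1}$ adds nothing beyond $(S_t, A_t)$): it enumerates all trajectories and compares the conditional distribution of $G_{t+1}$ given the history $(S_t, A_t, S_{t+1})$ with the one given $S_{t+1}$ alone.

```python
# Brute-force check that the distribution of G_{t+1} given
# (S_t, A_t, R_{t+1}, S_{t+1}) depends only on S_{t+1}.
# Invented toy MDP; the episode ends after three actions, so G_1 = R_2 + γ R_3.
from collections import defaultdict
from itertools import product

import numpy as np

gamma = 0.5
pi = np.array([[0.3, 0.7], [0.6, 0.4]])          # π(a|s)
P = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.1, 0.9], [0.4, 0.6]]])         # P(s'|s,a)
r = np.array([[1.0, 0.0], [2.0, -1.0]])          # deterministic r(s,a)

by_history = defaultdict(lambda: defaultdict(float))   # key: (s0, a0, s1)
by_state = defaultdict(lambda: defaultdict(float))     # key: s1
for s0, a0, s1, a1, s2, a2 in product(range(2), repeat=6):
    prob = (0.5 * pi[s0, a0] * P[s0, a0, s1]           # uniform start state
            * pi[s1, a1] * P[s1, a1, s2] * pi[s2, a2])
    g1 = float(r[s1, a1] + gamma * r[s2, a2])          # G_1 = R_2 + γ R_3
    by_history[(s0, a0, s1)][g1] += prob
    by_state[s1][g1] += prob

def normalize(dist):
    z = sum(dist.values())
    return {g: p / z for g, p in dist.items()}

for (s0, a0, s1), dist in by_history.items():
    lhs, rhs = normalize(dist), normalize(by_state[s1])
    assert all(np.isclose(lhs.get(g, 0.0), q) for g, q in rhs.items())
print("p(g | s0, a0, s1) matches p(g | s1) for every history")
```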
UPDATE
I want to address what might look like a sleight of hand in the derivation of the second term. In the equation marked with $(*)$ I use a term $p(g \mid s)$, and then in the equation marked with $(**)$ I claim that $g$ only depends on $s'$, by the Markovian property.
If that argument doesn't convince you, try to compute what $p(g)$ is:
$$\begin{aligned}
p(g) &= \sum_{s' \in \mathcal{S}} p(g, s') = \sum_{s' \in \mathcal{S}} p(g \mid s')\, p(s')\\
&= \sum_{s' \in \mathcal{S}} p(g \mid s') \sum_{s, a, r} p(s', a, r, s)\\
&= \sum_{s' \in \mathcal{S}} p(g \mid s') \sum_{s, a, r} p(s', r \mid a, s)\, p(a, s)\\
&= \sum_{s \in \mathcal{S}} p(s) \sum_{s' \in \mathcal{S}} p(g \mid s') \sum_{a, r} p(s', r \mid a, s)\, \pi(a \mid s)\\
&\doteq \sum_{s \in \mathcal{S}} p(s)\, p(g \mid s) = \sum_{s \in \mathcal{S}} p(g, s) = p(g).
\end{aligned}$$
As can be seen in the last line, it is not true that $p(g \mid s) = p(g)$. The expected value of $g$ depends on which state you start in (i.e. the identity of $s$), if you do not know or assume the state $s'$.
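A minimal toy example of this point (mine, not from the book): suppose the episode ends after a single step, the reward is deterministically $1$ if that step is taken from state $s_1$ and $0$ if it is taken from state $s_2$, and the start state is $s_1$ or $s_2$ with probability $\tfrac12$ each. Then $p(g = 1 \mid s_1) = 1$ and $p(g = 1 \mid s_2) = 0$, while unconditionally $p(g = 1) = \tfrac12$: the distribution of the return really does depend on which state you condition on.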