- UPDATES & NOTICES
- [21:37 04/17] Please note that you are free to modify the RL and model hyperparameters (e.g., learning rate, batch size, network architecture) as needed.
- [20:13 04/16] [Tip] Even if the reward curve plateaus for a while, it may start increasing again after some time (possibly less than 1-2 hours for homework #1). However, if the plateau continues for too long, we recommend tuning your reward function.
- [16:00 04/15] [Tip] For 1-2, during training, try sampling more frequently from the target velocities where the policy often fails, instead of sampling uniformly; this helps the model learn from difficult cases (a sampling sketch appears after these notices).
- e.g., adaptive sampling, a Beta distribution, …
- [00:40 04/11] Problem 1-2 has been updated with more detailed explanations (red and blue highlights).
- Summary: Evaluation is based on the velocity along the x-axis, and you are responsible for explicitly including the target speed in the observation.
- [15:01 04/10] The submission method has been updated (e-mail → eTL).
- [20:17 04/06] You may use separate controllers (policies), custom observations, custom rewards, custom termination conditions, and so on for 1-1, 1-2, and the extra problem.
- [16:56 04/05] A time limit has been added to the extra problem.
- [16:46 04/05] [Notice] You are allowed to change the termination condition. For example, one of the default termination conditions is based on the character’s torso angle (it sets terminated = True); this limit can be too strict when the character tries to run fast or jump, so adjust it so the character can lean forward more if needed (a small sketch of widening these limits appears after these notices). During evaluation, the episode is only terminated when the time limit is reached (20 seconds for 1-1, 10 seconds for 1-2, and 20 seconds for the extra problem).
- [16:11 04/05] Skeleton code updated (adjusted the observation space to match the dimension of the custom observation for better compatibility with the learning library; see [custom_walker2d.py](https://github.com/snumrl/2025_SNU_HumanMotion_HW/commit/1366655abec0c62c0e36105ee8b9988cd7fe65fa#diff-570ca1be185388a495f6d8698818ec00ccd8d690783299baa118fd756ace787c)).
- [09:50 04/02] Skeleton code updated (Passed the vel flag when calling CustomEnvWrapper in render.py / learner.py)
- [09:31 04/02] Update 1-1 (1000 steps → 1000 environment steps (= simulation steps / 10))
- [16:38 04/01] The skeleton code link is available.
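
The 04/15 tip above can be implemented with a small failure-weighted sampler. This is a minimal sketch, not part of the skeleton code: it keeps a per-bin estimate of the recent tracking error over the 1.0-5.0 m/s range and draws target velocities from the worse-performing bins more often. How the sampled value is handed to the environment on reset depends on the skeleton's API, so treat the integration point as an assumption.

```python
import numpy as np

class AdaptiveVelocitySampler:
    """Failure-weighted target-velocity sampler (hypothetical helper, not part of
    the skeleton code). Bins the 1.0-5.0 m/s range and samples bins in proportion
    to their recent tracking error."""

    def __init__(self, v_min=1.0, v_max=5.0, n_bins=8, smoothing=0.9):
        self.v_min, self.v_max, self.n_bins = v_min, v_max, n_bins
        self.smoothing = smoothing            # EMA factor for the per-bin error
        self.errors = np.ones(n_bins)         # start out roughly uniform

    def sample(self):
        # Sampling probability per bin is proportional to recent error,
        # with a small floor so no bin is ever ignored completely.
        weights = self.errors + 0.05
        probs = weights / weights.sum()
        bin_idx = np.random.choice(self.n_bins, p=probs)
        width = (self.v_max - self.v_min) / self.n_bins
        low = self.v_min + bin_idx * width
        return np.random.uniform(low, low + width)   # uniform inside the chosen bin

    def update(self, target_vel, avg_vel):
        # Call once per finished episode with the target and the achieved average velocity.
        bin_idx = min(
            int((target_vel - self.v_min) / (self.v_max - self.v_min) * self.n_bins),
            self.n_bins - 1,
        )
        err = abs(avg_vel - target_vel)
        self.errors[bin_idx] = self.smoothing * self.errors[bin_idx] + (1 - self.smoothing) * err
```

A Beta distribution over the same range, skewed toward the velocities that currently fail most, is a simpler alternative when the hard cases cluster at one end of the range.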
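Regarding the 04/05 termination notice: in the stock Gymnasium Walker2d, the torso-angle and height limits are constructor keyword arguments, so widening them is one way to let the character lean forward without editing the environment source. Whether these kwargs pass through the skeleton's CustomEnvWrapper is an assumption; in custom_walker2d.py you may need to adjust the termination check directly instead.

```python
import gymnasium as gym

# Sketch only: the version string and kwargs follow the stock Walker2d, while the
# skeleton registers its own custom environment, so adapt this to custom_walker2d.py.
env = gym.make(
    "Walker2d-v4",
    healthy_angle_range=(-1.0, 1.5),  # default (-1.0, 1.0); allow more forward lean
    healthy_z_range=(0.7, 2.0),       # default (0.8, 2.0); tolerate a lower torso
)
```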

Reference: Walker2D environment (https://gymnasium.farama.org/environments/mujoco/walker2d/)
In this homework, you will train a simple character to walk in a simulated environment. More specifically, your goal is to teach the character to walk on flat terrain while matching a target walking speed. The objective is not only to move the character forward, but also to generate a natural walking motion that closely follows the given speed.
Skeleton Code: https://github.com/snumrl/2025_SNU_HumanMotion_HW.git
The skeleton code is based on the Walker2D environment from Gymnasium, and the default reinforcement learning algorithm provided is from Stable-Baselines3. Through this assignment, you will gain hands-on experience in controlling a simulated character and become familiar with applying reinforcement learning in physical simulation.
1-1. Train a Walking Controller on Flat Ground (40%)
- Train a controller that enables the character to walk forward and reach the end of the grid within 20 seconds (1000 environment steps (= simulation steps / 10)).
- You are allowed to use a custom observation or reward function if it helps improve learning efficiency and produce a more natural walking motion compared to the default setup (an illustrative shaping sketch follows this list).
- Bonus (+5%): Extra points will be awarded for especially natural-looking movement.
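
As a concrete but purely illustrative example of the custom reward allowed above, the wrapper below keeps the environment's default reward and adds a small penalty on action changes between consecutive steps, which tends to produce smoother-looking gaits. The weight is a guess, not a tuned value from the assignment.

```python
import gymnasium as gym
import numpy as np

class SmoothWalkReward(gym.Wrapper):
    """Hypothetical reward shaping for 1-1: keep the default reward and
    penalize jerky action changes to encourage a smoother gait."""

    def __init__(self, env, action_smooth_weight=0.05):
        super().__init__(env)
        self.action_smooth_weight = action_smooth_weight
        self._prev_action = None

    def reset(self, **kwargs):
        self._prev_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        action = np.asarray(action, dtype=np.float64)
        if self._prev_action is not None:
            # Penalize large torque changes between consecutive steps.
            reward -= self.action_smooth_weight * float(np.sum(np.square(action - self._prev_action)))
        self._prev_action = action.copy()
        return obs, reward, terminated, truncated, info
```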
1-2. Train with Target Velocity (40%)
- Use the “--vel” flag.
- The environment provides a target x-axis velocity between 1.0 and 5.0 m/s, which must be included in the observation. Train your controller to match the target velocity as closely as possible throughout the episode.
- The target velocity is given when the environment is reset, and the average velocity during the episode is measured from the position of the character’s center of mass (COM).
- Only the average velocity along the x-axis of the COM is compared against the target velocity, where the average velocity is computed as (total distance) / (total time).
- Evaluation is based on the average x-axis velocity over a 10-second episode, compared to the target velocity. 20 tests will be conducted. For each test, a score is awarded if the difference between the average velocity and the target velocity is within 10% of the target velocity.
- You may define your own reward functions and observation features to improve performance (illustrative sketches follow at the end of this section).
- You are free to implement any method you wish, but you must not directly modify the character’s pose, velocity, or acceleration in the simulation. All control must be done strictly through the action passed into `env.step(action)`.
- Bonus (+5%): Extra points will be awarded for especially natural-looking movement.
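
A minimal sketch of the custom observation and reward mentioned above: append the episode's target velocity to the observation and add a term that rewards staying close to it. The attribute name `target_vel` and the use of `info["x_velocity"]` are assumptions for illustration; check custom_walker2d.py for how the skeleton actually exposes the target and the current velocity.

```python
import gymnasium as gym
import numpy as np

class TargetVelObsReward(gym.Wrapper):
    """Sketch for 1-2: expose the target x-velocity in the observation and
    reward matching it. `env.unwrapped.target_vel` and info["x_velocity"]
    are assumed names; adapt them to the skeleton code."""

    def __init__(self, env, tracking_weight=1.0):
        super().__init__(env)
        self.tracking_weight = tracking_weight
        base = env.observation_space
        # One extra observation dimension for the target velocity (1.0-5.0 m/s).
        self.observation_space = gym.spaces.Box(
            low=np.append(base.low, 1.0),
            high=np.append(base.high, 5.0),
            dtype=np.float64,
        )

    def _augment(self, obs):
        return np.append(obs, float(self.env.unwrapped.target_vel))  # assumed attribute

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        target = float(self.env.unwrapped.target_vel)
        x_vel = float(info.get("x_velocity", 0.0))  # the stock Walker2d reports this in info
        # Reward is highest when the instantaneous x-velocity is near the target.
        reward += self.tracking_weight * float(np.exp(-np.square(x_vel - target)))
        return self._augment(obs), reward, terminated, truncated, info
```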
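For self-checking against the stated metric (average velocity = total distance / total time, pass if within 10% of the target), a tiny helper; the function name and example numbers are illustrative.

```python
def within_tolerance(com_x_start, com_x_end, episode_seconds, target_vel, tol=0.10):
    """Average x-velocity of the COM over the episode, compared to the target.
    A test passes when |average - target| <= 10% of the target."""
    avg_vel = (com_x_end - com_x_start) / episode_seconds
    return avg_vel, abs(avg_vel - target_vel) <= tol * target_vel

# Example: a 10-second episode that moved the COM 29.1 m with a 3.0 m/s target.
avg, ok = within_tolerance(0.0, 29.1, 10.0, 3.0)
print(f"average velocity = {avg:.2f} m/s, within 10%: {ok}")  # 2.91 m/s, True
```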