Physical adversarial textures that fool visual object tracking
Rey Reza Wiyatno
September 5


While adversarial attacks have proven effective in digital domains, few works have demonstrated the potency of such approaches in the physical world. Recent advances such as the adversarial eyeglasses (Sharif et al., 2016), patch (Brown et al., 2017), and turtle (Athalye et al., 2018) generate physically realizable adversaries, but have only been shown to work on image classification tasks. In object detection, there are other approaches such as the work by Eykholt et al. (2018). In contrast, we focus our efforts on the applications of adversarial examples in robotics.

For some primers on the topic of adversarial machine learning, here are two articles from a few months ago on adversarial attacks and defenses.

In our paper, “Physical Adversarial Textures that Fool Visual Object Tracking”, we propose Physical Adversarial Textures (PAT), the first physically realizable adversarial attack for regression-based object tracking models. In particular, our realization attacks GOTURN (Held et al., 2016), one of the well-known regression-based trackers.

We trained a GOTURN tracker on human targets, where the model predicts the location of the target on a subsequent frame given its location in a preceding frame. Our goal was to generate adversarial textures that confused the tracker when displayed in the physical world (on a TV or printed posters).

We also demonstrate a case study in which the adversarial textures were used to fool a person-following drone algorithm that relies solely on its visual input. We used posters for the attack because they are one of the simplest forms of displaying information and could be a realistic attack vector in the real world. An attacker could place the adversarial textures on a wall like graffiti, and they could disrupt object-tracking algorithms while not appearing suspicious to the average person.

We perform all of our attacks in the Gazebo simulator and demonstrate simulation-to-real-world transferability. Using simulation is beneficial since it provides any type of labelling needed to perform the attack. In our case, the goal is to update only the texture of the poster, which means we need to know the location of the poster within an image; this can be obtained for free from the simulator. This emphasizes the practicality of our method, since real-world visual tracking data with a textured poster and its location within the image can be hard to obtain.

Know the victim: what is GOTURN?

GOTURN is an object tracking model (i.e., a model to predict the location of an object in an image in the current frame, given the object position in the previous frame) that is powered by convolutional neural networks.

Here’s how GOTURN works: given a start frame (i.e., the “previous frame” in the figure below) and the location of the object we wish to track (i.e., a bounding box), we crop the image at that location to tell the model what to track. Then, given a current frame, we also crop it at the location defined by the bounding box from the previous frame. The assumption here is that the motion of the tracked object between two consecutive frames is small, and thus the object should still lie within this bounding box. Optionally, one may crop a region that is larger than the given bounding box. Throughout this article, we refer to the crops from the previous and current frames as the template and search region, respectively.
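As a concrete sketch of this cropping step, the following simplified helper computes a crop window twice the size of the previous bounding box; the function name, signature, and the 2x context factor are illustrative, not taken from the GOTURN codebase:

```python
def crop_region(image_w, image_h, bbox, context_factor=2.0):
    """Crop window centred on `bbox`, enlarged by `context_factor` and
    clipped to the image bounds. bbox = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * context_factor, (y2 - y1) * context_factor
    return (max(0.0, cx - w / 2.0), max(0.0, cy - h / 2.0),
            min(float(image_w), cx + w / 2.0), min(float(image_h), cy + h / 2.0))

# Template: crop the previous frame at the previous bounding box.
# Search region: crop the current frame with the SAME window, relying on
# the small-motion assumption that the target is still inside it.
prev_bbox = (100, 100, 200, 200)
search_window = crop_region(640, 480, prev_bbox)  # (50.0, 50.0, 250.0, 250.0)
```

The same window is applied to both frames, which is what lets the network phrase tracking as "find the template inside the search region."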

Illustration of GOTURN

Once we have these two crops, we pass them through convolutional layers to extract useful features from both the template and the search region. These features are then concatenated and passed to fully-connected layers that regress the location of the template within the search region. The network thus predicts the target’s location in the search-region coordinate frame. At test time, once we have the predicted bounding box, we convert it back to the original image coordinate frame, set the current frame to be the “previous frame,” wait until the next frame arrives (which becomes the new “current frame”), and then use this bounding box to perform the cropping.
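To make the coordinate bookkeeping concrete, here is a minimal sketch of that conversion step (names are illustrative): the predicted box, expressed in search-region pixels, is offset back into the full image frame before becoming the next crop window.

```python
def search_to_image(pred_bbox, search_window):
    """Map a box predicted in search-region pixel coordinates back into
    the coordinates of the full image."""
    sx1, sy1 = search_window[0], search_window[1]
    px1, py1, px2, py2 = pred_bbox
    return (px1 + sx1, py1 + sy1, px2 + sx1, py2 + sy1)

# One tracking step: the converted prediction becomes the crop window
# used on the next frame pair.
search_window = (50.0, 50.0, 250.0, 250.0)
pred_in_search = (60.0, 55.0, 160.0, 155.0)   # hypothetical network output
bbox_in_image = search_to_image(pred_in_search, search_window)
```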

Our Method

In general, adversarial attack techniques try to find an input that optimizes loss or objective functions, which may vary depending on the goal of the attacker. For example, when generating an adversary in a classification task, the attacker may choose to find an input to the target model that maximizes cross-entropy loss while preserving perceptual similarity between adversarial and non-adversarial examples. However, the generated adversaries are usually not invariant to various transformations (e.g., the adversarial image may cease to be adversarial after being slightly rotated).

Our attack is based on the Expectation over Transformation (EOT) framework (Athalye et al., 2018), whose objective is to find an adversarial example that minimizes the expected loss over various transformations, with the hope of eventually producing an adversary that is robust to these transformations. In our experiments, we assume white-box access to the victim model.
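In symbols, with τ the poster texture, T a distribution over scene transformations (camera pose, lighting, and so on), f the victim tracker applied to the rendered frame pair, and an attacker loss on the tracker’s prediction, the objective can be sketched as:

```latex
\tau^{*} = \arg\min_{\tau} \; \mathbb{E}_{t \sim T}
           \left[ \mathcal{L}_{\mathrm{adv}}\big( f(t(\tau)) \big) \right]
```

In practice, the expectation is approximated by averaging the loss over a minibatch of randomly sampled scenes.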

The figure below illustrates our method. Starting from an initial texture image that is projected onto a poster within a simulated scene, the attacker first generates a scene where the object to be tracked is placed near the texture. A camera can then be placed in the simulation environment to take a picture such that the texture and tracked object are both within the camera frame. This first image is the “previous frame.”

To get the “current frame”, we can apply small random motion to the camera and target object to simulate movement while ensuring that both the texture and the target object are still within the frame. Since we do everything in simulation, we have the privilege of getting the location (i.e., bounding box) of the tracked object in the frame.

In our experiments, transformations that we randomize include camera pose, object pose, lighting, background (i.e., simulation environment), and the tracked object (e.g., various people and robot models). We chose to randomize the target because we want the attack to work without requiring specific knowledge of the tracked object. More concretely, we would like to fool the tracker, regardless of whether Alice or Bob is being tracked at any given instant.
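A sketch of what such scene sampling might look like (all ranges, key names, and asset identifiers here are hypothetical, not the values used in the paper):

```python
import random

def sample_scene():
    """Draw one random scene configuration for an EOT iteration."""
    return {
        "camera_pose": [random.uniform(-0.5, 0.5) for _ in range(6)],  # xyz + rpy offsets
        "target_pose": [random.uniform(-1.0, 1.0) for _ in range(3)],
        "light_intensity": random.uniform(0.5, 1.5),
        "background": random.choice(["office", "warehouse", "outdoor"]),
        "target_model": random.choice(["person_a", "person_b", "pr2"]),
    }

scene = sample_scene()
```

Each minibatch element is rendered from a fresh draw, so the optimized texture cannot overfit to any single viewpoint, lighting condition, or target identity.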

Illustration of our method (Wiyatno & Xu, 2019)

We repeat this process to collect several consecutive frame pairs, forming a minibatch. We then pass this minibatch through the GOTURN model and calculate the expected loss that we define (more on this in the next section). We can then backpropagate through the GOTURN model to calculate the gradient of the expected loss with respect to the input pixels, and update our texture using gradient-based optimization methods (e.g., gradient descent).

It is important to note that we only update the texture, not the whole image. Thanks to the simulator, we can determine which pixels in the image belong to the texture and which are occluded by the tracked object. The whole process is then repeated for a fixed number of iterations, or until an adversary has been found according to some metric (e.g., mean intersection over union).
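The per-pixel masking can be sketched as follows, assuming the simulator exposes a map from image pixels to texels; the `uv_map` structure and the plain descent step are illustrative, not the paper's exact implementation:

```python
import numpy as np

def masked_texture_step(image_grad, texture, uv_map, lr=0.01):
    """Accumulate image-space gradients onto texture pixels only.

    uv_map[i][j] is the (u, v) texel rendered at image pixel (i, j), or
    None where the pixel shows the background or the occluding target.
    Such a map comes for free from the simulator."""
    tex_grad = np.zeros_like(texture)
    for i in range(len(uv_map)):
        for j in range(len(uv_map[0])):
            if uv_map[i][j] is not None:
                u, v = uv_map[i][j]
                tex_grad[u, v] += image_grad[i, j]
    # One gradient-descent step on the attacker's loss, clipped to [0, 1].
    return np.clip(texture - lr * tex_grad, 0.0, 1.0)
```

Pixels covered by the target or background contribute nothing, so only the visible poster texels are ever updated.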

The objective functions

In an image classification task, a successful attack usually means the attacker finds an adversarial example that is misclassified by the target model. So what does it mean to fool an object tracking model? In our work, we consider the attack successful if the tracker eventually loses track of the target and starts tracking the adversary instead. There are many ways to achieve this, and we can play a bit with the dimensions of the bounding box.

We experimented with various non-targeted and targeted losses. We also propose a novel family of guided losses that encourage certain adversarial attributes rather than strict outputs. As such, the guided approach acts as a middle ground between non-targeted and targeted attacks. Furthermore, we also experimented with hybrid losses, which are weighted linear combinations of different losses. Below are the definitions of the various losses that we use:

  • non-targeted (nt) loss: aims to increase GOTURN’s training loss
  • targeted-shrink (t-) loss: aims to shrink and move the predicted bounding box towards the bottom-left corner of the search area
  • targeted (t=) loss: aims to lock the predicted bounding box to the exact target location in the previous frame
  • targeted-grow (t+) loss: aims to grow the predicted bounding box to the maximum size of the search area
  • guided-shrink (ga-) loss: encourages the area of the predicted bounding box to shrink from the ground-truth
  • guided-grow (ga+) loss: encourages the area of the predicted bounding box to grow from the ground-truth

Note that the above losses are only examples; other losses are possible. Another example of a guided loss would be maximizing or minimizing the norm of the output. Furthermore, similar to other attacks that aim to preserve perceptual similarity between non-adversarial and adversarial examples, we also experimented with adding a perceptual similarity constraint to the objective function. However, this constraint is optional, and one may or may not include it depending on the goal of the attack.
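To make the distinctions concrete, here are simplified numpy sketches of a few of these losses; the exact forms and the 0.5 shrink ratio are illustrative assumptions, though GOTURN itself does train with an L1 box loss:

```python
import numpy as np

def l1_box_loss(pred, target):
    """GOTURN-style L1 distance between two (x1, y1, x2, y2) boxes."""
    return float(np.abs(np.asarray(pred) - np.asarray(target)).sum())

def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def nt_loss(pred, gt):
    # Non-targeted (nt): maximize the training loss (minimize its negation).
    return -l1_box_loss(pred, gt)

def t_lock_loss(pred, prev_box):
    # Targeted (t=): lock the prediction onto the previous-frame box.
    return l1_box_loss(pred, prev_box)

def ga_shrink_loss(pred, gt, ratio=0.5):
    # Guided-shrink (ga-): only penalize predictions whose area is not
    # yet smaller than `ratio` times the ground-truth area.
    return max(0.0, area(pred) - ratio * area(gt))

def hybrid_loss(pred, gt, prev_box, w=0.5):
    # Hybrid: a weighted combination of two losses, e.g. nt + t=.
    return w * nt_loss(pred, gt) + (1.0 - w) * t_lock_loss(pred, prev_box)
```

Note how the guided loss constrains only an attribute (the box area) and is indifferent to where the box ends up, which is what gives it more flexibility than a fully targeted loss.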

Visual servoing

In the field of robotics, visual servoing is a method to control a robot using visual feedback. Visual servoing has been used in many real-world applications. For example, one of the features offered by many aerial drone manufacturers today is the autonomous person-following feature. One of the possible implementations of person-following technology is to use the bounding box from tracking models such as GOTURN as the feedback to the controller. The controller could aim to maintain the size of the bounding box and centre its location by controlling the robot’s actuators.

In our experiments, we used the Parrot Bebop 2 and implemented a PID controller to follow the target. To smooth the bounding-box signal, we applied an exponential filter before passing it to the controller, which adds an additional challenge in attacking the whole servoing pipeline.
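A minimal sketch of such a pipeline follows; the gains, filter constant, and reference box area are hypothetical, not the values we tuned for the Bebop 2:

```python
import numpy as np

class ExponentialFilter:
    """Exponential smoothing of the bounding-box signal."""
    def __init__(self, alpha=0.3):
        self.alpha, self.state = alpha, None
    def __call__(self, x):
        self.state = x if self.state is None else (
            self.alpha * x + (1.0 - self.alpha) * self.state)
        return self.state

class PID:
    def __init__(self, kp, ki=0.0, kd=0.0, dt=1.0 / 30.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0
    def step(self, error):
        self.integral += error * self.dt
        deriv = (error - self.prev_err) / self.dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

yaw_pid, fwd_pid = PID(kp=0.8), PID(kp=0.5)
smooth = ExponentialFilter(alpha=0.3)

def servo_step(bbox, frame_w=640, ref_area=20000.0):
    """Turn a raw bounding box into yaw and forward velocity commands."""
    x1, y1, x2, y2 = smooth(np.asarray(bbox, dtype=float))
    cx = (x1 + x2) / 2.0
    box_area = (x2 - x1) * (y2 - y1)
    yaw_cmd = yaw_pid.step((frame_w / 2.0 - cx) / frame_w)    # keep box centred
    fwd_cmd = fwd_pid.step((ref_area - box_area) / ref_area)  # keep box size
    return yaw_cmd, fwd_cmd
```

Because the controller only ever sees the (filtered) bounding box, steering the box is enough to steer the drone, which is exactly what the adversarial texture exploits.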


Results

We first evaluate our adversarial textures in simulation. As seen below, the tracker successfully tracks the target when it moves in front of a randomly generated texture. Yet when it sees the generated adversarial texture, the tracker quickly loses the target and instead starts locking onto the adversarial texture.

Synthetic sample evaluation. We compare the performance of the tracker when facing randomly generated and adversarial textures. Here, the green bounding box represents the GOTURN prediction, while the purple box represents the ground truth.
Another simulated attack where the target is the PR2 robot.

We also evaluated the generated adversary in the real world, both when the camera is stationary and moving (i.e., servoing). Below are some examples of both the stationary and servo runs in an indoor environment. In these experiments, all the textures are displayed on a TV screen.

Indoor stationary run.
Another indoor stationary run tracking a different target.
Another indoor stationary run tracking a third target.
Indoor servo run. Notice how the camera (a drone) moves to follow the bounding box.

Below are some of the experiments that we conducted in an outdoor environment. Notice how the poster is now printed instead of displayed on a TV screen.

Outdoor stationary run.
Outdoor servo run.

In addition, we found performance differences between non-targeted, targeted, guided, and hybrid losses. Guided losses typically achieved faster convergence, possibly due to their flexible nature, at the cost of weaker adversarial strength after more iterations compared to non-targeted and targeted losses. When evaluating hybrid losses, we found that some combinations provide benefits while others do not. For example, combining the (nt) and (t=) losses results in better overall performance than using either variant alone.

Finally, we conducted an ablation study of EOT conditioning variables to find which transformations have the most impact in generating robust adversarial textures. This is akin to domain randomization approaches in sim2real research: not all domains are created equal. While we found variations in lighting, camera pose, and poster pose to be impactful, other variables such as the background, target pose, and target appearance are not as critical. Please see our paper for quantitative support.


Conclusion

In this work, we proposed a system to generate Physical Adversarial Textures (PAT), and successfully generated adversarial textures that fool a popular object tracking model. We further demonstrated that these attacks are physically realizable even though they were generated purely in simulation.

There are still some limitations with PATs. For example, our PATs may not work well in the presence of strong specular reflections. Thus, devising adversaries that are robust to specularities is an interesting direction for future research.

We also introduced the notion of a guided loss that offers faster convergence at the cost of adversarial strength, which can be beneficial when convergence speed is an important attack criterion. Finally, we studied the effect of the various EOT conditioning variables so that the attack can be performed more efficiently.

Our goal is to raise awareness that current vision-based tracking systems are vulnerable to adversarial examples. We recommend integrating auxiliary sensors such as GPS or IMUs for safety. As we show how adversarial examples can negatively impact real-world tracking implementations, we hope that other researchers will explore the weaknesses of other robotic systems to adversarial examples in order to develop more robust approaches.

Thanks to Anqi Xu, Peter Henderson, and Minh Dao for valuable comments and illustrations.