Posted by Amit Moryossef, Research Intern, Google Research

Video conferencing should be accessible to everyone, including users who communicate using sign language. However, because most video conferencing applications transition window focus to those who speak aloud, it is difficult for signers to “get the floor” so they can communicate easily and effectively. Enabling real-time sign language detection in video conferencing is challenging, since applications need to perform classification on the high-volume video feed, which makes the task computationally heavy. In part due to these challenges, there is only limited research on sign language detection.

In “Real-Time Sign Language Detection using Human Pose Estimation”, presented at SLRTP2020 and demoed at ECCV2020, we present a real-time sign language detection model and demonstrate how it can be used to provide video conferencing systems with a mechanism to identify the person signing as the active speaker.

Maayan Gazuli, an Israeli Sign Language interpreter, demonstrates the sign language detection system.

To enable a real-time solution that works with a variety of video conferencing applications, we needed to design a lightweight model that would be simple to “plug and play.” Previous attempts to integrate models into video conferencing applications on the client side demonstrated the importance of a lightweight model that consumes fewer CPU cycles in order to minimize the effect on call quality. To reduce the input dimensionality, we isolated the information the model needs from the video in order to classify every frame.

Because sign language involves the user’s body and hands, we start by running a pose estimation model, PoseNet. This reduces the input considerably, from an entire HD image to a small set of landmarks on the user’s body, including the eyes, nose, shoulders, and hands. We use these landmarks to calculate the frame-to-frame optical flow, which quantifies user motion for use by the model without retaining user-specific information. Each pose is normalized by the width of the person’s shoulders to ensure that the model attends to the person signing over a range of distances from the camera. The optical flow is then normalized by the video’s frame rate before being passed to the model.

To test this approach, we used the German Sign Language corpus (DGS), which contains long videos of people signing, along with span annotations that indicate in which frames signing is taking place. As a naïve baseline, we trained a linear regression model to predict when a person is signing using the optical flow data. This baseline reached around 80% accuracy, using only ~3μs (0.000003 seconds) of processing time per frame. By including the previous 50 frames’ optical flow as context to the linear model, it reaches 83.4% accuracy.

To generalize the use of context, we used a long short-term memory (LSTM) architecture, which retains memory over previous timesteps but requires no lookback. Using a single-layer LSTM followed by a linear layer, the model achieves up to 91.5% accuracy, with 3.5ms (0.0035 seconds) of processing time per frame.

Classification model architecture: (1) extract poses from each frame, (2) calculate the optical flow between every two consecutive frames, (3) feed it through an LSTM, and (4) classify the frame.

Once we had a functioning sign language detection model, we needed to devise a way to use it to trigger the active speaker function in video conferencing applications. We developed a lightweight, real-time sign language detection web demo that connects to various video conferencing applications and can set the user as the “speaker” when they sign. The demo leverages PoseNet fast human pose estimation and the sign language detection model running in the browser using tf.js, which enables it to work reliably in real time.

When the sign language detection model determines that a user is signing, it passes an ultrasonic audio tone through a virtual audio cable, which can be detected by any video conferencing application as if the signing user were “speaking.” The audio is transmitted at 20kHz, which is normally outside the human hearing range. Because video conferencing applications usually detect audio “volume” as talking, rather than detecting speech itself, this fools the application into thinking the user is speaking.

The sign language detection demo takes the webcam’s video feed as input and transmits audio through a virtual microphone when it detects that the user is signing. By default, the demo acts as a sign language detector. You can try our experimental demo right now! The training code and models, as well as the web demo source code, are available on GitHub. In the following video, we demonstrate how the model might be used.
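The shoulder-width pose normalization and frame-rate-normalized optical flow described in the post can be sketched as follows. This is a minimal NumPy illustration under our own assumptions, not the released code; the function names and the (K, 2) landmark layout are illustrative:

```python
import numpy as np

def normalize_pose(landmarks, left_shoulder, right_shoulder):
    """Scale 2-D landmarks by the shoulder width so that distance
    from the camera does not change the motion magnitudes.

    landmarks: (K, 2) array of (x, y) keypoints for one frame.
    """
    shoulder_width = np.linalg.norm(left_shoulder - right_shoulder)
    return landmarks / shoulder_width

def optical_flow(prev_pose, cur_pose, fps):
    """Frame-to-frame landmark displacement, scaled by the video's
    frame rate so features are in units of motion per second."""
    return (cur_pose - prev_pose) * fps
```

Scaling by `fps` makes videos recorded at different frame rates comparable: the same physical motion yields the same feature magnitude.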
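Feeding the previous 50 frames of optical flow to the linear baseline amounts to stacking a sliding window of per-frame features. A hedged sketch of that windowing, where the `add_context` helper and the zero-padding at the start of the video are our assumptions:

```python
import numpy as np

def add_context(flow, window=50):
    """Stack each frame's optical-flow features with the previous
    `window` frames (zero-padded at the start of the video) so a
    linear model sees temporal context.

    flow: (T, D) array of per-frame features.
    Returns: (T, (window + 1) * D) array.
    """
    T, D = flow.shape
    padded = np.vstack([np.zeros((window, D)), flow])
    return np.stack(
        [padded[t:t + window + 1].ravel() for t in range(T)]
    )
```

A linear classifier trained on these stacked features is what closes the gap from ~80% to 83.4% in the post's baseline.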
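The single-layer LSTM followed by a linear layer could look like the following PyTorch sketch. The original demo runs in the browser with tf.js, and the feature and hidden dimensions here are illustrative assumptions, not the published values:

```python
import torch
import torch.nn as nn

class SignDetector(nn.Module):
    """Single-layer LSTM over per-frame optical-flow features,
    followed by a linear head that scores signing vs. not-signing
    for every frame."""

    def __init__(self, input_dim=50, hidden_dim=64, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, flow):
        # flow: (batch, time, input_dim)
        hidden, _ = self.lstm(flow)   # (batch, time, hidden_dim)
        return self.head(hidden)      # per-frame class logits
```

Because the LSTM carries state forward only, each frame can be classified as it arrives, with no lookahead, which is what makes the approach viable in real time.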
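The 20 kHz trigger tone can be synthesized in a few lines of NumPy. The sample rate and amplitude below are our own illustrative choices (any rate above 40 kHz can represent a 20 kHz sine without aliasing); routing the samples into a virtual audio cable is platform-specific and not shown:

```python
import numpy as np

SAMPLE_RATE = 48000  # Hz; must exceed 2 * 20 kHz (Nyquist)
TONE_FREQ = 20000    # Hz, normally outside human hearing range

def ultrasonic_tone(duration_s, amplitude=0.1):
    """Generate a 20 kHz sine wave. Conferencing apps register its
    energy as speech activity, even though it is inaudible."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return amplitude * np.sin(2 * np.pi * TONE_FREQ * t)
```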