Sign Language to Speech

Introduction

In this project, we will create a machine learning model that recognizes ASL signs and turns them into speech using the Google Text-to-Speech (gTTS) module. In addition, the model will be able to recognize the middle finger and censor it when detected. We will split the project into 3 parts:

  1. Data Collection
  2. Model Training
  3. Final Product

Background

Where do we get data? This page has compiled a list of dataset aggregators which you may find useful, but for this project, we will be collecting our own dataset! One may ask, "Why go through all the trouble of collecting it yourself?" Well, here's why:

MediaPipe Hands

Hand Landmarks

Image by Google via https://google.github.io/mediapipe/solutions/hands

This is why I prefer not to use online datasets. As the saying goes, "Garbage in, garbage out": we would have to ensure that the points are plotted correctly for each individual training image, which may actually be more time consuming than collecting the data ourselves. Furthermore, for this project, instead of learning the hand signs from images, we will be learning from the 21 landmark coordinates. This provides a few benefits:

  1. Faster learning

  2. Takes up less storage space on your machine

  3. Independent of background noise

  4. Less data cleaning required

Reason:

Imagine learning from a 200px * 200px image: each image will be a matrix of size 200 * 200 * 3 (RGB values), totaling 120,000 datapoints! Of these 120,000 datapoints, not all are useful; for example, they include information about the background, which may skew the predictions, and thus an additional step of removing the background may be required. However, if we leverage MediaPipe Hands, each sign will only be a matrix of size 21 * 3 (X, Y, Z coordinates), and all 63 datapoints are information rich and independent of background noise (the short sketch after this paragraph makes the size difference concrete). With that out of the way, let's move on to the first part of the project, data collection.
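A minimal sketch of that size comparison in NumPy (the arrays are just placeholders, only the shapes matter):

  import numpy as np

  # A 200 x 200 RGB image versus the 21 MediaPipe hand landmarks (X, Y, Z each)
  image = np.zeros((200, 200, 3))
  landmarks = np.zeros((21, 3))

  print(image.size)      # 120000 datapoints per image
  print(landmarks.size)  # 63 datapoints per sign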

1. Data Collection

Let's first create a folder for this project. In this folder, there will be 4 items:

  1. "landmarks" folder ➜ To store the .npz files of X and y later

  2. DataCollection.ipynb/py ➜ Codes to collect data will be here

  3. ModelTraining.ipynb/py ➜ Codes to train the model will be here

  4. Final.ipynb/py ➜ Codes to put the model to use will be here

Hand Landmarks

DataCollection.ipynb/py - Let's first import the required modules.

  import cv2 
  import mediapipe as mp
  import numpy as np

OpenCV (cv2) is an open source computer vision and machine learning software library. In this project, we will mainly be using it to capture and read images.
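As a minimal illustration of what we will use it for (the full capture loop comes later in this section), grabbing and displaying a single frame from the webcam looks roughly like this:

  cap = cv2.VideoCapture(0)    # open the default webcam
  success, frame = cap.read()  # read one BGR frame
  if success:
      cv2.imshow("frame", frame)
      cv2.waitKey(0)           # wait for any key press before closing
  cap.release()
  cv2.destroyAllWindows()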

We will be writing a function, reference_from_wrist, that takes in 2 parameters, results and hand (covered in the later part of the code). This function is implemented to do 3 things:

  1. Change the coordinates of the wrist to [0, 0, 0] ➜ This will allow our model to learn from the sign's coordinates regardless of where it is being signed on the screen
  2. Change the remaining 20 coordinates to be referenced from the wrist ➜ Same reason as above
  3. Flip the coordinates along the y-axis if it is a left hand ➜ By leveraging MediaPipe's handedness detection, we can effectively halve the amount of training data needed, as we do not need to differentiate between left and right hand datapoints.

  def reference_from_wrist(results, hand):
    i = 0
    landmarks = []
    references = [0,0,0]
    for cords in results.multi_hand_landmarks[0].landmark:
      if i == 0:
        # Landmark 0 is the wrist: store it as the reference and map it to the origin
        references[0] = cords.x
        references[1] = cords.y
        references[2] = cords.z
        landmarks.append([0,0,0])

      else:
        # Express every other landmark relative to the wrist,
        # mirroring the x-axis for left hands
        if hand == "Right":
          landmarks.append([cords.x - references[0],cords.y - references[1],cords.z])
        else:
          landmarks.append([references[0] - cords.x,cords.y - references[1],cords.z])
      i += 1
    return landmarks

Next, we will declare a few variables:

  1. SIGNS ➜ The list of signs we want to capture. In this case, we will be spelling out "I LOVE ML", with the space sign to indicate when to call upon gTTS. We will also be learning to recognize the middle finger

  2. CAPTURES ➜ The number of "images" we will be capturing for each sign, in this case, 50 will suffice

  3. image_count ➜ Keeps track of the number of "images" left to capture

  4. current ➜ Serves as an index to keep track of which sign we are currently at

  5. X and y ➜ X stores the data points and y stores the corresponding labels

  #Rename to sign name
  SIGNS = ["I","L","O","V","E","M","space","middle_finger"]
  CAPTURES = 50
  image_count = CAPTURES
  current = 0

  X = []
  y = []

The following code can be found on the MediaPipe Hands page; I have added and modified a few lines to make it work for our project. There are a few variables you can play with: for instance, "max_num_hands" determines how many hands can be detected while the program is running, while "min_detection_confidence" and "min_tracking_confidence" determine how confident the model has to be before plotting the coordinates.

In the code, pressing space passes the results and the handedness of the processed hand image from MediaPipe to the function reference_from_wrist. "results" contains quite a few things; however, we only need the coordinates and the handedness.

After retrieving the new landmark coordinates, we append them and the label of the current capture to the lists X and y. The capturing process terminates once all the signs have been captured or the "q" key has been pressed, and the lists are then saved as NumPy arrays in the landmarks folder as .npz files to be loaded later on.

  mp_drawing = mp.solutions.drawing_utils
  mp_hands = mp.solutions.hands
  # For webcam input:
  cap = cv2.VideoCapture(0)
  with mp_hands.Hands(
      max_num_hands=1,
      min_detection_confidence=0.8,
      min_tracking_confidence=0.8) as hands:
      while cap.isOpened():
          success, image = cap.read()
          if not success:
              print("Ignoring empty camera frame.")
              # If loading a video, use "break" instead of "continue".
              continue

          # Flip the image horizontally for a later selfie-view display, and convert
          # the BGR image to RGB.
          image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
          # To improve performance, optionally mark the image as not writeable to
          # pass by reference.
          image.flags.writeable = False
          results = hands.process(image)

          # Draw the hand annotations on the image.
          image.flags.writeable = True
          image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
          annotated_image = image.copy()
          if results.multi_hand_landmarks:
              hand = results.multi_handedness[0].classification[0].label

              for hand_landmarks in results.multi_hand_landmarks:
                  mp_drawing.draw_landmarks(
                      annotated_image, hand_landmarks, mp_hands.HAND_CONNECTIONS)

          k = cv2.waitKey(1)
          title = "Taking  points for image {0} {1}".format(SIGNS[current],str(image_count))
          cv2.imshow("", annotated_image)
          cv2.setWindowTitle("", title)

          #if space pressed
          if k%256 == 32:
              if results.multi_hand_landmarks != None:
                  
                  new_landmarks = np.array(reference_from_wrist(results, hand))
                  X.append(new_landmarks)
                  y.append(SIGNS[current])

                  cv2.destroyAllWindows()
                  image_count -= 1
                  if image_count == 0:
                      current += 1
                      image_count = CAPTURES
          if cv2.waitKey(5) & 0xFF == ord("q") or current == len(SIGNS):
              np.savez("landmarks/X", np.array(X))
              np.savez("landmarks/y", np.array(y))
              break
  cap.release()
  cv2.destroyAllWindows()
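Before moving on, it is worth a quick sanity check of what was saved. Assuming all 50 captures were taken for each of the 8 signs, the shapes should look like this:

  # Load the saved arrays back to verify their shapes
  data_X = np.load("landmarks/X.npz")["arr_0"]
  data_y = np.load("landmarks/y.npz")["arr_0"]
  print(data_X.shape)  # expected: (400, 21, 3), i.e. 50 captures x 8 signs
  print(data_y.shape)  # expected: (400,)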

2. Model Training

After collecting the data needed, we will have to train the model to recognize the signs.

ModelTraining.ipynb/py - Import the required modules

  import numpy as np
  from sklearn.preprocessing import LabelEncoder
  from tensorflow.keras import utils
  from sklearn.model_selection import train_test_split

  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Dense, Flatten
  from tensorflow.keras.optimizers import Adam

Next, we will import the numpy arrays we have previously saved and split them into train and test sets.

  le = LabelEncoder()
  X = np.load("landmarks/X.npz")["arr_0"]
  y = np.load("landmarks/y.npz")["arr_0"]
  y = le.fit_transform(y)
  y = utils.to_categorical(y)
  x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.1,random_state=42)

Now, we will create the machine learning model. First, we declare it as a Sequential model, where we can build it layer by layer. Because our input is a 21 * 3 matrix, we first flatten it into a 1D array, since the dense layers (and our output layer) work on 1D arrays. Next, we add 2 dense layers with 50 neurons each, setting "relu" as the activation function. A dense layer is one where every neuron is connected to all the neurons in the previous layer. The final output layer is also a dense layer, with 8 nodes (1 for each class) and "softmax" as the activation function.

One may ask, how do we choose which activation functions to use? While there is no single right answer, this article may give you a better idea of which activation functions are more suitable in which scenarios.

  model = Sequential()
  model.add(Flatten(input_shape=(21, 3) ))
  model.add(Dense(50, activation="relu"))
  model.add(Dense(50, activation="relu"))
  model.add(Dense(len(le.classes_), activation="softmax"))
  model.summary()

We will then compile and start training the model.

  model.compile(optimizer="adam",
  loss = "categorical_crossentropy",
  metrics = ["accuracy"])

As we are training with very lightweight data, this process should not take long.

Training Process

The model performs very well with the train and test sets, but what about live testing?
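Before moving on to live testing, you can confirm the test-set performance for yourself with model.evaluate:

  # Accuracy on the 10% of samples that were held out from training
  loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
  print("Test accuracy:", accuracy)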

Proceed to save the model so we can reuse it in the future without training it again.

  model.save("my_model")

3. Final Product

Final.ipynb/py - We can now test out the model! Let's import the required modules!

  import cv2
  import mediapipe as mp
  import numpy as np
  from sklearn.preprocessing import LabelEncoder
  import keras
  import threading
  from gtts import gTTS
  import playsound
  import os

Import the classes from the y.npz and load the model we have just trained!

  le = LabelEncoder()
  model = keras.models.load_model("my_model")
  y = np.load("landmarks/y.npz")["arr_0"]
  y = le.fit_transform(y)

We will now need to create 4 functions for the program to work.

  1. reference_from_wrist ➜ This is the same function we previously wrote in "DataCollection.ipynb" to convert landmark coordinates

  2. handsign_prediction ➜ We will throw the coordinates collected in real time through our webcam input into this function. The function will then call on the model to predict the sign currently shown, and return the predicted sign together with its probability

  3. text_to_speech ➜ This function will convert the text received into audio

  4. draw_censor ➜ This function will be called if the middle finger is detected. It takes in the frame in which the middle finger was detected and the results (containing the coordinates) from MediaPipe, draws a black rectangle over the hand using OpenCV's rectangle function, and returns the censored frame

  def reference_from_wrist(results, hand):
    i = 0
    landmarks = []
    references = [0,0,0]
    for cords in results.multi_hand_landmarks[0].landmark:
      if i == 0:
        references[0] = cords.x
        references[1] = cords.y
        references[2] = cords.z
        landmarks.append([0,0,0])
  
      else:
        if hand == "Right":
          landmarks.append([cords.x - references[0],cords.y - references[1],cords.z])
        else:
          landmarks.append([references[0] - cords.x,cords.y - references[1],cords.z])
      i += 1
    return landmarks

  def handsign_prediction(landmarks, show=True):
    pred = model.predict([landmarks], batch_size=1)[0]
    prob = pred[np.argmax(pred)]
    return le.classes_[np.argmax(pred)],prob

  def text_to_speech(*text):
    text = "".join(text)
    text_speech = gTTS(text=text, lang = "en")
    filename = "text_to_speech.mp3"
    text_speech.save(filename)
    playsound.playsound(filename)
    os.remove(filename)
    return

  def draw_censor(results, image):
    smallest_x = 0
    smallest_y = 0
    biggest_x = 0
    biggest_y = 0
    i = 0

    image_height, image_width, _ = image.shape

    if not results.multi_hand_landmarks == None:
      for cords in results.multi_hand_landmarks[0].landmark:
        x = cords.x * image_width
        y = cords.y * image_height
        if i == 0:
          smallest_x = int(x)
          smallest_y = int(y)
          biggest_x = int(x)
          biggest_y = int(y)
        else:
          if x < smallest_x:
            smallest_x = int(x)
          if y < smallest_y:
            smallest_y = int(y)
          if x > biggest_x:
            biggest_x = int(x)
          if y > biggest_y:
            biggest_y = int(y)
        i += 1
      
    image = cv2.rectangle(image,(smallest_x-50, smallest_y-50), (biggest_x+50, biggest_y), (0,0,0),-1)
    return image

Now we edit MediaPipe's code so that it can store the spelling of the words and display what is predicted. A sign will be recognized as part of the intended spelling if it remains the same for 8 frames. When space is gestured, we will then pass the word to the "text_to_speech" function, which turns it into audio.

We will declare a few variables:

  1. word ➜ Stores the letters spelled

  2. current_word ➜ Used to check if the sign remains the same

  3. STORE_FRAME ➜ Indicates how many frames are needed before a sign is registered, in this case, it is 8

  4. frame ➜ Indicates how many frames have passed; once it reaches 8 (STORE_FRAME), the sign is registered as part of the spelling.

  5. pred ➜ Current prediction of the sign shown

  6. prob ➜ Probability that the sign shown is predicted correctly

  word = ""
  current_word = ""
  STORE_FRAME = 8
  frame = 0
  pred = ""
  prob = 0

And finally, the code for live testing. After all, we are not interested in only theoretical accuracy, are we?

  mp_drawing = mp.solutions.drawing_utils
  mp_hands = mp.solutions.hands

  # For webcam input:
  cap = cv2.VideoCapture(0)
  with mp_hands.Hands(
    max_num_hands=2,
    min_detection_confidence=0.8,
    min_tracking_confidence=0.8) as hands:
    while cap.isOpened():
      success, image = cap.read()
      if not success:
        # If loading a video, use "break" instead of "continue".
        continue
      # Flip the image horizontally for a later selfie-view display, and convert
      # the BGR image to RGB.
      image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
      # To improve performance, optionally mark the image as not writeable to
      # pass by reference.
      image.flags.writeable = False
      results = hands.process(image)

      # Draw the hand annotations on the image.
      image.flags.writeable = True
      image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
      annotated_image = image.copy()
      censor_image = image.copy()
      if results.multi_hand_landmarks:
        hand = results.multi_handedness[0].classification[0].label
        new_landmarks = reference_from_wrist(results, hand)
              
        for hand_landmarks in results.multi_hand_landmarks:
          mp_drawing.draw_landmarks(
            annotated_image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
          pred, prob = handsign_prediction(new_landmarks, show = False)
      else:
        pred = ""
        prob = 0
      # describe the type of font 
      # to be used. 
      font = cv2.FONT_HERSHEY_SIMPLEX 
      prediction = "Prediction: "+ str(pred)
      probability = "Probability: "+ str(prob)
      # Use putText() method for 
      # inserting text on video 
      if prob > 0.6:
        cv2.putText(image,  
                    probability,  
                    (50, 80),  
                    font, 0.75,  
                    (0, 255, 0),  
                    2,  
                    cv2.LINE_4) 
        if pred == "middle_finger":
          image = draw_censor(results, censor_image)

        else:
          cv2.putText(image,  
                      prediction,  
                      (50, 50),  
                      font, 0.75,  
                      (0, 255, 0),  
                      2,  
                      cv2.LINE_4)
          if current_word != pred:
            frame = 0
            current_word = pred
          else:
            frame += 1

          if frame == STORE_FRAME:
            if current_word == "space":
              speech = threading.Thread(target=text_to_speech, name="speech", args=word)
              speech.start()
              word = ""
              current_word = ""
              frame = 0
            else:
              word += current_word
              current_word = ""
                              
                  
      cv2.putText(image,  
                  word,  
                  (50, 110),  
                  font, 0.75,  
                  (0, 255, 0),  
                  2,  
                  cv2.LINE_4) 
      cv2.imshow("MediaPipe Hands", image)
      
      if cv2.waitKey(5) & 0xFF == ord("q"):
          break
  cap.release()
  cv2.destroyAllWindows() 

When we sign, the code will make sure that the probability of the sign being correct is > 60% before showing it as a prediction. When "space" is signed, it will then create a thread to call the "text_to_speech" function, turning the word into audio!

Project in motion!

Signing

Censoring

Final Words

I hope you guys found this project interesting and have learnt something. As this is my first machine learning project and article, there may be some mistakes, so feel free to leave any feedback or comments. Thank you for your time!