Running Khmer ASR with 🤗 Transformers
Hugging Face has a library called transformers that simplifies the way you interact with machine learning models, and it is really easy to get started with.
There are many pretrained models available on Hugging Face for tasks like Text Classification, Token Classification, Speech to Text, Text to Speech, Text to Image, etc.
In this blog post, we will be running a pretrained Khmer Speech to Text (Automatic Speech Recognition) model by Vitou Phy.
I. Installation
You need Anaconda to follow along, so make sure you have it installed on your system.
1. Create New Anaconda Environment
conda create -n speech2text python==3.8 --yes
conda activate speech2text
2. Install PyTorch
Choose your operating system to install PyTorch.
- Linux with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia --yes
- macOS
# MPS acceleration is available on MacOS 12.3+
conda install pytorch::pytorch torchvision torchaudio -c pytorch
- Windows with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
- Windows
conda install pytorch torchvision torchaudio cpuonly -c pytorch
Check if PyTorch was installed.
import torch
# Check if CUDA is available
print(torch.cuda.is_available())
# => True
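If you installed the macOS build, CUDA will report False. On Apple Silicon you can check the MPS backend instead (a minimal sketch, assuming PyTorch 1.12 or newer):
import torch
# On Apple Silicon (macOS 12.3+), PyTorch exposes the MPS backend instead of CUDA
print(torch.backends.mps.is_available())
# => True on supported Macs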
3. Install 🤗 Transformers and others
In this example, we need a few extra libraries:
- librosa for loading the audio file
- pyctcdecode for decoding the outputs from the model
- kenlm for loading the language model
pip install transformers pyctcdecode librosa https://github.com/kpu/kenlm/archive/master.zip
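A quick way to confirm the packages are importable before moving on (just an optional sanity check):
import transformers
import librosa
import pyctcdecode
import kenlm

# If no ImportError is raised, the packages are ready to use
print(transformers.__version__, librosa.__version__)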
II. Inference
I will be using an audio file from this URL to test. Make sure you have downloaded it to your current working directory.
wget -O audio.mp3 https://datasets-server.huggingface.co/assets/seanghay/khmer_grkpp_speech/--/seanghay--khmer_grkpp_speech/train/12/audio/audio.mp3
The audio sample rate is 44.1 kHz, so we have to resample it to 16 kHz using ffmpeg.
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -acodec pcm_s16le audio_16khz.wav
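Alternatively, if you prefer to stay in Python, librosa (installed above) can do the resampling too. A minimal sketch, assuming your librosa build can decode MP3 (it falls back to audioread/ffmpeg) and using soundfile, which ships as a librosa dependency:
import librosa
import soundfile as sf

# load the original audio and resample it to 16 kHz mono on the fly
speech, sr = librosa.load("audio.mp3", sr=16000, mono=True)

# write the resampled audio as 16-bit PCM WAV
sf.write("audio_16khz.wav", speech, sr, subtype="PCM_16")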
After that, we can start using it with the model. We will be using the pipeline() function from the transformers library. The pipeline() function is the quickest way to start using machine learning models because it abstracts the low-level details away from us.
from transformers import pipeline
pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
# audio sample rate should be 16 kHz
output = pipe("audio_16khz.wav", chunk_length_s=10, stride_length_s=(4, 2))
print(output['text'])
Running this script will take quite some time because it will download the model weights, processors, and tokenizers from the Hub. Once downloaded, they will be cached for later use.
Predicted Output
ααααα ααΆα α ααα’αα αααα αΆα ααΈ ααα ααΆα α§αααα αααΈ αα αα·ααααΆ α’ααααααααααΆααΆα ααα αα αα»αααααααααα ααΌ αα·ααα αα ααααΆααΆαααααΆαα±αααα α αΆαα ααΎ αααα αααααα αα·α ααα ααΆα ααΆα αααααα ααααα·α αααΈα’α ααααα αα½αααααααα’αα·ααΆα αα ααα α’αα·ααΆα ααΆαααΆααΈ ααααααα αα αααα αααα ααΈ αααα αα½α αα ααααΆαααααΆα ααΈα ααΆαα αααα αα½α ααα α§ααααααααΈα±ααααα ααααααααααααΆ ααΆα αα αα ααΆαα’αα»ααα αΆαα ααΎ αααα αααααα ααΆ αααααααΆααΆα αα ααΆαα’ααααααΆαααΆααΈ ααααααα α’α ααααΎα ααα αα αααααΆ ααΆα ααα ααΆα ααΉαααΆα αααα»α ααΆαααΆα αααα±αα»αααα α·α ααΆαααΆααΈ ααααααα α αα ααααααααα·ααΆαα αα αααΆαα ααα αααααα α’ααα αα αΆ ααΆααΆαααΈααα αα α αααααααΆα ααααα ααΌα ααΆαααα ααΆααΆ ααΆαααΆααΈ ααααααα αα α αα ααΌα αααααΆ αααααα αααα»α ααΌαα·ααΆααααα ααΆαααΆααΈ ααααααα αα·α ααΆα αααα αα αααΌααΌ ααα ααΈ α ααα½α ααα ααΈα ααααΏα ααααΆαα ααα α₯ααααααααααααΎ ααΆα α ααα ααααΆα αα αα·ααΆαααΆα ααΈ ααΆαααΆα ααΈ αααααααα ααααα»α αααΆαααααα·ααΆα αα ααΌα ααΆ ααααα½α αα»α ααΆαα·ααΆα ααΆα ααααααΆααααα ααα α’ααααα αα ααΆαααΆααΈ ααααααα ααΆα α α αααα»α α‘αΎα ααΎααααΈ α’αααΆαααΆα α±αα αααααΆ αααααα α αΌα αα½α ααΆαααααΈ ααααΆ αααααΆα ααα ααααΆαα ααΆα αααα ααΆααααααΆα αα ααααΊ ααΌααα ααα ααααΆααα½α αααα»α ααααΉαααα·ααΆα αα αα ααααααααααα ααΆα αααα α αΌα αα αα αααα»α ααααΆα αα αα·α αααα»α ααα ααΎα α‘αΎα αα½α α±αα ααΆαααα
Ground truth
αααααααΆαα ααα’α»ααααα αΆαααΈ αααααΆαα§ααααααααΈαα αα αα»ααΆ α’ααααααααααΆααΆααα αααααααααααααΌαα·ααα αααααααΆααΆα ααααΆαα’αΆαα»αα αααααΎαααααααααα αα·ααααααΆαααΆαααααααααΆααα·α αα ααΈα―αα§αααα αα½α ααααα α’αα·ααΆααααααα’αα·ααΆαααΆαααΆααΈααααααα αα ααααααααααΈα’α€ αααα»αααα ααααΆαα’α α’α‘ αααα§ααααααααΈααα―α ααααα αααααΆα αααααααΆααΆααα ααααΆαα’αΆαα»αα αααααΎαααααααααα ααΆαααααααΆααΆαααααΆαα’αΆαα»αα αααααΆαααΆααΈααααααα α’αααααΎαααα αααααααΆααΆααα ααΆαααΉαααΆα αααα»αααΆαααΆαααααΆαα’αΆαα»αα αααααΆαααΆααΈααααααα α ααααααα·ααααα·ααΆα α αααααΆα ααααααααα α’ααααα αΆααααΆαααΈαααα α αα»α ααα ααΆααααααααΌα ααΆααααααΆααΆααΆαααΆααΈααααααα αα α ααααΌααααααΆαααααααααα»αααΌαα·ααΆαααααααΆαααΆααΈααααααα αα·αααΆααααα αααααΌααΌαααααΈα ααα½αα‘α’ααααΏα ααααΆααααα α§ααααααααααα ααααΎααΆαα αΆααααααΆαααΌααα·ααΆαααΆα α£ααΆαααΆα α£αα»α ααααααααα»αααΆαααΆααααΆαα·ααΆα ααααΌα ααΆ ααααα½ααα»ααΆαα·ααΆα ααΆααααααΆααααααα αααα’αΆαα»αα αααααΆαααΆααΈααααααα ααΆαα αααααα‘αΎα ααΎααααΈα’αααΆαααΆαα²αααααααΆααααααα αΌααα½αααΆαααααΈααααΆ αααααΆα αααααααΆαα ααΆαααααααΈαααΆαααΆα ααΌαααααΊ ααΌααΈα α‘α© αααα»αααααΉαααα·ααΆααααα αααα α’α αα»αααα αααααΆαααααα αΌααα αααααααααααααΆααα αα·ααααα»αααααααΎαα‘αΎαα‘αΎααα½αα²ααααΆαααα
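The pipeline() call is all most people will need, but if you want more control over decoding you can load the processor and model directly. Below is a minimal sketch that does plain greedy CTC decoding and skips the kenlm language model, so its output may differ slightly from the pipeline's:
import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "vitouphy/wav2vec2-xls-r-300m-khmer"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# load the resampled 16 kHz audio
speech, sr = librosa.load("audio_16khz.wav", sr=16000)

# run the model and take the most likely token at each frame (greedy CTC decoding)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

print(processor.batch_decode(predicted_ids)[0])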
III. Wrap up
If you have questions or run into issues while installing these, you can reach me on Twitter or email me at seanghay.dev@gmail.com
2023 © Seanghay Yath