Running Khmer ASR with 🤗 Transformers

Seanghay Yath

Hugging Face has a library called transformers which simplifies the way you interact with machine learning models. It is really easy to get started.

There are many pretrained models available on Hugging Face for tasks like Text Classification, Token Classification, Speech to Text, Text to Speech, Text to Image, etc.

In this blog post, we will run a pretrained Khmer Speech to Text (Automatic Speech Recognition) model by Vitou Phy.

I. Installation

You need Anaconda to follow along, so make sure you have it installed on your system.

1. Create New Anaconda Environment

conda create -n speech2text python==3.8 --yes
conda activate speech2text

2. Install PyTorch

Run only the command that matches your system.

# Linux or Windows with an NVIDIA GPU (CUDA 11.8)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia --yes
# macOS (MPS acceleration is available on macOS 12.3+)
conda install pytorch::pytorch torchvision torchaudio -c pytorch
# CPU only
conda install pytorch torchvision torchaudio cpuonly -c pytorch

Check if PyTorch was installed.

import torch
 
# Check if CUDA is available
print(torch.cuda.is_available())
# => True when a CUDA GPU is usable, False on CPU-only or macOS installs
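If you are not sure which kind of install you ended up with, a slightly longer check can report it. This is only a minimal sketch: `describe_torch_install` is a hypothetical helper name, and the `getattr` guard assumes that `torch.backends.mps` may be missing on older PyTorch builds.

```python
import importlib.util


def describe_torch_install():
    """Report whether PyTorch is importable and which accelerator it can see."""
    # find_spec returns None when the package is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "version": None, "device": None}

    import torch

    # Older PyTorch builds have no torch.backends.mps, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if torch.cuda.is_available():
        device = "cuda"
    elif mps is not None and mps.is_available():
        device = "mps"
    else:
        device = "cpu"
    return {"installed": True, "version": torch.__version__, "device": device}


print(describe_torch_install())
```

A result with `"device": "cpu"` still works for this tutorial. Inference will just be slower.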

3. Install 🤗 Transformers and others

In this example, we need a few extra libraries: transformers, pyctcdecode, librosa, and kenlm (installed from its GitHub archive).

pip install transformers pyctcdecode librosa https://github.com/kpu/kenlm/archive/master.zip
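To confirm that the four packages landed in the active environment, you can ask Python for their installed versions. This is a small sketch using the standard library only. `installed_versions` is a hypothetical helper, and it assumes the distribution names match the names used in the pip command above.

```python
from importlib import metadata

# Distribution names as used in the pip command (assumed to match).
REQUIRED = ("transformers", "pyctcdecode", "librosa", "kenlm")


def installed_versions(names=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```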

II. Inference

I will be using the audio clip at the URL below to test. Make sure you have downloaded it to your current working directory.

wget -O audio.mp3 https://datasets-server.huggingface.co/assets/seanghay/khmer_grkpp_speech/--/seanghay--khmer_grkpp_speech/train/12/audio/audio.mp3

The audio sample rate is 44.1 kHz, but the model expects 16 kHz input, so we resample it to 16 kHz mono using ffmpeg.

ffmpeg -i audio.mp3 -ar 16000 -ac 1 -acodec pcm_s16le audio_16khz.wav
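Before feeding the file to the model, it can be worth checking that the conversion did what we asked. The sketch below reads the WAV header with Python's built-in `wave` module. `wav_info` and `is_model_ready` are hypothetical helper names, and the expected values (16000 Hz, 1 channel, 2 bytes per sample) simply mirror the ffmpeg flags above.

```python
import wave


def wav_info(path):
    """Read basic format details from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


def is_model_ready(info, expected_rate=16000):
    """True when the file is 16 kHz, mono, 16-bit, matching the ffmpeg flags."""
    return (
        info["sample_rate"] == expected_rate
        and info["channels"] == 1
        and info["sample_width_bytes"] == 2
    )
```

For example, `is_model_ready(wav_info("audio_16khz.wav"))` should be true for the file produced by the ffmpeg command.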

After that, we can start using it with the model. We will be using the pipeline() function from the transformers library. The pipeline() function is the quickest way to start using machine learning models because it hides the loading, preprocessing, and decoding steps behind a single call.

from transformers import pipeline
 
pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
 
# the audio sample rate should be 16 kHz
output = pipe("audio_16khz.wav", chunk_length_s=10, stride_length_s=(4, 2))
 
print(output['text'])

Running this script will take quite some time the first time because it downloads the model weights, processors, and tokenizers from the hub. Once downloaded, they are cached for later use.
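The chunk_length_s=10 and stride_length_s=(4, 2) arguments tell the pipeline to process long audio in overlapping 10-second windows, treating the first 4 seconds and the last 2 seconds of each window as context that is thrown away when the pieces are stitched together. The sketch below only illustrates that idea in plain Python. `chunk_windows` is a hypothetical function, it works in seconds rather than model frames, and the library's real implementation may differ in its details.

```python
def chunk_windows(total_s, chunk_length_s=10.0, stride_left_s=4.0, stride_right_s=2.0):
    """Return (start, end, keep_start, keep_end) tuples, all in seconds.

    start/end bound the audio fed to the model for one chunk;
    keep_start/keep_end bound the part of that chunk whose output is kept.
    """
    step = chunk_length_s - stride_left_s - stride_right_s
    if step <= 0:
        raise ValueError("chunk length must exceed the combined strides")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        # The first chunk has no left context and the last has no right context.
        keep_start = start if is_first else start + stride_left_s
        keep_end = end if is_last else end - stride_right_s
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows


for window in chunk_windows(20.0):
    print(window)
```

The kept regions line up end to end, so every second of audio is transcribed exactly once while each chunk still sees some surrounding context.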

Predicted Output

αž€αŸ’αžšαŸ„αž˜ αž€αžΆαžš αž…αž„αŸ’αž’αž›αŸ‹ αž”αž„αŸ’αž αžΆαž‰ αž–αžΈ αž›αŸ„αž€ αž“αžΆαž™ αž§αžαŸ’αžαž˜ αžŸαž“αžΈ αžŸαŸ… αžŸαž·αž€αŸ’αžαžΆ αž’αž‚αŸ’αž‚αž˜αŸαž”αž‰αŸ’αž‡αžΆαž€αžΆαžš αžšαŸ„αž„ αž€αž„ αž‡αž»αžαŸ’αž’αž–αž›αžαŸαž˜αžšαŸˆ αž—αžΌ αž˜αž·αž“αŸ’αž‘ មេ αž”αŸ’αž‡αžΆαž€αžΆαžšαž€αž„αžšαžΆαž‡αž±αžœαžαŸ’αž αž αžΆαžαŸ‹ αž›αžΎ αž•αŸ’αž‘αŸƒ αž”αŸ’αžšαž‘αŸαžŸ αž“αž·αž„ αžŠαŸ„αž™ αž˜αžΆαž“ αž€αžΆαžš αž”αŸ’αžšαž‚αž›αŸ‹ αž•αŸαžšαž€αž·αž…αŸ’αž–αžΈαž’αž…αž™αžŠαž˜αŸ’αž˜ αžƒαž½αž„αŸ‹αžŸαŸ’αžšαŸαž„αž’αž—αž·αž”αžΆαž› αž“αŸƒ αž‚αžŽαŸˆ αž’αž—αž·αž”αžΆαž› αžšαžΆαž‡αž’αžΆαž“αžΈ αž—αŸ’αž“αŸ†αž–αŸαž‰ αž“αŸ… αžšαžŸαŸ€αž› αžαŸ’αž„αŸƒ αž‘αžΈ αž˜αŸ’αž—αŸƒ αž”αž½αž“ αžαŸ‚ αž‚αŸ†αž—αžΆαŸˆαž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αž˜αŸ’αž—αŸƒ αž˜αž½αž™ αž›αŸ„αž€ αž§αžŠαŸ’αžαž˜αžŸαŸ’αž“αžΈαž±αž€αžšαžŠαžαŸ‹ αžŸαŸ’αžšαŸ€αž„αž˜αŸαž”αž‰αŸ’αž‡αžΆ αž€αžΆαžš αžšαž„ αž€αž„ αžšαžΆαž‡αž’αžœαž»αžαžαž αžΆαžαŸ‹ αž›αžΎ αž•αŸ’αž‘αŸƒ αž”αŸ’αžšαž‘αŸαžŸ αž‡αžΆ αž˜αŸαž”αž‰αŸ’αž‡αžΆαž€αžΆαžš αž€αž„ αžšαžΆαž‡αž’αžœαžαžšαžŸαžšαžΆαž‡αž’αžΆαž“αžΈ αž—αŸ’αž“αŸ†αž–αŸαž‰ ធម αžŠαŸ†αžŽαžΎαžš αžŠαŸ„αž™ មេ αž”αž‰αŸ’αž‡αžΆ αž€αžΆαžš αžšαŸ„αž„ αž”αžΆαž“ αžŠαžΉαž€αž“αžΆαŸ† αž€αŸ’αžšαž»αž˜ αž€αžΆαžšαž„αžΆαžš αž€αž„αžšαž±αžœαž»αžαŸ’αžšαž αž·αžŸ αžšαžΆαž‡αž’αžΆαž“αžΈ αž—αŸ’αž“αŸ†αž–αŸαž‰ αž…αŸαž‰ αž”αŸ’αžšαžαž”αžαŸ’αžαž·αž€αžΆαžšαž…αŸ‚αž… αž˜αŸ‰αžΆαžŸαŸ‹ αžŠαŸ‚αž› αžŸαž˜αŸ’αžŠαŸαž… αž’αž‚αŸ’αž‚ មហអ αžŸαžΆαž“αžΆαž”αžαžΈαžŠαŸ‚αž› αž‡αŸ„ αž αŸ†αžŸαžŸαŸ‚αž“αž”αžΆαž“ αž•αŸ’αžαž›αŸ‹ αž‡αžΌαž“ αžαžΆαž˜αžšαž™αŸˆ αžŸαžΆαž›αžΆ αžšαžΆαž‡αž’αžΆαž“αžΈ αž—αŸ’αž“αŸ†αž–αŸαž‰ αž‘αŸ… αž…αŸ‚αž€ αž‡αžΌαž“ αž”αŸ’αžšαž‡αžΆ αž–αž›αžšαžŠαŸ’αž‹ αž€αŸ’αž“αž»αž„ αž—αžΌαž˜αž·αžŸαžΆαžŸαŸ’αžαŸ’αžš αžšαžΆαž‡αž’αžΆαž“αžΈ αž—αŸ’αž“αŸ†αž–αŸαž‰ αž“αž·αž„ αž”αžΆαž“ αž”αž‰αŸ’αž…αŸαž‰ αž˜αŸ‰αžΌαžαžΌ αž€αž„αŸ‹ αž”αžΈ αž…αŸ†αž“αž½αž“ αžŠαž”αŸ‹ αž–αžΈαžš αž‚αŸ’αžšαžΏαž„ αž”αŸ†αž–αžΆαž€αŸ‹ αžŠαŸ„αž™ αž₯αžŸαŸ’αžαžŸαž“αžŸαž–αŸ’αž’αŸ’αžœαžΎ αž€αžΆαžš αž…αž€αŸ’αžš αž•αŸ’αžŸαžΆαž™ αž“αŸ… αžœαž·αž’αžΆαž“αž€αžΆαžš αž”αžΈ 
αž€αžΆαžšαž–αžΆαžš αž”αžΈ αž‚αž˜αžŽαŸαžšαž”αžŸαŸ‹ αž”αŸ’αžšαž˜αž»αž αž€αžšαžΆαž‡αžšαž›αžαž—αž·αž”αžΆαž› αž€αŸ αžŠαžΌαž… αž‡αžΆ αž€αŸ’αžšαžŸαž½αž„ សុខ αžαžΆαž—αž·αž”αžΆαž› តអម αžšαž™αŸˆαŸ’αžŸαžΆαžŸαŸ†αž›αŸαž„ αžŠαŸ‚αž› αž’αžœαžαŸ’αžšαž αž“αŸ…αžšαžΆαž‡αž’αžΆαž“αžΈ αž—αŸ’αž“αŸ†αž–αŸαž‰ αž”αžΆαž“ αž…αž„ αž€αŸ’αžšαž»αž„ αž‘αžΎαž„ αžŠαžΎαž˜αŸ’αž”αžΈ αž’αŸ†αž–αžΆαž„αž“αžΆαžœ αž±αŸ’αž™ αž”αŸ’αžšαž‡αžΆ αž–αž›αžšαžŠαŸ’αž‹ αž…αžΌαž› រួម αžŸαžΆαž˜αž‚αŸ’αž‚αžΈ αž‚αŸ’αž“αžΆ αž”αž„αŸ’αž€αžΆαžš αž‘αž”αŸ‹ αžŸαŸ’αž€αžΆαžαŸ‹ αž€αžΆαžš αž†αŸ’αž›αž„ αžšαžΆαž€αžšαŸ€αž”αžŠαžΆαžš αž“αŸ… αž‡αŸ†αž„αžΊ αž€αžΌαžœαŸ€αž αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αž½αž“ αž€αŸ’αž“αž»αž„ αž–αŸ’αžšαžΉαžαŸ’αžαž·αž€αžΆαžš αžŸαž αž‚αž˜ αž˜αŸ’αž—αŸƒαž€αŸ†αž—αŸ’αžŠαŸ‚αž› αž”αžΆαž“ αž†αŸ’αž›αž„ αž…αžΌαž› αžŸαž αž‚αž˜ αž‘αŸ’αžšαž»αž„ αž‘αŸ’αžšαžΆαž™ αž’αŸ† αž“αž·αž„ αž€αŸ†αž–αž»αž„ αž˜αŸ‰αž αž€αžΎαž“ αž‘αžΎαž„ αž‚αž½αžš αž±αŸ’αž™ αž”αžΆαžšαž˜αŸ’αž˜

Ground truth

αž€αŸ’αžšαŸ„αž˜αž€αžΆαžšαž…αž„αŸ’αž’αž»αž›αž”αž„αŸ’αž αžΆαž‰αž–αžΈ αž›αŸ„αž€αž“αžΆαž™αž§αžαŸ’αžαž˜αžŸαŸαž“αžΈαž™αŸ αžŸαŸ… សុខអ αž’αž‚αŸ’αž‚αž˜αŸαž”αž‰αŸ’αž‡αžΆαž€αžΆαžšαžšαž„ αž€αž„αž™αŸ„αž’αž–αž›αžαŸαž˜αžšαž—αžΌαž˜αž·αž“αŸ’αž‘ αž˜αŸαž”αž‰αŸ’αž‡αžΆαž€αžΆαžš αž€αž„αžšαžΆαž‡αž’αžΆαžœαž»αž’αž αžαŸ’αžαž›αžΎαž•αŸ’αž‘αŸƒαž”αŸ’αžšαž‘αŸαžŸ αž“αž·αž„αžŠαŸ„αž™αž˜αžΆαž“αž€αžΆαžšαž”αŸ’αžšαž‚αž›αŸ‹αž—αžΆαžšαž€αž·αž…αŸ’αž…αž–αžΈαž―αž€αž§αžαŸ’αžαž˜ αžƒαž½αž„ αžŸαŸ’αžšαŸαž„ αž’αž—αž·αž”αžΆαž›αž“αŸƒαž‚αžŽαŸˆαž’αž—αž·αž”αžΆαž›αžšαžΆαž‡αž’αžΆαž“αžΈαž—αŸ’αž“αŸ†αž–αŸαž‰ αž“αŸ…αžšαžŸαŸ€αž›αžαŸ’αž„αŸƒαž‘αžΈαŸ’αŸ€ αžαŸ‚αž€αž»αž˜αŸ’αž—αŸˆ αž†αŸ’αž“αžΆαŸ†αŸ’αŸ αŸ’αŸ‘ αž›αŸ„αž€αž§αžαŸ’αžαž˜αžŸαŸαž“αžΈαž™αŸαž―αž€ αžšαŸαžαŸ’αž“ αžŸαŸŠαŸ’αžšαžΆαž„ αž˜αŸαž”αž‰αŸ’αž‡αžΆαž€αžΆαžšαžšαž„ αž€αž„αžšαžΆαž‡αž’αžΆαžœαž»αž’αž αžαŸ’αžαž›αžΎαž•αŸ’αž‘αŸƒαž”αŸ’αžšαž‘αŸαžŸ αž‡αžΆαž˜αŸαž”αž‰αŸ’αž‡αžΆαž€αžΆαžšαž€αž„αžšαžΆαž‡αž’αžΆαžœαž»αž’αž αžαŸ’αžαžšαžΆαž‡αž’αžΆαž“αžΈαž—αŸ’αž“αŸ†αž–αŸαž‰ αž’αž˜αžŠαŸ†αžŽαžΎαžšαžŠαŸ„αž™ αž˜αŸαž”αž‰αŸ’αž‡αžΆαž€αžΆαžšαžšαž„ αž”αžΆαž“αžŠαžΉαž€αž“αžΆαŸ† αž€αŸ’αžšαž»αž˜αž€αžΆαžšαž„αžΆαžšαž€αž„αžšαžΆαž‡αž’αžΆαžœαž»αž’αž αžαŸ’αžαžšαžΆαž‡αž’αžΆαž“αžΈαž—αŸ’αž“αŸ†αž–αŸαž‰ αž…αŸαž‰αž”αŸ’αžšαžαž·αž”αžαŸ’αžαž·αž€αžΆαžš αž…αŸ‚αž€αž˜αŸ‰αžΆαžŸ αžŠαŸ‚αž›αžŸαž˜αŸ’αžαŸαž… αž’αž‚αŸ’αž‚αž˜αž αžΆαžŸαŸαž“αžΆαž”αžαžΈαžαŸαž‡αŸ„ αž αŸŠαž»αž“ αžŸαŸ‚αž“ αž”αžΆαž“αž•αŸ’αžαž›αŸ‹αž‡αžΌαž“ αžαžΆαž˜αžšαž™αŸˆαžŸαžΆαž›αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž—αŸ’αž“αŸ†αž–αŸαž‰ αž‘αŸ…αž…αŸ‚αž€αž‡αžΌαž“αž”αŸ’αžšαž‡αžΆαž–αž›αžšαžŠαŸ’αž‹αž€αŸ’αž“αž»αž„αž—αžΌαž˜αž·αžŸαžΆαžŸαŸ’αžšαŸ’αžαžšαžΆαž‡αž’αžΆαž“αžΈαž—αŸ’αž“αŸ†αž–αŸαž‰ αž“αž·αž„αž”αžΆαž“αž”αž‰αŸ’αž…αŸαž‰αž˜αŸ‰αžΌαžαžΌαž€αž„αŸ‹αž”αžΈαž…αŸ†αž“αž½αž“αŸ‘αŸ’αž‚αŸ’αžšαžΏαž„ αž”αŸ†αž–αžΆαž€αŸ‹αžŠαŸ„αž™ αž§αž‚αŸ’αžƒαŸ„αžŸαž“αžŸαŸαž–αŸ’αž‘ αž’αŸ’αžœαžΎαž€αžΆαžšαž…αžΆαž€αŸ‹αž•αŸ’αžŸαžΆαž™αž“αžΌαžœαžœαž·αž’αžΆαž“αž€αžΆαžš αŸ£αž€αžΆαžšαž–αžΆαžš αŸ£αž€αž»αŸ† αžšαž”αžŸαŸ‹αž”αŸ’αžšαž˜αž»αžαžšαžΆαž‡αžšαžΆαžŠαŸ’αž‹αžΆαž—αž·αž”αžΆαž› 
αž€αŸαžŠαžΌαž…αž‡αžΆ αž€αŸ’αžšαžŸαž½αž„αžŸαž»αžαžΆαž—αž·αž”αžΆαž› αžαžΆαž˜αžšαž™αŸˆαžŸαžΆαžšαžŸαž˜αŸ’αž›αŸαž„ αžŠαŸ‚αž›αž’αžΆαžœαž»αž’αž αžαŸ’αžαžšαžΆαž‡αž’αžΆαž“αžΈαž—αŸ’αž“αŸ†αž–αŸαž‰ αž”αžΆαž“αž…αž„αž€αŸ’αžšαž„αž‘αžΎαž„ αžŠαžΎαž˜αŸ’αž”αžΈαž’αŸ†αž–αžΆαžœαž“αžΆαžœαž²αŸ’αž™αž”αŸ’αžšαž‡αžΆαž–αž›αžšαžŠαŸ’αž‹αž…αžΌαž›αžšαž½αž˜αžŸαžΆαž˜αž‚αŸ’αž‚αžΈαž‚αŸ’αž“αžΆ αž”αž„αŸ’αž€αžΆαžš αž‘αž”αŸ‹αžŸαŸ’αž€αžΆαžαŸ‹ αž€αžΆαžšαž†αŸ’αž›αž„αžšαžΈαž€αžšαžΆαž›αžŠαžΆαž› αž“αžΌαžœαž‡αŸ†αž„αžΊ αž€αžΌαžœαžΈαžŠ ៑៩ αž€αŸ’αž“αž»αž„αž–αŸ’αžšαžΉαžαŸ’αžαž·αž€αžΆαžšαžŽαŸαžŸαž αž‚αž˜αž“αŸ αŸ’αŸ αž€αž»αž˜αŸ’αž—αŸˆ αžŠαŸ‚αž›αž”αžΆαž“αž†αŸ’αž›αž„αž…αžΌαž›αžŸαž αž‚αž˜αž“αŸαž‘αŸ’αžšαž„αŸ‹αž‘αŸ’αžšαžΆαž™αž’αŸ† αž“αž·αž„αž€αŸ†αž–αž»αž„αž”αž“αŸ’αžαž€αžΎαž“αž‘αžΎαž“αž‘αžΎαž„αž‚αž½αžšαž²αŸ’αž™αž”αžΆαžšαž˜αŸ’αž—
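To put a number on how close the prediction is to the ground truth, one common choice is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The code below is a sketch and not a measurement from this post. The function names are hypothetical, whitespace is removed first because the prediction is space-separated while the ground truth mostly is not, and characters are compared as Unicode code points rather than as full Khmer grapheme clusters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one row of memory."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate after stripping all whitespace from both strings."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference must contain at least one non-space character")
    return edit_distance(ref, hyp) / len(ref)
```

Calling `cer(ground_truth, output["text"])` with the two transcripts gives a value where 0.0 means a perfect match and larger values mean more character-level errors.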

III. Wrap up

If you have questions or run into issues while installing these, you can reach me on Twitter or email me at seanghay.dev@gmail.com

2023 © Seanghay Yath