Demo Video

1. Creating a function that gets the source code of the web page

We’ll be scraping Wiktionary!
getsource(){
    WORD=leben
    LANG_CODE=de # de for German (en for English, etc.); LANG itself is the shell's locale variable
    URL="https://${LANG_CODE}.wiktionary.org/wiki/${WORD}"
    curl "$URL" > url.html
}
getsource # calling the function
  • We declared a variable called WORD that holds the word to look up
  • We used curl to get the source code of the page and redirected the output to url.html
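
The function above hardcodes the word and the language. Here is a minimal sketch of a parameterised variant, assuming two hypothetical helpers, build_url and getsource_for, that are not part of the original script. build_url only assembles the address, so it can be checked without touching the network.

```shell
# build_url WORD [LANG_CODE]: print the Wiktionary URL for a word.
# Hypothetical helper for illustration; the language code defaults to de.
build_url() {
    printf 'https://%s.wiktionary.org/wiki/%s\n' "${2:-de}" "$1"
}

# getsource_for WORD [LANG_CODE]: fetch the page source into url.html.
# Hypothetical helper; defined here but not called, so nothing is downloaded.
getsource_for() {
    curl -s "$(build_url "$@")" > url.html
}

build_url leben      # prints https://de.wiktionary.org/wiki/leben
build_url house en   # prints https://en.wiktionary.org/wiki/house
```

Calling getsource_for leben de would then behave like the getsource function above.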

url.html now contains the raw HTML source of the page.

2. Formatting and cleaning up

Our goal is to get the audio files for the German word leben.
getaudio(){
    grep -Eo "//[a-zA-Z0-9./?=_%:-]*\.+(ogg|mp3|flac|aac|wav)" url.html
}
# ogg, mp3, flac, aac and wav are audio file extensions
Regex Explanation:
  • [-] Matches any single character within the given range
    • [a-zA-Z] Matches any lowercase or uppercase letter
    • [0-9] Matches any digit
  • . Matches any single character (inside [ ] it is a literal dot).
  • + The preceding item must match one or more times.
  • * The preceding item must match zero or more times.
  • \ The escape character: it makes the special character that follows literal, so \. matches a real dot.
  • ? The preceding item must match zero or one time.
  • | Alternation: the item on either side of | may match.
  • () Groups the enclosed terms into one entity
    • Example: ma(tri)?x matches max or matrix.
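
To see the pattern at work without downloading anything, here is a small sketch that runs it against a few made-up HTML lines. The sample markup and the name extract_audio are assumptions for illustration only; the real page source is much longer.

```shell
# Three invented lines: an audio source, an ordinary link and an image
sample='<source src="//upload.wikimedia.org/wikipedia/commons/3/3f/De-leben.ogg" type="audio/ogg">
<a href="//de.wiktionary.org/wiki/leben">leben</a>
<img src="//upload.wikimedia.org/wikipedia/commons/a/a1/Icon.png">'

# Same pattern as getaudio, reading from standard input instead of url.html
extract_audio() {
    grep -Eo '//[a-zA-Z0-9./?=_%:-]*\.+(ogg|mp3|flac|aac|wav)'
}

printf '%s\n' "$sample" | extract_audio
# prints //upload.wikimedia.org/wikipedia/commons/3/3f/De-leben.ogg
```

Only the first line survives: the match has to start at //, stay inside the bracketed character set, and end in a dot followed by one of the listed extensions.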

The output will give us the following links:

//upload.wikimedia.org/wikipedia/commons/3/3f/De-leben.ogg
//upload.wikimedia.org/wikipedia/commons/c/c2/De-leben2.ogg
//upload.wikimedia.org/wikipedia/commons/8/8a/De-riskant_leben.ogg
//upload.wikimedia.org/wikipedia/commons/6/6f/De-in_beschr%C3%A4nkten_Verh%C3%A4ltnissen_leben.ogg

All good!

Note: We need to add https: at the beginning of each line!
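
One possible way to add the prefix is sed. This is a sketch, and add_scheme is a hypothetical name chosen for illustration:

```shell
# Prefix protocol-relative links (those starting with //) with https:
add_scheme() {
    sed 's|^//|https://|'
}

printf '%s\n' '//upload.wikimedia.org/wikipedia/commons/3/3f/De-leben.ogg' | add_scheme
# prints https://upload.wikimedia.org/wikipedia/commons/3/3f/De-leben.ogg
```

Anchoring the pattern with ^ means only a leading // is rewritten, so any // further along the line is left alone.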

Let’s put it all together

getsource(){
    WORD=leben
    LANG_CODE=de # de for German (en for English, etc.); LANG itself is the shell's locale variable
    URL="https://${LANG_CODE}.wiktionary.org/wiki/${WORD}"
    curl -s "$URL" # -s keeps the progress meter out of the terminal
}

getaudio(){
    grep -Eo "//[a-zA-Z0-9./?=_%:-]*\.+(ogg|mp3|flac|aac|wav)"
}
getsource | getaudio | xargs -I {} echo "https:{}" > audios_list
# and optionally download them with
# yt-dlp -a audios_list
# or
while read -r line; do
    wget -N "$line" &
done < audios_list
wait # wait for the background downloads to finish

For a more practical script, check out this sample on GitHub.

The script used in the video:

For the latest version of the script, check out my GitHub account:

GitHub: https://github.com/AnasBoubechra/Pronounce_this

set -e

lang=
query=
tmpfile=

dmenu=false
fzf=false
download=false

download_dir="$HOME/.pt"
version="dev 2.0"

sname="$(basename "$0")"

show_help(){
    printf "
    Usage:  $sname [ -q ARG ] [ -l ARG ] [ -vfdm ]

         -q  The search query
               Example: $sname -q hallo

         -l  The language code (default: en)
               Example: $sname -l en -q hello

         -d  Download the audio files and store them locally. The default path is ${download_dir}
                
                * A folder is created for each query and all of its audio files are stored inside it.
                * Supports offline usage.

         -v  Show version

         -m  To use dmenu

         -f  To use fzf
"
}

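getsource(){

    # Define cleanup before registering it, so the trap never refers to a missing function
    cleanup(){
        if [ -n "$tmpfile" ]; then rm -f -- "$tmpfile"; fi
    }
    trap cleanup INT QUIT TERM EXIT

    query=$(printf '%s' "$query" | tr '[:upper:]' '[:lower:]')
    tmpfile=$(mktemp)

    # Fetch the page, keep only the audio links and prefix them with https:
    curl -s "https://${lang:=en}.wiktionary.org/wiki/${query}" | \
        grep -Eo '//upload[a-zA-Z0-9./?=_%:-]*\.+(ogg|mp3|wav|aac)' | sed 's/\/\//https:&/g' >"$tmpfile"
}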

show_version(){
        printf "$sname Version: %s\n" "$version"
        exit 0
}


check_dep(){
    # Discard the path printed by command -v; only its exit status matters
    if ! command -v "$1" >/dev/null 2>&1;then
        printf "%s is not installed !\n" "$1" >&2
        exit 127
    fi
}

_main_(){
    if [ -n "$(ls -A "$aud_dir" 2> /dev/null)" ];then # check that the dir is non-empty rather than merely existing
        selected=$(ls "$aud_dir" | $1)
        mpv "${aud_dir}/${selected}"
        exit 0
    else
        getsource
        if [ -s "$tmpfile" ];then
            selected=$(rev "$tmpfile" | cut -d'/' -f 1 | rev | $1 )
            grep -F -- "$selected" "$tmpfile" | mpv --playlist=-
            if [ "$download" = true ]; then download_aud; fi
        else
            printf "-> No results found for %s :/\n" "$query" >&2
            exit 0
        fi
    fi
}

download_aud(){
    mkdir -p "${aud_dir}" # Buggy
    printf "Downloading the audios ...\n"
    if parallel curl -O --output-dir "${aud_dir}" < "$tmpfile"; then
        printf "Download completed.\n"
    else
        printf "Unable to download the audio files !\n" >&2
        exit 1
    fi
}

while getopts "q:hl:dfmv" OPT; do
    case "$OPT" in 
        h) show_help && exit 0 ;; 
        q) query=$OPTARG ;;
        l) lang=$OPTARG ;;
        f) fzf=true;;
        d) download=true ;;
        m) dmenu=true ;;
        v) show_version ;;
        *) printf "Wrong usage !\nShow help with: %s -h\n" "$sname" >&2 && exit 1;;
    esac
done


# Reject an empty query or one that does not start with a letter
if ! expr "$query" : "[a-zA-Z]" 1>/dev/null; then
    printf "Error: The query must be non-empty and start with a letter !\n" >&2
    exit 1
fi

if [ "$dmenu" = "true" ] && [ "$fzf" = "true" ]; then
    printf "Error: -m and -f are mutually exclusive\n" >&2
    exit 2
fi

aud_dir="${download_dir}/${query}"

if $dmenu;then
    check_dep dmenu
     _main_ 'dmenu -l 10'
elif $fzf
then
    check_dep fzf 
    _main_ "fzf --reverse --height=40%"
else
    _main_ "head -n 1"
fi
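
The script builds its menu entries by cutting each URL down to its file name with rev | cut | rev, then maps the chosen name back to the full URL with grep. The sketch below replays that idea on a two-line sample list and compares it with an equivalent sed form. The list contents and variable names are assumptions for illustration.

```shell
# Sample list standing in for the temporary file the script builds
list=$(mktemp)
cat > "$list" <<'EOF'
https://upload.wikimedia.org/wikipedia/commons/3/3f/De-leben.ogg
https://upload.wikimedia.org/wikipedia/commons/c/c2/De-leben2.ogg
EOF

# The script's idiom: reverse each line, keep the first /-separated field, reverse back
names_rev=$(rev "$list" | cut -d'/' -f 1 | rev)

# Equivalent sed form: delete everything up to and including the last slash
names_sed=$(sed 's|.*/||' "$list")

printf '%s\n' "$names_sed"
# prints:
# De-leben.ogg
# De-leben2.ogg

# Map a chosen name back to its full URL with a fixed-string match
chosen_url=$(grep -F -- '/De-leben.ogg' "$list")
printf '%s\n' "$chosen_url"
# prints https://upload.wikimedia.org/wikipedia/commons/3/3f/De-leben.ogg

rm -f -- "$list"
```

The -F flag makes grep treat the name as plain text, so the dots in a file name do not act as regex wildcards.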

Enjoy 🤓 !

If you have any insights or suggestions, I would love to hear them 🙂.