
Audio Video Processing Concepts


Foundational Concepts of Audio and Video Processing
#

Whether you are a multimedia professional or a deep learning engineer, working with audio and video requires an understanding of a few core concepts. This guide focuses on the key ones.

A digital microphone is a device that captures sound (air pressure variations) and converts it into digital signals. Internally, it contains an analog-to-digital converter (ADC) that performs the conversion from analog audio to digital audio.

A digital camera captures light (optical signals) through an image sensor (like CMOS or CCD), which generates analog electrical signals. These signals are then passed through an ADC to be converted into digital image data.

Let’s Understand Images and Colors
#

An image is a grid of pixels, where each pixel represents the color and intensity of a small part of the image.

  • A black-and-white (grayscale) image typically has one channel (layer), where each pixel value ranges from 0 (black) to 255 (white).
  • A color image usually uses three channels (RGB): Red, Green, and Blue, each with a value between 0 and 255, requiring 3 bytes per pixel.
  • If the image supports transparency, it includes a fourth channel called Alpha, which represents opacity; its value varies from 0 (fully transparent) to 255 (fully opaque).
  • A pixel in an RGBA image takes 4 bytes (1 byte per channel).
  • A 1024×1024 RGB image has 2²⁰ pixels, and since each pixel uses 3 bytes, the total uncompressed size is 3 × 2²⁰ = 3,145,728 bytes (3 MB). With an alpha channel (RGBA, which a PNG file may store) the same image takes 4 MB uncompressed, and as a grayscale image only 1 MB. Formats such as PNG compress this data, so the file on disk is usually smaller.
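The size arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming 8 bits (1 byte) per channel and no compression; the function name is made up for this example.

```python
def uncompressed_image_bytes(width, height, channels, bytes_per_channel=1):
    """Raw size of an image: one value per channel for every pixel."""
    return width * height * channels * bytes_per_channel


# Same 1024x1024 image stored with 1, 3, and 4 channels.
for name, channels in [("grayscale", 1), ("RGB", 3), ("RGBA", 4)]:
    size = uncompressed_image_bytes(1024, 1024, channels)
    print(f"{name}: {size:,} bytes ({size // 2**20} MB)")
```

Here 1 MB is counted as 2²⁰ bytes, matching the figures in the list.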

Let’s understand Sound and Air Pressure Signals.
#

Sound is a continuous analog signal created by air pressure variations. In digital audio, this signal is captured by sampling it at regular intervals.

  • A sample is one measurement of amplitude at a given point in time.
  • The number of samples per second is the sample rate (e.g., 44.1kHz means 44,100 samples per second).
  • Each sample can be stored using:
    • 8-bit (1 byte) → 256 possible values
    • 16-bit (2 bytes) → 65,536 values (more common for quality audio)
    • 24-bit or 32-bit → used in high-fidelity applications
  • A sample can be stored in a single byte, but most modern audio uses 16-bit samples, meaning 2 bytes per sample.
  • Audio file size (uncompressed)
    • If sample rate: 44,100 samples/sec
    • and sample size: 16-bit (2 bytes) (Audacity calls this the "format")
    • and channels: 1 (mono, typical for voice recordings)
    • then size per second = 44,100 samples * 2 bytes = 88,200 bytes ≈ 86.1 KB
    • For stereo (2 channels), it doubles:
    • 44,100 * 2 bytes * 2 channels = 176,400 bytes ≈ 172.3 KB
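The same calculation as a small Python sketch, assuming uncompressed PCM audio with no file header; the function name is illustrative.

```python
def pcm_bytes_per_second(sample_rate, bit_depth, channels):
    """Uncompressed audio data rate: samples/sec x bytes/sample x channels."""
    return sample_rate * (bit_depth // 8) * channels


mono = pcm_bytes_per_second(44_100, 16, 1)
stereo = pcm_bytes_per_second(44_100, 16, 2)
print(f"mono:   {mono:,} bytes/sec ({mono / 1024:.1f} KB)")
print(f"stereo: {stereo:,} bytes/sec ({stereo / 1024:.1f} KB)")
print(f"one minute of stereo: {stereo * 60 / 2**20:.1f} MB")
```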

🎧 Audio Processing Concepts
#

1. Sampling Rate
#

  • Number of samples taken per second (e.g., 44.1kHz, 16kHz).
  • Higher rate = more detail, bigger files.
  • Common: 44.1kHz (music), 16kHz (speech), 8kHz (telephony).

2. Bit Depth
#

  • Number of bits per sample (e.g., 16-bit, 24-bit).
  • Higher bit depth = more dynamic range.
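As a rough guide, each extra bit adds about 6 dB of dynamic range. The sketch below computes the theoretical figure for an ideal linear PCM quantizer, which is an assumption here; real converters achieve somewhat less.

```python
import math


def dynamic_range_db(bit_depth):
    """Ratio of the largest to the smallest representable step, in decibels."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (8, 16, 24):
    print(f"{bits}-bit: about {dynamic_range_db(bits):.1f} dB")
```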

3. Channels
#

  • Mono = 1 channel, Stereo = 2 channels.
  • Multichannel = 5.1, 7.1 surround sound.

4. Bitrate
#

  • Data rate (e.g., 128kbps, 64kbps).
  • Affects audio quality and file size.
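Bitrate and duration together determine the size of an encoded stream. A minimal sketch, assuming 1 kbps = 1,000 bits per second and ignoring container overhead:

```python
def encoded_size_bytes(bitrate_kbps, duration_seconds):
    """Approximate stream size: bits per second x seconds, divided by 8."""
    return bitrate_kbps * 1000 * duration_seconds // 8


# A three-minute track at two common bitrates.
for kbps in (64, 128):
    size = encoded_size_bytes(kbps, 180)
    print(f"{kbps} kbps for 3 min: {size:,} bytes")
```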

5. Audio Codecs
#

  • Compression algorithms (e.g., MP3, AAC, Opus, FLAC).
  • Lossy (MP3) vs Lossless (FLAC).

6. Waveform & Spectrogram
#

  • Visual tools to represent audio signals in time (waveform) or frequency (spectrogram).
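To make the time versus frequency distinction concrete, the sketch below generates a 440 Hz tone (the waveform) and uses a naive discrete Fourier transform to find its strongest frequency. A spectrogram repeats this kind of analysis over many short windows. The sample rate and window length are assumptions chosen for the example, and real tools use a much faster FFT.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT: magnitude of each frequency bin below the Nyquist frequency."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]


SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)
N = 400             # 50 ms window, so each bin is 20 Hz wide

# Time domain: amplitude of a pure tone at each sample instant.
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(N)]

# Frequency domain: which bin carries the most energy?
spectrum = magnitude_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N
print(f"strongest frequency: {peak_hz:.0f} Hz")
```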

7. Noise Reduction, Equalization, Normalization
#

  • Signal processing techniques for improving or adjusting audio.
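Of the three, normalization is the simplest to sketch. The example below shows peak normalization only, assuming samples are floats in the range -1.0 to 1.0; loudness-based normalization works differently.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale all samples by one gain so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # silence (or empty input) stays unchanged
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet = [0.1, -0.25, 0.5]
print(peak_normalize(quiet))
```

Because every sample is multiplied by the same gain, the shape of the waveform is preserved; only its overall level changes.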

8. Decibel (dB)
#

What is a movie?
#

A movie is a series of images (called frames) played in rapid succession to create the illusion of motion. When a player displays 24 images per second, the human eye perceives continuous movement. The number of frames shown per second is the frame rate (here 24 fps). Higher frame rates such as 60 fps are often used for action sequences or sports, where smoother and more fluid motion is needed.

🎥 Video Processing Concepts
#

1. Resolution
#

  • Frame size: 1920x1080 (1080p), 1280x720 (720p), etc.
  • Higher res = sharper image + larger file size.

2. Frame Rate (fps)
#

  • Frames per second: 24, 30, 60 fps.
  • Higher = smoother motion, but more data. Fast action benefits from a high frame rate so that quick transitions are captured; natural scenes, slow-moving footage, and documentaries work well at a lower frame rate.
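To see how resolution and frame rate drive data volume, the sketch below computes the uncompressed data rate of a video stream. It assumes 3 bytes per pixel (8-bit RGB) and no compression, which is why the numbers are so large.

```python
def raw_video_bytes_per_second(width, height, fps, bytes_per_pixel=3):
    """Uncompressed video rate: bytes per frame x frames per second."""
    return width * height * bytes_per_pixel * fps


for fps in (24, 30, 60):
    rate = raw_video_bytes_per_second(1920, 1080, fps)
    print(f"1080p at {fps} fps: {rate / 1_000_000:.1f} million bytes/sec")
```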

3. Aspect Ratio
#

  • Width to height ratio (e.g., 16:9, 4:3, 9:16 for vertical videos).
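An aspect ratio is just the frame size reduced to lowest terms, which can be sketched with the greatest common divisor:

```python
from math import gcd


def aspect_ratio(width, height):
    """Reduce width:height to its simplest whole-number form."""
    divisor = gcd(width, height)
    return f"{width // divisor}:{height // divisor}"


for w, h in [(1920, 1080), (1280, 720), (640, 480), (1080, 1920)]:
    print(f"{w}x{h} -> {aspect_ratio(w, h)}")
```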

4. Bitrate
#

  • Controls video quality & size (e.g., 1500kbps).
  • Can be constant (CBR) or variable (VBR).
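One way to appreciate a target bitrate is to compare it with the uncompressed rate of the same video. The sketch below estimates that compression ratio, assuming 8-bit RGB source frames and 1 kbps = 1,000 bits per second; the 720p at 30 fps example is an illustration, not a recommendation.

```python
def compression_ratio(width, height, fps, target_kbps, bytes_per_pixel=3):
    """How many times smaller the encoded stream is than the raw frames."""
    raw_bits_per_second = width * height * bytes_per_pixel * 8 * fps
    return raw_bits_per_second / (target_kbps * 1000)


ratio = compression_ratio(1280, 720, 30, 1500)
print(f"720p30 at 1500 kbps is roughly {ratio:.0f}:1 compression")
```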



🎞️ 5. Video Codecs
#

A codec (short for coder-decoder) is an algorithm that compresses and decompresses video to reduce file size without (too much) loss of quality.

Popular video codecs:

  • H.264 (AVC): Most widely supported. Excellent balance of quality, size, and speed. Used in YouTube, streaming, mobile devices.
  • H.265 (HEVC): Successor to H.264. ~50% better compression at the same quality, but requires more processing power and has limited hardware support on older devices.
  • VP9: Open-source alternative to H.265 developed by Google. Used in YouTube for higher resolution videos.
  • AV1: Newer open-source codec designed for modern high-res streaming (4K/8K). Best compression, but slower encoding and not yet universally supported in hardware.

👉 Summary:
#

| Codec | Compression | Support | Use Case |
|-------|-------------|---------|----------|
| H.264 | Good | Universal | Standard streaming & recording |
| H.265 | Excellent | Limited (older devices may lag) | 4K, HEVC content |
| VP9 | Very Good | Web (Chrome/YouTube) | High-res web video |
| AV1 | Best | Growing (newer devices) | Future-proof streaming |

📦 6. Containers / Formats
#

A container is like a box that holds different types of data streams (video, audio, subtitles, metadata), all bundled into a single file.

Popular containers:

  • MP4: Most common, supports H.264, H.265. Compatible with nearly everything (browsers, phones, TVs).
  • MKV: Open-source, supports almost any codec, very flexible. Common for high-quality video (e.g., Blu-ray rips).
  • MOV: Apple’s container format. High quality, used in professional editing.
  • AVI: Older Microsoft format, less efficient, but still seen in legacy apps.

👉 Analogy:
Think of codec = language, and container = suitcase that holds the movie + soundtrack + extras.


🎨 7. Color Space & Chroma Subsampling
#

Color Space
#

  • A method of representing color. Common ones:
    • RGB: Red-Green-Blue, used for monitors and raw images.
    • YUV (or YCbCr): Used in video. It separates luma (Y) = brightness from chroma (U/V) = color.
      • Humans perceive brightness better than color, so color information can be compressed more heavily than brightness.
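The split into luma and chroma can be sketched as a per-pixel conversion. The coefficients below are the full-range BT.601 ones used by JPEG/JFIF; that choice is an assumption for this example, since video standards such as BT.709 use different weights and value ranges.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601 weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value):
        # Round to the nearest integer and keep it inside 0..255.
        return max(0, min(255, round(value)))

    return clamp(y), clamp(cb), clamp(cr)


# Gray pixels carry brightness only: both chroma values sit at the neutral 128.
for pixel in [(0, 0, 0), (128, 128, 128), (255, 255, 255)]:
    print(pixel, "->", rgb_to_ycbcr(*pixel))
```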

Chroma Subsampling
#

  • Technique to reduce file size by lowering color resolution while preserving brightness.
  • Formats:
    • 4:4:4 – no subsampling (full color + brightness)
    • 4:2:2 – some color reduction
    • 4:2:0 – most common in compressed video (e.g., H.264)

✅ 4:2:0 means:

  • Chroma is halved both horizontally and vertically: each 2×2 block of 4 luma (Y) samples shares 1 U sample and 1 V sample.
  • Result: Good visual quality + great compression.
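The saving from each scheme can be worked out by counting samples per frame. A minimal sketch, assuming 8 bits per sample and even frame dimensions; the table of divisors is written out for this example.

```python
# How much each chroma plane is reduced: (horizontal divisor, vertical divisor).
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def yuv_frame_bytes(width, height, scheme):
    """Bytes per frame: one full luma plane plus two reduced chroma planes."""
    h_div, v_div = SUBSAMPLING[scheme]
    luma = width * height
    chroma_plane = (width // h_div) * (height // v_div)
    return luma + 2 * chroma_plane


for scheme in SUBSAMPLING:
    size = yuv_frame_bytes(1920, 1080, scheme)
    print(f"{scheme}: {size:,} bytes per 1080p frame")
```

4:2:0 halves the raw frame size compared with 4:4:4 before the codec does any further compression.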

8. Keyframes & GOP (Group of Pictures)
#

Video compression doesn’t store every frame in full. It stores changes between frames to save space.

  • I-frame (Intra-coded frame): A full image, like a JPEG. Can be decoded on its own.
  • P-frame (Predicted frame): Stores only the changes from an earlier frame.
  • B-frame (Bi-directional frame): Stores changes relative to both an earlier and a later frame.

GOP = Group of Pictures
#

  • A sequence like: I B B P B B P ... I
  • Starts with an I-frame, followed by P/B-frames.
  • Smaller GOP = easier seeking/editing, larger GOP = better compression.
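The trade-off can be illustrated with a toy model. The sketch below builds a frame-type pattern and finds the I-frame a player must start from when seeking. It assumes a fixed GOP length and ignores B-frame reordering, both simplifications; real encoders may vary the GOP and place keyframes at scene cuts.

```python
def gop_pattern(gop_size, b_frames=2):
    """Frame types for one GOP: an I-frame, then runs of B-frames ending in a P."""
    frames = ["I"]
    while len(frames) < gop_size:
        run = min(b_frames, gop_size - len(frames) - 1)
        frames.extend("B" * run)
        frames.append("P")
    return frames


def keyframe_for(frame_index, gop_size):
    """Index of the I-frame that decoding must start from to reach frame_index."""
    return (frame_index // gop_size) * gop_size


print(" ".join(gop_pattern(7)))

# Seeking to frame 100: a larger GOP means more frames to decode first.
for gop_size in (12, 30, 250):
    start = keyframe_for(100, gop_size)
    print(f"GOP {gop_size}: start at frame {start}, decode {100 - start + 1} frames")
```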

Common to Both
#

1. Transcoding
#

  • Re-encoding from one format/codec to another.
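In practice transcoding is often done with FFmpeg. The sketch below only builds the command line and does not run it; the file names are hypothetical, and the codec and bitrate defaults are illustrative values taken from earlier sections.

```python
import shlex


def build_transcode_command(src, dst, video_codec="libx264", video_kbps=1500,
                            audio_codec="aac", audio_kbps=128):
    """Argument list for an FFmpeg re-encode of both video and audio streams."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", video_codec, "-b:v", f"{video_kbps}k",  # video codec and bitrate
        "-c:a", audio_codec, "-b:a", f"{audio_kbps}k",  # audio codec and bitrate
        dst,
    ]


# Hypothetical file names, for illustration only.
command = build_transcode_command("input.avi", "output.mp4")
print(shlex.join(command))
```

With FFmpeg installed, the list could be passed to `subprocess.run(command, check=True)`.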

2. Compression
#

  • Lossy (discard some data) vs Lossless (no data loss).

3. Latency / Sync
#

  • Keeping audio/video aligned.
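Alignment comes down to mapping both streams onto one timeline. A minimal sketch, assuming a constant frame rate and sample rate, that finds the audio sample at which a given video frame should appear:

```python
def audio_sample_for_frame(frame_index, fps=24, sample_rate=44_100):
    """Index of the audio sample that plays when the given frame is shown."""
    return frame_index * sample_rate // fps


for frame in (0, 1, 24, 240):
    sample = audio_sample_for_frame(frame)
    print(f"frame {frame} -> audio sample {sample}")
```

Real players rely on timestamps stored in the container rather than on this arithmetic, since frame rates can vary.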

4. Streaming vs Local Playback
#

  • Adaptive bitrate streaming (HLS, DASH), buffering, encoding-on-the-fly.
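The core idea of adaptive bitrate streaming is to pick, for each segment, the best rendition the current connection can sustain. The sketch below shows one simple selection rule; the bitrate ladder and the 0.8 safety margin are hypothetical values, and real HLS or DASH players use more elaborate heuristics.

```python
# Hypothetical bitrate ladder: (rendition name, video bitrate in kbps).
LADDER = [("360p", 800), ("720p", 1500), ("1080p", 4500)]


def pick_rendition(bandwidth_kbps, ladder=LADDER, safety=0.8):
    """Highest-bitrate rendition that fits within a fraction of the bandwidth."""
    budget = bandwidth_kbps * safety
    affordable = [r for r in ladder if r[1] <= budget]
    if not affordable:
        # Nothing fits: fall back to the lowest bitrate instead of stalling.
        return min(ladder, key=lambda r: r[1])
    return max(affordable, key=lambda r: r[1])


for measured in (500, 2000, 10_000):
    name, kbps = pick_rendition(measured)
    print(f"{measured} kbps available -> {name} ({kbps} kbps)")
```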

🛠 Tools to Learn
#

  • FFmpeg – Swiss Army knife for audio/video.
  • Audacity – Audio editing.
  • HandBrake – GUI video transcoder.
  • OBS Studio – Recording & streaming.
  • Adobe Premiere / DaVinci Resolve – Professional video editing.
  • Python Libs – pydub, moviepy, ffmpeg-python, OpenCV for automation.

Dr. Hari Thapliyaal

Dr. Hari Thapliyal is a seasoned professional and prolific blogger with a multifaceted background that spans the realms of Data Science, Project Management, and Advait-Vedanta Philosophy. Holding a Doctorate in AI/NLP from SSBM (Geneva, Switzerland), Hari has earned Master's degrees in Computers, Business Management, Data Science, and Economics, reflecting his dedication to continuous learning and a diverse skill set. With over three decades of experience in management and leadership, Hari has proven expertise in training, consulting, and coaching within the technology sector. His extensive 16+ years in all phases of software product development are complemented by a decade-long focus on course design, training, coaching, and consulting in Project Management. In the dynamic field of Data Science, Hari stands out with more than three years of hands-on experience in software development, training course development, training, and mentoring professionals. His areas of specialization include Data Science, AI, Computer Vision, NLP, complex machine learning algorithms, statistical modeling, pattern identification, and extraction of valuable insights. Hari's professional journey showcases his diverse experience in planning and executing multiple types of projects. He excels in driving stakeholders to identify and resolve business problems, consistently delivering excellent results. Beyond the professional sphere, Hari finds solace in long meditation, often seeking secluded places or immersing himself in the embrace of nature.
