Hey everyone! 👋 Let's dive into a practical guide on getting Apache Airflow up and running with Docker Compose and pip. This setup is a fantastic way to quickly spin up an Airflow environment for local development, testing, or even a lightweight production deployment, and we'll break everything down step by step so both beginners and experienced users can follow along. So grab your favorite beverage, and let's get started!

Airflow, as many of you already know, is a powerful platform for programmatically authoring, scheduling, and monitoring workflows; it's used by data engineers, data scientists, and anyone who needs to orchestrate complex data pipelines. Docker Compose simplifies defining and running multi-container Docker applications through a single YAML file that configures your application's services, and pip, the Python package installer, is what we'll use to install the Airflow package and its dependencies inside the image.

This guide walks you through a basic Airflow environment: the Airflow webserver, scheduler, a PostgreSQL database for metadata storage, plus Redis and Celery components for running tasks. We'll also cover customizing your Airflow configuration and running your first DAG (Directed Acyclic Graph), the core unit of work in Airflow. This approach isolates your Airflow environment from your system's global Python packages, so you get the right versions of all dependencies, avoid version conflicts, and can share or deploy the environment consistently. We'll keep things clear and concise, provide all the necessary commands and configuration files, and touch on common troubleshooting tips along the way. Whether you're new to Airflow or looking to improve your workflow, this guide will help you get up and running quickly. Let's get our hands dirty!

    Prerequisites

    Before we jump into the fun stuff, let's make sure you have everything you need. You'll need Docker and Docker Compose to run the containerized Airflow environment, plus Python and a code editor. Here's a quick rundown:

    1. Docker: Download and install Docker from the official Docker website (https://www.docker.com/). Docker allows you to package your application and its dependencies into a container. This ensures that the application runs consistently across different environments.
    2. Docker Compose: Docker Compose is typically included with the Docker installation. You can verify its installation by running docker-compose --version in your terminal. Docker Compose is used to define and manage multi-container Docker applications. It uses a YAML file to configure your application's services.
    3. Python and Pip: You should have Python installed on your system, along with the pip package manager, for working with Airflow and your DAGs locally. You can check the installed Python version with python --version or python3 --version. If you don't have Python, you can download it from the official website (https://www.python.org/).
    4. A Code Editor: You'll need a code editor or IDE to create and edit the configuration files and DAGs. Popular choices include VS Code, Sublime Text, or PyCharm.

    Make sure Docker and Docker Compose are installed correctly and that your user has permission to run Docker commands. On Linux, you may need to add your user to the docker group to avoid permission issues: run sudo usermod -aG docker $USER, then log out and back in (or restart your system) for the change to take effect. These tools are the foundation of our Airflow setup, so if any of them are missing, install them now so you can follow along with the next steps. Before proceeding, confirm that you can run docker and docker-compose commands without any issues; a quick check like the one below will do. And if you have any questions, don't hesitate to ask!
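
    A quick sanity check you can run in your terminal (version numbers will differ on your machine, and on newer Docker installations Compose is invoked as docker compose rather than docker-compose):

    docker --version
    docker-compose --version    # or: docker compose version
    python3 --version
    pip3 --version

    If all four commands print a version number without errors, you're good to go. With these prerequisites in place, let's proceed to create our Airflow setup! Ready? Let's go!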

    Setting up the Airflow Environment with Docker Compose

    Alright, guys! Now let's create a directory for our Airflow project. Open up your terminal or command prompt and run the following commands to make a new directory and navigate into it:

    mkdir airflow-docker-compose-tutorial
    cd airflow-docker-compose-tutorial
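
    One more bit of housekeeping: the docker-compose.yml we're about to write mounts three local folders (dags, plugins, and logs) into the containers, so it's worth creating them now rather than letting Docker create them for you, which on Linux can leave them owned by root:

    mkdir -p dags plugins logs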
    

    Next, we'll create the docker-compose.yml file, which defines all the services that make up our Airflow environment. Create a new file named docker-compose.yml in your project directory and paste in the configuration below. This file is the heart of the setup: it tells Docker Compose how to build and run each Airflow service. YAML is indentation-sensitive, so copy the configuration exactly as shown and double-check the indentation; a syntax error here will prevent Airflow from starting. We'll walk through what each piece does right after the listing:

    version: "3.9"
    services:
      postgres:
        image: postgres:13
        environment:
          - POSTGRES_USER=airflow
          - POSTGRES_PASSWORD=airflow
          - POSTGRES_DB=airflow
        ports:
          - "5432:5432"
        volumes:
          - postgres_data:/var/lib/postgresql/data
      airflow-init:
        build:
          context: .
          dockerfile: Dockerfile
        depends_on:
          - postgres
        entrypoint: /bin/bash -c "airflow db init && airflow users create --username admin --password admin --email admin@example.com --firstname admin --lastname admin --role Admin && airflow connections add postgres_default --conn-uri postgres://airflow:airflow@postgres:5432/airflow"
        volumes:
          - ./dags:/opt/airflow/dags
          - ./plugins:/opt/airflow/plugins
          - ./logs:/opt/airflow/logs
        environment:
          - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
          - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
          - AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
          - AIRFLOW__CELERY__RESULT_BACKEND=redis://redis:6379/0
          - AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth
      redis:
        image: redis:latest
        ports:
          - "6379:6379"
        volumes:
          - redis_data:/data
      webserver:
        build:
          context: .
          dockerfile: Dockerfile
        command: webserver
        restart: always
        depends_on:
          - postgres
          - redis
        ports:
          - "8080:8080"
        healthcheck:
          test: [ "CMD", "curl", "-f", "http://localhost:8080/health" ]
          interval: 10s
          timeout: 5s
          retries: 5
        volumes:
          - ./dags:/opt/airflow/dags
          - ./plugins:/opt/airflow/plugins
          - ./logs:/opt/airflow/logs
        environment:
          - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
          - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
          - AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
          - AIRFLOW__CELERY__RESULT_BACKEND=redis://redis:6379/0
          - AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth
      scheduler:
        build:
          context: .
          dockerfile: Dockerfile
        command: scheduler
        restart: always
        depends_on:
          - postgres
          - redis
        healthcheck:
          test: [ "CMD", "airflow", "jobs", "health" ]
          interval: 10s
          timeout: 60s
          retries: 5
        volumes:
          - ./dags:/opt/airflow/dags
          - ./plugins:/opt/airflow/plugins
          - ./logs:/opt/airflow/logs
        environment:
          - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
          - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
          - AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
          - AIRFLOW__CELERY__RESULT_BACKEND=redis://redis:6379/0
          - AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth
      worker:
        build:
          context: .
          dockerfile: Dockerfile
        command: celery worker
        restart: always
        depends_on:
          - postgres
          - redis
        volumes:
          - ./dags:/opt/airflow/dags
          - ./plugins:/opt/airflow/plugins
          - ./logs:/opt/airflow/logs
        environment:
          - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
          - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
          - AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
          - AIRFLOW__CELERY__RESULT_BACKEND=redis://redis:6379/0
      flower:
        build:
          context: .
          dockerfile: Dockerfile
        command: celery flower
        depends_on:
          - redis
        ports:
          - "5555:5555"
        environment:
          - AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
          - AIRFLOW__CELERY__RESULT_BACKEND=redis://redis:6379/0
    volumes:
      postgres_data:
      redis_data:
    

    This docker-compose.yml file is the blueprint for our Airflow environment. It defines several services: postgres, redis, airflow-init, webserver, scheduler, worker, and flower. Each service either pulls an image or builds one from our Dockerfile, and gets its own configuration: a command to run, environment variables, and volumes. The volumes persist the database and Redis data and share the local dags, plugins, and logs folders with the containers. The depends_on settings control startup order; note that this only waits for the containers to start, not for the services inside them to be fully ready, which is one reason the long-running services are set to restart: always. The airflow-init service initializes the metadata database, creates an admin user, and registers a postgres_default connection. The webserver exposes port 8080 for the Airflow web interface, the scheduler decides when tasks should run, the worker executes them via Celery, and flower (on port 5555) is a monitoring UI for the Celery workers. Save this file as docker-compose.yml in your project directory; the build sections reference a Dockerfile, which we'll create in the next step.
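
    Before moving on, you can have Docker Compose parse and validate the file for you; this catches indentation and syntax mistakes without starting anything:

    docker-compose config

    If the command prints the fully resolved configuration instead of an error, the YAML is well-formed. Now, let's get to that Dockerfile!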

    Creating the Dockerfile

    Alright, let's create the Dockerfile. This file tells Docker how to build the Airflow image used by the webserver, scheduler, worker, and flower services, and it's where we install extra packages and customize the environment. Create a new file named Dockerfile (without any extension) in the same directory as your docker-compose.yml file and paste the following content into it:

    # Pin a Python 3.9 image variant so it matches the constraints-3.9.txt file used below
    FROM apache/airflow:2.8.2-python3.9
    USER root
    # Add any OS-level packages your DAGs need to the apt-get install line below
    RUN apt-get update && apt-get install -y --no-install-recommends \
        && apt-get clean && rm -rf /var/lib/apt/lists/*
    USER airflow
    COPY requirements.txt ./
    # Constraints live on the constraints-2.8.2 branch and must match the image's Python version
    RUN pip install --no-cache-dir -r requirements.txt --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.2/constraints-3.9.txt"
    

    This Dockerfile is simple but effective. It starts from the official Apache Airflow image, pinned here to a Python 3.9 variant so that it matches the constraints file used below, refreshes the system package index as root (this is where you'd install any OS-level dependencies your DAGs need), and then switches back to the airflow user. The COPY command copies your requirements.txt file (which we'll create next) into the image, and the RUN pip install command installs the listed Python packages, using Airflow's constraints file to pin versions known to be compatible with Airflow 2.8.2. Including the constraints file is important; it's what saves you from dependency version conflicts. Save this file as Dockerfile, with no extension, in the same directory as docker-compose.yml. With this in place we have a custom image, based on the official one, that we can extend with additional Python packages and settings as needed.
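
    One detail worth double-checking: the constraints file is tied to a specific Python version, so it has to match the Python inside the base image. If you're ever unsure which Python a given Airflow image tag ships with, you can ask the image directly (the official image's entrypoint accepts python as a command, so something like this should print the interpreter version):

    docker run --rm apache/airflow:2.8.2-python3.9 python --version

    If you choose a different image tag, point the --constraint URL at the matching constraints-<python-version>.txt file. Now, let's move on to the requirements file!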

    Creating the requirements.txt

    Now, let's create the requirements.txt file, which lists the extra Python packages your Airflow environment needs. Create a new file named requirements.txt in the same directory as your docker-compose.yml and Dockerfile files. The packages listed here are installed by pip during the Docker image build, so this is the place to declare any additional dependencies. Here's a basic example; add more as needed:

    apache-airflow[cncf.kubernetes, celery, postgres, google]~=2.8.2
    

    In this example, we're installing Airflow 2.8.x (the ~= operator allows any compatible 2.8 patch release from 2.8.2 upward) along with the cncf.kubernetes, celery, postgres, and google extras, which add support for Kubernetes, Celery, PostgreSQL, and Google Cloud Platform integrations. If you don't need all of these, trim the list down to just the extras you use, but keep the version pin so the packages stay compatible with the base image and the constraints file. Tailor this file to your project's needs and save it in the same directory as docker-compose.yml and Dockerfile.
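
    If you want to catch dependency or constraint problems early, you can build the images on their own before starting anything (this is optional, since docker-compose up will build them anyway):

    docker-compose build

    This runs the Dockerfile for every service that has a build section and fails fast if pip can't resolve something in requirements.txt. Got it? Great, let's keep going!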

    Running Airflow with Docker Compose

    Now for the moment of truth! 🥳 We're going to use docker-compose to build and run our Airflow environment. Open a terminal in the directory containing your docker-compose.yml, Dockerfile, and requirements.txt, and run the command below. docker-compose up builds any images that don't exist yet and starts every service defined in docker-compose.yml; the -d flag runs them in detached mode, meaning they keep running in the background:

    docker-compose up -d
    

    This command builds the necessary Docker images from your Dockerfile (on top of the official Airflow image) if they don't already exist, then starts all the services defined in docker-compose.yml: the database, Redis, the init job, the webserver, the scheduler, the worker, and flower. Give it a few minutes, especially on the first run, since Docker has to download and set up all the components. Because we started in detached mode, the logs won't stream to your terminal; use the commands below to check on the containers and follow their output. The logs are your best friend for troubleshooting: if the webserver or scheduler isn't coming up cleanly, they'll usually tell you exactly why. Once everything is running without errors, let's verify the setup!
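
    A couple of commands are handy here; docker-compose ps shows the status of each container (services with a healthcheck should eventually report healthy), and docker-compose logs lets you inspect or follow a specific service:

    docker-compose ps
    docker-compose logs airflow-init        # should show the db init, user creation, and connection output
    docker-compose logs -f webserver        # follow the webserver logs; Ctrl+C to stop

    The airflow-init container is expected to exit once its job is done; the other services should stay up.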

    Accessing the Airflow Web UI

    Once the containers are up and running, you can access the Airflow web UI. Open your web browser and go to http://localhost:8080, and you should see the Airflow login page. Remember, the airflow-init service created an admin user with the username and password both set to admin. Enter those credentials and click Sign In to land on the Airflow dashboard, where your DAGs will show up.
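
    If you'd rather check from the terminal first, the webserver exposes the same /health endpoint that our Compose healthcheck polls; a healthy instance should report both the metadatabase and the scheduler as healthy:

    curl http://localhost:8080/health

    If both components come back healthy, your Airflow environment is up and running.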