Nexton File Upload & Metadata Extraction Service

A serverless file upload service I built with Node.js and AWS. The service accepts file uploads via REST API, stores them in S3, and automatically extracts metadata using Lambda functions.

Architecture Overview

┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│   Client    │─────>│  Express API │─────>│  Amazon S3  │
│ Application │      │  (/upload)   │      │   (Storage) │
└─────────────┘      └──────────────┘      └─────────────┘
                            │                      │
                            │                      │ S3 Event
                            V                      │ Trigger
                     ┌──────────────┐              │
                     │  DynamoDB    │              V
                     │  (Metadata)  │<─────┌─────────────┐
                     └──────────────┘      │AWS Lambda   │
                            ^              │(Metadata    │
                            │              │Extraction)  │
                            │              └─────────────┘
                     ┌──────────────┐
                     │  Express API │
                     │ (/metadata)  │
                     └──────────────┘

Components I'm Using

  1. Express REST API - Handles file uploads and metadata queries
  2. Amazon S3 - File storage with encryption
  3. AWS Lambda - Event-driven metadata extraction
  4. DynamoDB - Metadata storage
  5. API Gateway - Serverless API deployment with Lambda authorizer
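
The heart of this flow is the Lambda function in src/lambda/metadataExtractor.js. The real implementation isn't reproduced in this README; as a rough sketch, assuming AWS SDK v3 and the table name configured below, an S3-triggered handler looks something like this:

// Sketch only - the actual metadataExtractor.js may differ.
const { S3Client, HeadObjectCommand } = require('@aws-sdk/client-s3');
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const fileId = key.split('/').pop();               // uploads/<uuid> -> <uuid>

    // Read basic object attributes from S3.
    const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));

    // Store extracted metadata and flip the record status to "completed".
    await ddb.send(new UpdateCommand({
      TableName: process.env.DYNAMODB_TABLE_NAME,      // e.g. nexton-file-metadata
      Key: { fileId },
      UpdateExpression: 'SET #s = :s, extractedMetadata = :m, processedAt = :t',
      ExpressionAttributeNames: { '#s': 'status' },
      ExpressionAttributeValues: {
        ':s': 'completed',
        ':m': { fileSize: head.ContentLength, contentType: head.ContentType },
        ':t': new Date().toISOString(),
      },
    }));
  }
};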

Features

  • RESTful API for file uploads and metadata retrieval
  • PDF and image file support (JPEG, PNG)
  • Automatic metadata extraction (file size, type, PDF page count, text content)
  • User-defined metadata with validation
  • S3 encryption for secure file storage
  • Event-driven processing with Lambda
  • Unique file identifiers (UUID v4)
  • API key authentication with Lambda authorizer
  • Input validation and sanitization

Prerequisites

  • Node.js >= 18.0.0 (tested with v24.11.1)
  • AWS CLI configured with appropriate credentials
  • AWS Account with permissions to create:
    • S3 buckets
    • DynamoDB tables
    • Lambda functions
    • IAM roles and policies
  • Yarn package manager

Installation & Setup

1. Clone and Install Dependencies

git clone <repository-url>
cd nexton_upload
yarn install

2. Configure Environment Variables

cp .env.example .env

Edit .env with your AWS configuration (some values are filled in from the output of the setup script in step 3):

AWS_REGION=us-east-1
AWS_ACCOUNT_ID=your-account-id
S3_BUCKET_NAME=nexton-file-uploads
DYNAMODB_TABLE_NAME=nexton-file-metadata
LAMBDA_FUNCTION_NAME=nexton-metadata-extractor
PORT=3000
NODE_ENV=development
MAX_FILE_SIZE_MB=10
ALLOWED_FILE_TYPES=application/pdf,image/jpeg,image/png,image/jpg
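
A minimal sketch of how the application code might load this configuration (assuming the dotenv package and the variable names above):

// Sketch: centralised config loader, assuming dotenv is installed.
require('dotenv').config();

const config = {
  region: process.env.AWS_REGION || 'us-east-1',
  bucket: process.env.S3_BUCKET_NAME,
  table: process.env.DYNAMODB_TABLE_NAME,
  port: Number(process.env.PORT) || 3000,
  maxFileSizeBytes: Number(process.env.MAX_FILE_SIZE_MB || 10) * 1024 * 1024,
  allowedTypes: (process.env.ALLOWED_FILE_TYPES || '').split(',').filter(Boolean),
};

module.exports = config;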

3. Deploy AWS Infrastructure

Run the automated setup script to create all AWS resources:

./scripts/setup-aws-resources.sh

This script will:

  • Create an S3 bucket with versioning, encryption, and private access
  • Create a DynamoDB table with optimized indexes
  • Create IAM roles with least-privilege permissions
  • Deploy the Lambda function for metadata extraction
  • Configure S3 event notifications to trigger Lambda

Update your .env file with the output from the setup script.
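
The script's internals aren't reproduced in this README; the S3 event notification it configures is roughly equivalent to the following AWS SDK v3 call (the uploads/ prefix is an assumption based on the object keys shown later):

// Sketch: the S3 -> Lambda trigger the setup script configures, expressed with the SDK.
const { S3Client, PutBucketNotificationConfigurationCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: process.env.AWS_REGION });

async function configureUploadTrigger(lambdaArn) {
  await s3.send(new PutBucketNotificationConfigurationCommand({
    Bucket: process.env.S3_BUCKET_NAME,              // e.g. nexton-file-uploads
    NotificationConfiguration: {
      LambdaFunctionConfigurations: [{
        LambdaFunctionArn: lambdaArn,                // ARN of nexton-metadata-extractor
        Events: ['s3:ObjectCreated:*'],
        // Assumed prefix; the script may use a different filter.
        Filter: { Key: { FilterRules: [{ Name: 'prefix', Value: 'uploads/' }] } },
      }],
    },
  }));
}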

4. Start the API Server

npm start
# or for development with auto-reload:
npm run dev

The server will start on http://localhost:3000 (or your configured PORT).

API Documentation

Authentication

All API requests require authentication using an API key.

Include the API key in the x-api-key header:

# Get your API key
API_KEY=$(cat .api-key)

# Use it in requests
curl -H "x-api-key: $API_KEY" https://your-api-endpoint.com/...

Without a valid API key, you'll receive:

{
  "message": "Unauthorized"
}

Status: 401 Unauthorized
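
The authorizer function itself isn't shown in this README; for an HTTP API with simple responses enabled, a minimal sketch of a Lambda authorizer that checks this header could look like this (the API_KEY environment variable is an assumption):

// Sketch of a request-based Lambda authorizer (HTTP API, simple response mode).
exports.handler = async (event) => {
  const provided = event.headers && event.headers['x-api-key'];
  const expected = process.env.API_KEY;   // assumed to hold the provisioned key
  return { isAuthorized: Boolean(provided) && provided === expected };
};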

Health Check

GET /health

Check service status.

Response:

{
  "status": "healthy",
  "timestamp": "2025-12-01T10:00:00.000Z",
  "service": "nexton-file-upload-service",
  "version": "1.0.0"
}
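
A minimal sketch of the Express route behind this response (the actual server.js may differ):

// Sketch: health check route in Express.
const express = require('express');
const app = express();

app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    service: 'nexton-file-upload-service',
    version: '1.0.0',
  });
});

app.listen(process.env.PORT || 3000);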

Upload File

POST /upload

Upload a file with optional metadata.

Request:

  • Content-Type: multipart/form-data
  • Body:
    • file: File to upload (required)
    • metadata: JSON string with user metadata (optional)

Example using curl:

# Load API key
API_KEY=$(cat .api-key)

# Upload file
curl -X POST http://localhost:3000/upload \
  -H "x-api-key: $API_KEY" \
  -F "file=@document.pdf" \
  -F 'metadata={"author":"John Doe","expirationDate":"2025-12-31","description":"Important document"}'

Example using JavaScript:

const formData = new FormData();
formData.append('file', fileInput.files[0]);
formData.append('metadata', JSON.stringify({
  author: 'John Doe',
  expirationDate: '2025-12-31',
  description: 'Important document'
}));

const response = await fetch('http://localhost:3000/upload', {
  method: 'POST',
  headers: {
    'x-api-key': 'your-api-key-here'
  },
  body: formData
});

const result = await response.json();
console.log('File ID:', result.file_id);

Success Response (201):

{
  "success": true,
  "file_id": "123e4567-e89b-12d3-a456-426614174000",
  "message": "File uploaded successfully. Metadata extraction in progress.",
  "details": {
    "fileName": "document.pdf",
    "fileSize": 102400,
    "contentType": "application/pdf",
    "uploadedAt": "2025-12-01T10:00:00.000Z",
    "status": "processing"
  }
}

Error Response (400):

{
  "success": false,
  "error": "Validation failed",
  "details": [
    "File size exceeds maximum allowed size of 10MB"
  ]
}

Retrieve Metadata

GET /metadata/:file_id

Retrieve complete metadata for a file.

Example:

# Load API key
API_KEY=$(cat .api-key)

# Get metadata
curl -H "x-api-key: $API_KEY" \
  http://localhost:3000/metadata/123e4567-e89b-12d3-a456-426614174000

Success Response (200):

{
  "success": true,
  "file_id": "123e4567-e89b-12d3-a456-426614174000",
  "metadata": {
    "fileName": "document.pdf",
    "contentType": "application/pdf",
    "fileSize": 102400,
    "uploadedAt": "2025-12-01T10:00:00.000Z",
    "processedAt": "2025-12-01T10:00:05.000Z",
    "status": "completed",
    "userMetadata": {
      "author": "john-doe",
      "expirationdate": "2025-12-31",
      "description": "Important document"
    },
    "extractedMetadata": {
      "fileSize": 102400,
      "fileSizeFormatted": "100 KB",
      "contentType": "application/pdf",
      "type": "pdf",
      "numberOfPages": 5,
      "pdfInfo": {
        "producer": "Adobe PDF Library 15.0",
        "creator": "Microsoft Word",
        "title": "Document Title"
      },
      "textSample": "This is the beginning of the document...",
      "textLength": 5420,
      "hasText": true,
      "extractedAt": "2025-12-01T10:00:05.000Z"
    },
    "s3Location": "s3://nexton-file-uploads/uploads/123e4567-e89b-12d3-a456-426614174000"
  }
}

Status Values:

  • processing: Lambda function is still extracting metadata
  • completed: Metadata extraction completed successfully
  • failed: Metadata extraction failed (see errorMessage field)
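
Because extraction is asynchronous, clients are expected to poll this endpoint until the status leaves processing. A small sketch of such a polling helper (the endpoint and header names follow the examples above; the helper itself is illustrative):

// Sketch: poll /metadata/:file_id until the Lambda has finished processing.
async function waitForMetadata(baseUrl, apiKey, fileId, { intervalMs = 2000, maxAttempts = 15 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(`${baseUrl}/metadata/${fileId}`, {
      headers: { 'x-api-key': apiKey },
    });
    if (res.ok) {
      const body = await res.json();
      if (body.metadata.status !== 'processing') return body;   // completed or failed
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Metadata for ${fileId} not ready after ${maxAttempts} attempts`);
}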

Error Response (404):

{
  "success": false,
  "error": "File not found",
  "file_id": "123e4567-e89b-12d3-a456-426614174000"
}

Testing

Run Unit Tests

npm test
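
The test framework isn't named in this README; as a sketch, assuming Jest and a validation module shaped like src/utils/validation.js (the validateFile helper and its return shape are assumptions), a unit test might look like:

// Sketch only: hypothetical test against a validateFile helper.
const { validateFile } = require('../src/utils/validation');

describe('validateFile', () => {
  it('rejects files over the configured size limit', () => {
    const file = { mimetype: 'application/pdf', size: 11 * 1024 * 1024, originalname: 'big.pdf' };
    expect(validateFile(file).valid).toBe(false);
  });

  it('accepts a small PDF', () => {
    const file = { mimetype: 'application/pdf', size: 1024, originalname: 'ok.pdf' };
    expect(validateFile(file).valid).toBe(true);
  });
});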

Run Integration Tests

Ensure the API server is running and AWS resources are deployed:

npm start  # In one terminal
npm test   # In another terminal

Manual Testing

  1. Upload a PDF:

curl -X POST http://localhost:3000/upload \
  -F "file=@test.pdf" \
  -F 'metadata={"author":"Test User"}'

  2. Check metadata (wait a few seconds for Lambda processing):

curl http://localhost:3000/metadata/<file_id_from_upload>

  3. Verify in AWS Console:
    • Check S3 bucket for uploaded file
    • Check DynamoDB table for metadata
    • Check CloudWatch Logs for Lambda execution

Security Measures I Implemented

  1. Authentication & Authorization

    • API Key Authentication - I secured all endpoints with Lambda authorizer
    • IAM roles with least-privilege access
    • S3 bucket with private access (no public reads)
    • Lambda execution roles limited to specific resources
    • API keys stored securely (.api-key file excluded from git)
  2. Data Protection

    • S3 server-side encryption (AES-256)
    • S3 versioning enabled for data recovery
    • HTTPS-only API through API Gateway
  3. Input Validation

    • File type validation (MIME type + extension check)
    • File size limits (default 10MB, configurable)
    • Metadata sanitization to prevent injection attacks
    • UUID validation for file IDs
  4. Error Handling

    • Sanitized error messages (no sensitive data exposed)
    • Server-side logging for debugging
    • Graceful degradation on service failures

Monitoring & Debugging

CloudWatch Logs

  • Lambda Logs: /aws/lambda/nexton-metadata-extractor
  • API Logs: Configure application logging as needed

Useful AWS CLI Commands

# Check Lambda function status
aws lambda get-function --function-name nexton-metadata-extractor --region us-east-1

# View recent Lambda logs
aws logs tail /aws/lambda/nexton-metadata-extractor --follow --region us-east-1

# List files in S3 bucket
aws s3 ls s3://nexton-file-uploads/uploads/

# Query DynamoDB for file
aws dynamodb get-item \
  --table-name nexton-file-metadata \
  --key '{"fileId":{"S":"<your-file-id>"}}' \
  --region us-east-1

# Check DynamoDB table status
aws dynamodb describe-table \
  --table-name nexton-file-metadata \
  --region us-east-1

Why I Chose These Technologies

Express.js

  • In my opinion, it's the most straightforward framework for building APIs
  • Great middleware ecosystem (I'm using multer for file uploads)
  • Easy to understand and maintain

S3 for Storage

  • Virtually unlimited scalability
  • Built-in encryption and versioning
  • Event notifications trigger my Lambda functions automatically
  • Cost-effective

Lambda for Processing

  • I only pay for actual compute time
  • Auto-scaling without managing servers
  • Perfect for event-driven architecture

DynamoDB for Metadata

  • Fast response times
  • Flexible schema - I can store different metadata for different file types
  • Automatic scaling

API Gateway + Lambda Authorizer

  • I chose this for serverless deployment
  • Lambda authorizer lets me validate API keys on every request
  • No servers to manage, scales automatically

Design Patterns I Used

  1. Event-Driven Architecture - S3 events trigger Lambda asynchronously
  2. Separation of Concerns - API, storage, and processing are decoupled
  3. Idempotency - Unique UUIDs prevent duplicate processing
  4. Graceful Degradation - API returns success even if Lambda fails

Challenges I Solved

Asynchronous Processing

The Problem: Metadata extraction takes time; I didn't want clients waiting.

My Solution:

  • Upload returns immediately with file_id
  • Status field indicates processing state
  • Clients poll /metadata/:file_id to check completion

Large File Uploads

The Problem: Memory constraints with large files.

My Solution:

  • I set configurable file size limits (default 10MB)
  • Using memory storage in multer for direct S3 upload (no disk I/O)
  • Tuned Lambda memory allocation to 512MB
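
A sketch of how the /upload route might wire multer's memory storage so the buffer goes straight to S3 (illustrative only; the actual server.js may differ):

// Sketch: in-memory multer upload handed straight to S3, no temp files on disk.
const express = require('express');
const multer = require('multer');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const { randomUUID } = require('crypto');

const app = express();
const s3 = new S3Client({});
const maxBytes = Number(process.env.MAX_FILE_SIZE_MB || 10) * 1024 * 1024;
const upload = multer({ storage: multer.memoryStorage(), limits: { fileSize: maxBytes } });

app.post('/upload', upload.single('file'), async (req, res) => {
  const fileId = randomUUID();                       // UUID v4
  await s3.send(new PutObjectCommand({
    Bucket: process.env.S3_BUCKET_NAME,
    Key: `uploads/${fileId}`,
    Body: req.file.buffer,                           // buffer kept in memory by multer
    ContentType: req.file.mimetype,
    ServerSideEncryption: 'AES256',
  }));
  res.status(201).json({ success: true, file_id: fileId });
});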

Concurrent Uploads

The Problem: Multiple uploads could cause race conditions.

My Solution:

  • UUID v4 for globally unique file identifiers
  • DynamoDB conditional writes prevent overwrites
  • S3 object versioning for data protection
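
A conditional write of the kind mentioned above might look like this (a sketch using the AWS SDK v3 document client; attribute and table names follow this README):

// Sketch: create the metadata record only if this fileId has never been written.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, PutCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function createMetadataRecord(fileId, item) {
  await ddb.send(new PutCommand({
    TableName: process.env.DYNAMODB_TABLE_NAME,
    Item: { fileId, ...item, status: 'processing' },
    ConditionExpression: 'attribute_not_exists(fileId)',   // reject duplicate fileIds
  }));
}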

Security

The Problem: File uploads are a common attack vector.

My Solution:

  • Strict MIME type validation
  • File extension verification
  • Metadata sanitization to prevent injection attacks
  • S3 private access only
  • IAM least-privilege principles
  • API key authentication on all endpoints
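
A sketch of what those validation and sanitization checks might look like (the helper names here are illustrative, not the actual src/utils/validation.js API):

// Sketch: hypothetical helpers combining the measures listed above.
const path = require('path');

const ALLOWED = {
  'application/pdf': ['.pdf'],
  'image/jpeg': ['.jpg', '.jpeg'],
  'image/png': ['.png'],
};
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function isAllowedFile(file) {
  const extensions = ALLOWED[file.mimetype];                    // MIME type check
  if (!extensions) return false;
  return extensions.includes(path.extname(file.originalname).toLowerCase()); // extension check
}

function sanitizeMetadata(metadata) {
  const clean = {};
  for (const [key, value] of Object.entries(metadata)) {
    // Strip characters commonly used in XSS / injection payloads.
    clean[key.toLowerCase()] = String(value).replace(/[<>"'`;]/g, '').trim();
  }
  return clean;
}

function isValidFileId(fileId) {
  return UUID_V4.test(fileId);                                  // only well-formed UUID v4 ids
}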

Project Scope

File Types: I'm supporting PDF, JPEG, and PNG files.

File Sizes: I set a 10MB limit, which I think is suitable for most documents.

Metadata Storage: User metadata is stored as key-value pairs in DynamoDB.

Scalability: The serverless design handles thousands of concurrent uploads since S3 and Lambda auto-scale.

Data Retention: No automatic expiration - files are kept indefinitely unless manually deleted.

Geographic Distribution: Single region deployment (us-east-1).

Deployment

I'm using API Gateway + Lambda for serverless deployment:

# Deploy or update the API
bash scripts/deploy-api-gateway.sh

# Add authorization (if not configured)
bash scripts/add-api-authorization.sh

See DEPLOYMENT_STATUS.md for complete deployment documentation.

Cleanup

To delete all AWS resources:

./scripts/cleanup-aws-resources.sh

Warning: This will permanently delete all uploaded files and metadata!

Project Structure

nexton_upload/
├── src/
│   ├── api/
│   │   └── server.js              # Express API server
│   ├── lambda/
│   │   ├── metadataExtractor.js   # Lambda function code
│   │   └── package.json           # Lambda dependencies
│   ├── utils/
│   │   ├── s3.js                  # S3 operations
│   │   ├── dynamodb.js            # DynamoDB operations
│   │   └── validation.js          # Input validation
│   └── tests/                     # Unit, integration, E2E tests
├── scripts/
│   ├── setup-aws-resources.sh     # Infrastructure setup
│   ├── deploy-api-gateway.sh      # API Gateway deployment
│   ├── add-api-authorization.sh   # Add Lambda authorizer
│   └── cleanup-aws-resources.sh   # Resource cleanup
├── package.json
├── .env.example
└── README.md

Author

Leonardo da Silva Calado