Nexton File Upload & Metadata Extraction Service

A serverless file upload service I built with Node.js and AWS. The service accepts file uploads via REST API, stores them in S3, and automatically extracts metadata using Lambda functions.

Architecture Overview

┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│   Client    │─────>│  Express API │─────>│  Amazon S3  │
│ Application │      │  (/upload)   │      │   (Storage) │
└─────────────┘      └──────────────┘      └─────────────┘
                            │                      │
                            │                      │ S3 Event
                            V                      │ Trigger
                     ┌──────────────┐              │
                     │  DynamoDB    │              V
                     │  (Metadata)  │<─────┌─────────────┐
                     └──────────────┘      │AWS Lambda   │
                            ^              │(Metadata    │
                            │              │Extraction)  │
                            │              └─────────────┘
                     ┌──────────────┐
                     │  Express API │
                     │ (/metadata)  │
                     └──────────────┘

Components I'm Using

  1. Express REST API - Handles file uploads and metadata queries
  2. Amazon S3 - File storage with encryption
  3. AWS Lambda - Event-driven metadata extraction
  4. DynamoDB - Metadata storage
  5. API Gateway - Serverless API deployment with Lambda authorizer
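
The heart of this flow is the Lambda function in src/lambda/metadataExtractor.js. The real implementation isn't reproduced in this README; as a rough sketch, assuming AWS SDK v3 and the table name configured below, an S3-triggered handler looks something like this:

// Sketch only - the actual metadataExtractor.js may differ.
const { S3Client, HeadObjectCommand } = require('@aws-sdk/client-s3');
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const fileId = key.split('/').pop();               // uploads/<uuid> -> <uuid>

    // Read basic object attributes from S3.
    const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));

    // Store extracted metadata and flip the record status to "completed".
    await ddb.send(new UpdateCommand({
      TableName: process.env.DYNAMODB_TABLE_NAME,      // e.g. nexton-file-metadata
      Key: { fileId },
      UpdateExpression: 'SET #s = :s, extractedMetadata = :m, processedAt = :t',
      ExpressionAttributeNames: { '#s': 'status' },
      ExpressionAttributeValues: {
        ':s': 'completed',
        ':m': { fileSize: head.ContentLength, contentType: head.ContentType },
        ':t': new Date().toISOString(),
      },
    }));
  }
};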

Features

  • RESTful API for file uploads and metadata retrieval
  • PDF and image file support (JPEG, PNG)
  • Automatic metadata extraction (file size, type, PDF page count, text content)
  • User-defined metadata with validation
  • S3 encryption for secure file storage
  • Event-driven processing with Lambda
  • Unique file identifiers (UUID v4)
  • API key authentication with Lambda authorizer
  • Input validation and sanitization

Prerequisites

  • Node.js >= 18.0.0 (tested with v24.11.1)
  • AWS CLI configured with appropriate credentials
  • AWS Account with permissions to create:
    • S3 buckets
    • DynamoDB tables
    • Lambda functions
    • IAM roles and policies
  • Yarn package manager

Installation & Setup

1. Clone and Install Dependencies

git clone <repository-url>
cd nexton_upload
yarn install

2. Configure Environment Variables

cp .env.example .env

Edit .env with your AWS configuration (some values are filled in from the output of the setup script in step 3):

AWS_REGION=us-east-1
AWS_ACCOUNT_ID=your-account-id
S3_BUCKET_NAME=nexton-file-uploads
DYNAMODB_TABLE_NAME=nexton-file-metadata
LAMBDA_FUNCTION_NAME=nexton-metadata-extractor
PORT=3000
NODE_ENV=development
MAX_FILE_SIZE_MB=10
ALLOWED_FILE_TYPES=application/pdf,image/jpeg,image/png,image/jpg
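
A minimal sketch of how the application code might load this configuration (assuming the dotenv package and the variable names above):

// Sketch: centralised config loader, assuming dotenv is installed.
require('dotenv').config();

const config = {
  region: process.env.AWS_REGION || 'us-east-1',
  bucket: process.env.S3_BUCKET_NAME,
  table: process.env.DYNAMODB_TABLE_NAME,
  port: Number(process.env.PORT) || 3000,
  maxFileSizeBytes: Number(process.env.MAX_FILE_SIZE_MB || 10) * 1024 * 1024,
  allowedTypes: (process.env.ALLOWED_FILE_TYPES || '').split(',').filter(Boolean),
};

module.exports = config;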

3. Deploy AWS Infrastructure

Run the automated setup script to create all AWS resources:

./scripts/setup-aws-resources.sh

This script will:

  • Create an S3 bucket with versioning, encryption, and private access
  • Create a DynamoDB table with optimized indexes
  • Create IAM roles with least-privilege permissions
  • Deploy the Lambda function for metadata extraction
  • Configure S3 event notifications to trigger Lambda

Update your .env file with the output from the setup script.
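
The script's internals aren't reproduced in this README; the S3 event notification it configures is roughly equivalent to the following AWS SDK v3 call (the uploads/ prefix is an assumption based on the object keys shown later):

// Sketch: the S3 -> Lambda trigger the setup script configures, expressed with the SDK.
const { S3Client, PutBucketNotificationConfigurationCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: process.env.AWS_REGION });

async function configureUploadTrigger(lambdaArn) {
  await s3.send(new PutBucketNotificationConfigurationCommand({
    Bucket: process.env.S3_BUCKET_NAME,              // e.g. nexton-file-uploads
    NotificationConfiguration: {
      LambdaFunctionConfigurations: [{
        LambdaFunctionArn: lambdaArn,                // ARN of nexton-metadata-extractor
        Events: ['s3:ObjectCreated:*'],
        // Assumed prefix; the script may use a different filter.
        Filter: { Key: { FilterRules: [{ Name: 'prefix', Value: 'uploads/' }] } },
      }],
    },
  }));
}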

4. Start the API Server

npm start
# or for development with auto-reload:
npm run dev

The server will start on http://localhost:3000 (or your configured PORT).

API Documentation

Authentication

All API requests require authentication using an API key.

Include the API key in the x-api-key header:

# Get your API key
API_KEY=$(cat .api-key)

# Use it in requests
curl -H "x-api-key: $API_KEY" https://your-api-endpoint.com/...

Without a valid API key, you'll receive:

{
  "message": "Unauthorized"
}

Status: 401 Unauthorized
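
The authorizer function itself isn't shown in this README; for an HTTP API with simple responses enabled, a minimal sketch of a Lambda authorizer that checks this header could look like this (the API_KEY environment variable is an assumption):

// Sketch of a request-based Lambda authorizer (HTTP API, simple response mode).
exports.handler = async (event) => {
  const provided = event.headers && event.headers['x-api-key'];
  const expected = process.env.API_KEY;   // assumed to hold the provisioned key
  return { isAuthorized: Boolean(provided) && provided === expected };
};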

Health Check

GET /health

Check service status.

Response:

{
  "status": "healthy",
  "timestamp": "2025-12-01T10:00:00.000Z",
  "service": "nexton-file-upload-service",
  "version": "1.0.0"
}
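
A minimal sketch of the Express route behind this response (the actual server.js may differ):

// Sketch: health check route in Express.
const express = require('express');
const app = express();

app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    service: 'nexton-file-upload-service',
    version: '1.0.0',
  });
});

app.listen(process.env.PORT || 3000);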

Upload File

POST /upload

Upload a file with optional metadata.

Request:

  • Content-Type: multipart/form-data
  • Body:
    • file: File to upload (required)
    • metadata: JSON string with user metadata (optional)

Example using curl:

# Load API key
API_KEY=$(cat .api-key)

# Upload file
curl -X POST http://localhost:3000/upload \
  -H "x-api-key: $API_KEY" \
  -F "file=@document.pdf" \
  -F 'metadata={"author":"John Doe","expirationDate":"2025-12-31","description":"Important document"}'

Example using JavaScript:

const formData = new FormData();
formData.append('file', fileInput.files[0]);
formData.append('metadata', JSON.stringify({
  author: 'John Doe',
  expirationDate: '2025-12-31',
  description: 'Important document'
}));

const response = await fetch('http://localhost:3000/upload', {
  method: 'POST',
  headers: {
    'x-api-key': 'your-api-key-here'
  },
  body: formData
});

const result = await response.json();
console.log('File ID:', result.file_id);

Success Response (201):

{
  "success": true,
  "file_id": "123e4567-e89b-12d3-a456-426614174000",
  "message": "File uploaded successfully. Metadata extraction in progress.",
  "details": {
    "fileName": "document.pdf",
    "fileSize": 102400,
    "contentType": "application/pdf",
    "uploadedAt": "2025-12-01T10:00:00.000Z",
    "status": "processing"
  }
}

Error Response (400):

{
  "success": false,
  "error": "Validation failed",
  "details": [
    "File size exceeds maximum allowed size of 10MB"
  ]
}

Retrieve Metadata

GET /metadata/:file_id

Retrieve complete metadata for a file.

Example:

# Load API key
API_KEY=$(cat .api-key)

# Get metadata
curl -H "x-api-key: $API_KEY" \
  http://localhost:3000/metadata/123e4567-e89b-12d3-a456-426614174000

Success Response (200):

{
  "success": true,
  "file_id": "123e4567-e89b-12d3-a456-426614174000",
  "metadata": {
    "fileName": "document.pdf",
    "contentType": "application/pdf",
    "fileSize": 102400,
    "uploadedAt": "2025-12-01T10:00:00.000Z",
    "processedAt": "2025-12-01T10:00:05.000Z",
    "status": "completed",
    "userMetadata": {
      "author": "john-doe",
      "expirationdate": "2025-12-31",
      "description": "Important document"
    },
    "extractedMetadata": {
      "fileSize": 102400,
      "fileSizeFormatted": "100 KB",
      "contentType": "application/pdf",
      "type": "pdf",
      "numberOfPages": 5,
      "pdfInfo": {
        "producer": "Adobe PDF Library 15.0",
        "creator": "Microsoft Word",
        "title": "Document Title"
      },
      "textSample": "This is the beginning of the document...",
      "textLength": 5420,
      "hasText": true,
      "extractedAt": "2025-12-01T10:00:05.000Z"
    },
    "s3Location": "s3://nexton-file-uploads/uploads/123e4567-e89b-12d3-a456-426614174000"
  }
}

Status Values:

  • processing: Lambda function is still extracting metadata
  • completed: Metadata extraction completed successfully
  • failed: Metadata extraction failed (see errorMessage field)
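
Because extraction is asynchronous, clients are expected to poll this endpoint until the status leaves processing. A small sketch of such a polling helper (the endpoint and header names follow the examples above; the helper itself is illustrative):

// Sketch: poll /metadata/:file_id until the Lambda has finished processing.
async function waitForMetadata(baseUrl, apiKey, fileId, { intervalMs = 2000, maxAttempts = 15 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(`${baseUrl}/metadata/${fileId}`, {
      headers: { 'x-api-key': apiKey },
    });
    if (res.ok) {
      const body = await res.json();
      if (body.metadata.status !== 'processing') return body;   // completed or failed
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Metadata for ${fileId} not ready after ${maxAttempts} attempts`);
}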

Error Response (404):

{
  "success": false,
  "error": "File not found",
  "file_id": "123e4567-e89b-12d3-a456-426614174000"
}

Testing

Run Unit Tests

npm test
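
The test framework isn't named in this README; as a sketch, assuming Jest and a validation module shaped like src/utils/validation.js (the validateFile helper and its return shape are assumptions), a unit test might look like:

// Sketch only: hypothetical test against a validateFile helper.
const { validateFile } = require('../src/utils/validation');

describe('validateFile', () => {
  it('rejects files over the configured size limit', () => {
    const file = { mimetype: 'application/pdf', size: 11 * 1024 * 1024, originalname: 'big.pdf' };
    expect(validateFile(file).valid).toBe(false);
  });

  it('accepts a small PDF', () => {
    const file = { mimetype: 'application/pdf', size: 1024, originalname: 'ok.pdf' };
    expect(validateFile(file).valid).toBe(true);
  });
});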

Run Integration Tests

Ensure the API server is running and AWS resources are deployed:

npm start  # In one terminal
npm test   # In another terminal

Manual Testing

  1. Upload a PDF:

curl -X POST http://localhost:3000/upload \
  -F "file=@test.pdf" \
  -F 'metadata={"author":"Test User"}'

  2. Check metadata (wait a few seconds for Lambda processing):

curl http://localhost:3000/metadata/<file_id_from_upload>

  3. Verify in AWS Console:
    • Check S3 bucket for uploaded file
    • Check DynamoDB table for metadata
    • Check CloudWatch Logs for Lambda execution

Security Measures I Implemented

  1. Authentication & Authorization

    • API Key Authentication - I secured all endpoints with Lambda authorizer
    • IAM roles with least-privilege access
    • S3 bucket with private access (no public reads)
    • Lambda execution roles limited to specific resources
    • API keys stored securely (.api-key file excluded from git)
  2. Data Protection

    • S3 server-side encryption (AES-256)
    • S3 versioning enabled for data recovery
    • HTTPS-only API through API Gateway
  3. Input Validation

    • File type validation (MIME type + extension check)
    • File size limits (default 10MB, configurable)
    • Metadata sanitization to prevent injection attacks
    • UUID validation for file IDs
  4. Error Handling

    • Sanitized error messages (no sensitive data exposed)
    • Server-side logging for debugging
    • Graceful degradation on service failures

Monitoring & Debugging

CloudWatch Logs

  • Lambda Logs: /aws/lambda/nexton-metadata-extractor
  • API Logs: Configure application logging as needed

Useful AWS CLI Commands

# Check Lambda function status
aws lambda get-function --function-name nexton-metadata-extractor --region us-east-1

# View recent Lambda logs
aws logs tail /aws/lambda/nexton-metadata-extractor --follow --region us-east-1

# List files in S3 bucket
aws s3 ls s3://nexton-file-uploads/uploads/

# Query DynamoDB for file
aws dynamodb get-item \
  --table-name nexton-file-metadata \
  --key '{"fileId":{"S":"<your-file-id>"}}' \
  --region us-east-1

# Check DynamoDB table status
aws dynamodb describe-table \
  --table-name nexton-file-metadata \
  --region us-east-1

Why I Chose These Technologies

Express.js

  • In my opinion, it's the most straightforward framework for building APIs
  • Great middleware ecosystem (I'm using multer for file uploads)
  • Easy to understand and maintain

S3 for Storage

  • Virtually unlimited scalability
  • Built-in encryption and versioning
  • Event notifications trigger my Lambda functions automatically
  • Cost-effective

Lambda for Processing

  • I only pay for actual compute time
  • Auto-scaling without managing servers
  • Perfect for event-driven architecture

DynamoDB for Metadata

  • Fast response times
  • Flexible schema - I can store different metadata for different file types
  • Automatic scaling

API Gateway + Lambda Authorizer

  • I chose this for serverless deployment
  • Lambda authorizer lets me validate API keys on every request
  • No servers to manage, scales automatically

Design Patterns I Used

  1. Event-Driven Architecture - S3 events trigger Lambda asynchronously
  2. Separation of Concerns - API, storage, and processing are decoupled
  3. Idempotency - Unique UUIDs prevent duplicate processing
  4. Graceful Degradation - API returns success even if Lambda fails

Challenges I Solved

Asynchronous Processing

The Problem: Metadata extraction takes time; I didn't want clients waiting.

My Solution:

  • Upload returns immediately with file_id
  • Status field indicates processing state
  • Clients poll /metadata/:file_id to check completion

Large File Uploads

The Problem: Memory constraints with large files.

My Solution:

  • I set configurable file size limits (default 10MB)
  • Using memory storage in multer for direct S3 upload (no disk I/O)
  • Tuned Lambda memory allocation to 512MB
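
A sketch of how the /upload route might wire multer's memory storage so the buffer goes straight to S3 (illustrative only; the actual server.js may differ):

// Sketch: in-memory multer upload handed straight to S3, no temp files on disk.
const express = require('express');
const multer = require('multer');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const { randomUUID } = require('crypto');

const app = express();
const s3 = new S3Client({});
const maxBytes = Number(process.env.MAX_FILE_SIZE_MB || 10) * 1024 * 1024;
const upload = multer({ storage: multer.memoryStorage(), limits: { fileSize: maxBytes } });

app.post('/upload', upload.single('file'), async (req, res) => {
  const fileId = randomUUID();                       // UUID v4
  await s3.send(new PutObjectCommand({
    Bucket: process.env.S3_BUCKET_NAME,
    Key: `uploads/${fileId}`,
    Body: req.file.buffer,                           // buffer kept in memory by multer
    ContentType: req.file.mimetype,
    ServerSideEncryption: 'AES256',
  }));
  res.status(201).json({ success: true, file_id: fileId });
});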

Concurrent Uploads

The Problem: Multiple uploads could cause race conditions.

My Solution:

  • UUID v4 for globally unique file identifiers
  • DynamoDB conditional writes prevent overwrites
  • S3 object versioning for data protection
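
A conditional write of the kind mentioned above might look like this (a sketch using the AWS SDK v3 document client; attribute and table names follow this README):

// Sketch: create the metadata record only if this fileId has never been written.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, PutCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function createMetadataRecord(fileId, item) {
  await ddb.send(new PutCommand({
    TableName: process.env.DYNAMODB_TABLE_NAME,
    Item: { fileId, ...item, status: 'processing' },
    ConditionExpression: 'attribute_not_exists(fileId)',   // reject duplicate fileIds
  }));
}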

Security

The Problem: File uploads are a common attack vector.

My Solution:

  • Strict MIME type validation
  • File extension verification
  • Metadata sanitization to prevent injection attacks
  • S3 private access only
  • IAM least-privilege principles
  • API key authentication on all endpoints
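
A sketch of what those validation and sanitization checks might look like (the helper names here are illustrative, not the actual src/utils/validation.js API):

// Sketch: hypothetical helpers combining the measures listed above.
const path = require('path');

const ALLOWED = {
  'application/pdf': ['.pdf'],
  'image/jpeg': ['.jpg', '.jpeg'],
  'image/png': ['.png'],
};
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function isAllowedFile(file) {
  const extensions = ALLOWED[file.mimetype];                    // MIME type check
  if (!extensions) return false;
  return extensions.includes(path.extname(file.originalname).toLowerCase()); // extension check
}

function sanitizeMetadata(metadata) {
  const clean = {};
  for (const [key, value] of Object.entries(metadata)) {
    // Strip characters commonly used in XSS / injection payloads.
    clean[key.toLowerCase()] = String(value).replace(/[<>"'`;]/g, '').trim();
  }
  return clean;
}

function isValidFileId(fileId) {
  return UUID_V4.test(fileId);                                  // only well-formed UUID v4 ids
}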

Project Scope

File Types: I'm supporting PDF, JPEG, and PNG files.

File Sizes: I set a 10MB limit, which I think is suitable for most documents.

Metadata Storage: User metadata is stored as key-value pairs in DynamoDB.

Scalability: The serverless design handles thousands of concurrent uploads since S3 and Lambda auto-scale.

Data Retention: No automatic expiration - files are kept indefinitely unless manually deleted.

Geographic Distribution: Single region deployment (us-east-1).

Deployment

I'm using API Gateway + Lambda for serverless deployment:

# Deploy or update the API
bash scripts/deploy-api-gateway.sh

# Add authorization (if not configured)
bash scripts/add-api-authorization.sh

See DEPLOYMENT_STATUS.md for complete deployment documentation.

Cleanup

To delete all AWS resources:

./scripts/cleanup-aws-resources.sh

Warning: This will permanently delete all uploaded files and metadata!

Project Structure

nexton_upload/
├── src/
│   ├── api/
│   │   └── server.js              # Express API server
│   ├── lambda/
│   │   ├── metadataExtractor.js   # Lambda function code
│   │   └── package.json           # Lambda dependencies
│   ├── utils/
│   │   ├── s3.js                  # S3 operations
│   │   ├── dynamodb.js            # DynamoDB operations
│   │   └── validation.js          # Input validation
│   └── tests/                     # Unit, integration, E2E tests
├── scripts/
│   ├── setup-aws-resources.sh     # Infrastructure setup
│   ├── deploy-api-gateway.sh      # API Gateway deployment
│   ├── add-api-authorization.sh   # Add Lambda authorizer
│   └── cleanup-aws-resources.sh   # Resource cleanup
├── package.json
├── .env.example
└── README.md

Author

Leonardo da Silva Calado