HDFS application to S3 using S3a connector
Asked by Subash Kunjupillai on 2021-10-07T00:48:30

I'm trying to understand the capabilities of the S3a connector for a use case where I have to run my current HDFS-based application on top of S3 storage without much change to the application.

From a quick look at the S3a documentation (https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), I understand that changing the filesystem connection URI to point to s3a instead of hdfs, and configuring the S3 endpoint and credentials, should be sufficient to make my current application work on top of S3. Is my understanding right?
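
For reference, here is a minimal sketch (not from the original post) of what that setup looks like in code, assuming the hadoop-aws module and a matching AWS SDK are on the classpath; the bucket name, endpoint and keys are placeholders, and fs.s3a.path.style.access is typically needed for local S3-compatible endpoints:

import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aSmokeTest {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Only the URI and the s3a-specific properties differ from the HDFS setup;
        // the application-level FileSystem calls stay the same.
        conf.set("fs.s3a.access.key", "EXAMPLE");
        conf.set("fs.s3a.secret.key", "EXAMPLEKEY");
        conf.set("fs.s3a.endpoint", "http://127.0.0.1:9000");   // placeholder S3-compatible endpoint
        conf.set("fs.s3a.path.style.access", "true");           // usually required for non-AWS endpoints
        conf.set("fs.s3a.connection.ssl.enabled", "false");     // plain-http endpoint in this sketch

        FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        // Application-level calls are unchanged from the HDFS version.
        Path p = new Path("/smoke-test.txt");
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.write("hello s3a".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("exists: " + fs.exists(p));
    }
}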

Updated:

I'm facing the error below while running my application; I'm not sure where I'm going wrong with the configuration.

Connection Handler:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConnectionHandler {

    private static Configuration config;
    private static FileSystem fs;   // cached connection, shared across callers

    public static FileSystem getConnection() throws IOException {
        if (fs == null) {
            config = new Configuration();
//          String uri = "hdfs://" + cluster + "/";
            String uri = "s3a://tmpBkt/";
            config.set("fs.defaultFS", uri);
            config.set("fs.s3a.access.key", "EXAMPLE");
            config.set("fs.s3a.secret.key", "EXAMPLEKEY");
            config.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
            config.set("fs.s3a.endpoint", "http://127.0.0.1:9000");
            fs = FileSystem.get(URI.create(uri), config);
        }
        return fs;
    }
}

Error:

java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
        at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:217)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2624)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2634)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at com.subash.hdfshandler.HdfsConnectionHandler.getConnection(HdfsConnectionHandler.java:29)
        at com.subash.datagenerator.DataHandler.run(DataHandler.java:27)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

Copyright notice: Content author: Subash Kunjupillai. Reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/69469686/hdfs-application-to-s3-using-s3a-connector
