
I have a file on my server at location

/user/data/abc.csv

I need to create a Hive table on top of the data in this file, so I need to move the file to the HDFS location

/user/hive/warehouse/xyz.db

How can we do that using Python?

  • https://stackoverflow.com/a/51548097/2308683 – OneCricketeer Aug 09 '18 at 22:12
  • You could use PySpark to read your local file, and write to a Hive table. – OneCricketeer Aug 09 '18 at 22:12
  • @cricket_007 I want to implement it using PySpark. I plan to create a Hive table on top of my file; that's the reason I want to move it from my server location to an HDFS location. I can write a shell command -copyFromLocal, but I want to do this using Python in PySpark. How do I do it? – Aakib Aug 11 '18 at 00:53
  • Again, see the first link... Big long list of Python libraries to interact with HDFS. However, `saveAsTable` works fine in PySpark, so I ask - what have you tried? What errors are you getting? – OneCricketeer Aug 11 '18 at 03:40
  • Things that I tried: `subprocess.call(['hdfs', 'dfs', '-copyFromLocal', '/u/data/abc.csv', 'hdfs://user/hive/warehouse/class.db/abc.csv'], shell=True)` – Error: No alias specified and no default alias found. 2nd try: `shutil.copy('/user/adam/data//abc.csv', 'hdfs://user/hive/warehouse/class.db/class/abc.csv')` – Aakib Aug 11 '18 at 18:39
  • `shutil` can't access HDFS paths. The first is correct, assuming the `hdfs` command is on your OS `PATH`, but again, you've not tried Spark? – OneCricketeer Aug 11 '18 at 22:44
  • @cricket_007 I am writing this code after initiating my pyspark engine. – Aakib Aug 12 '18 at 01:52
  • Neither `subprocess` nor `shutil` uses a Spark context... Like I mentioned, you want to use the `saveAsTable` function from Spark – OneCricketeer Aug 12 '18 at 15:10
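
For reference, here is a minimal PySpark sketch of the `saveAsTable` approach suggested in the comments. The CSV path and the xyz database come from the question; the table name and session settings are illustrative assumptions:

from pyspark.sql import SparkSession

# Hive support lets saveAsTable create a managed table under /user/hive/warehouse.
spark = (SparkSession.builder
         .appName("csv-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Read the local CSV; the file:// prefix keeps Spark from resolving the path against HDFS.
df = spark.read.csv("file:///user/data/abc.csv", header=True, inferSchema=True)

# Write it as a Hive table in the xyz database; Spark copies the data into the
# warehouse directory (/user/hive/warehouse/xyz.db) itself, so no manual copy is needed.
df.write.mode("overwrite").saveAsTable("xyz.my_table")

Note that a `file://` path only works when the file is visible to the Spark driver and executors, so this is simplest when Spark runs in local mode on the same server that holds the file.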

2 Answers


First, you need to retrieve the file from the server. Use this Python code to retrieve it to your local machine:

import ftplib

path = '/user/data/'
filename = 'abc.csv'

ftp = ftplib.FTP("Server IP")        # server hostname or IP address
ftp.login("UserName", "Password")
ftp.cwd(path)
# Download the file from the server to the local machine, closing the file when done.
with open(filename, 'wb') as f:
    ftp.retrbinary("RETR " + filename, f.write)
ftp.quit()

Once the file is downloaded locally, run the usual Hive queries: either load the data directly from local, or put the data into HDFS first and then load it into Hive.

Load data directly from local into Hive:

LOAD DATA LOCAL INPATH '/user/data/abc.csv' INTO TABLE <table name>;

Load data to HDFS:

hadoop fs -copyFromLocal ~/user/data/abc.csv /your/hdfs/path

Then load it into Hive using a Hive query:

LOAD DATA INPATH '/your/hdfs/path' INTO TABLE <table name>;
ArunTnp
  • I don't want to write shell commands; I am within PySpark boundaries, so I need Python code to copy my file. – Aakib Aug 11 '18 at 00:54
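
Since the asker wants to stay in Python rather than write shell commands, a rough `subprocess` sketch of the same two steps could look like this (the table name is a placeholder; it assumes the `hdfs` and `hive` CLIs are on the PATH):

import subprocess

local_file = '/user/data/abc.csv'
hdfs_dir = '/user/hive/warehouse/xyz.db'   # target HDFS directory from the question

# Copy the file from the local filesystem into HDFS.
subprocess.check_call(['hdfs', 'dfs', '-copyFromLocal', local_file, hdfs_dir])

# Load it into the Hive table (table name is a placeholder).
load_stmt = "LOAD DATA INPATH '{}/abc.csv' INTO TABLE xyz.my_table".format(hdfs_dir)
subprocess.check_call(['hive', '-e', load_stmt])

Passing the command as a list without `shell=True` matters here: with `shell=True` and a list, only the first element is actually run as the command, so the `-copyFromLocal` arguments in the comment's earlier attempt never reached `hdfs`.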

The `hadoop fs -put` command can be used to put a file from the local file system into HDFS.
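
If you prefer to stay in Python rather than shelling out to `hadoop fs -put`, the same upload can also be done over WebHDFS with the `hdfs` package; the NameNode URL, user, and target path below are illustrative assumptions:

from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (host and port are placeholders).
client = InsecureClient('http://namenode-host:50070', user='hive')

# Upload the local CSV into the warehouse directory for the xyz database.
client.upload('/user/hive/warehouse/xyz.db/abc.csv', '/user/data/abc.csv')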

Prashant