
I have the following problem. I want to extract data from HDFS (a table called 'complaint'). I wrote the following script, which actually works:

import pandas as pd
from hdfs import InsecureClient
import os

file = open("test.txt", "wb")

print("Step 1")
client_hdfs = InsecureClient('http://XYZ')
print("Step 2")
# Read the first ~1 MB of raw bytes from the file and dump them to disk
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)
    print('end')
file.close()

My problem now is that the folder "complaint" contains 4 files (I don't know of which file type), and the read operation gives me back bytes that I can't use further. I saved them to a text file as a test, and it looks like this: txt_file

In HDFS it looks like this: hdfs directory
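
To find out what the four files actually are, something along these lines should work with the same client; the b'PAR1' check relies on the fact that Parquet files start with those magic bytes (the abbreviated path is the one from the script above):

from hdfs import InsecureClient

client_hdfs = InsecureClient('http://XYZ')

# Peek at the first bytes of every file in the table directory:
# Parquet files begin with the magic bytes b'PAR1', while Hive's
# default text format is plain delimiter-separated text.
for name in client_hdfs.list('/user/.../complaint'):
    with client_hdfs.read('/user/.../complaint/' + name) as reader:
        head = reader.read(4)
    print(name, 'parquet' if head == b'PAR1' else 'probably text')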

My question now is: is it possible to get the data separated for each column in a meaningful way?

I have only found solutions for .csv files and the like, and I'm somewhat stuck here... :-)
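
If the files turn out to be Hive's default TEXTFILE format, the columns are separated by the non-printable \x01 (Ctrl-A) character, which is why the dump looks garbled in an editor. A minimal sketch, assuming text storage (the abbreviated path is the one from the script above, and the column names would still have to be supplied by hand):

import io
import pandas as pd
from hdfs import InsecureClient

client_hdfs = InsecureClient('http://XYZ')

# Hive's default text storage separates columns with \x01 and rows
# with newlines; pandas can split on that delimiter directly.
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    raw = reader.read()

df = pd.read_csv(io.BytesIO(raw), sep='\x01', header=None)
print(df.head())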

EDIT I made changes to my solution and tried different approaches, but none of them really works. Here's the updated code:

import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive


#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient('http://some-adress:50070')
insec_client_tms2 = InsecureClient('http://some-adress:50070')

#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)
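#Note (per the last comment below): on a YARN-managed cluster the master
#would be setMaster('yarn') instead of a spark:// URL, and pyspark then
#needs HADOOP_CONF_DIR/YARN_CONF_DIR pointing at the cluster config files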

#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)

#Connection via HDFS3 (not working)
#The module couldn't be loaded
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)

#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but that also failed
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')
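#Note (per the last comment below): hive.Connection expects the HiveServer2
#host without an hdfs:// scheme and usually port 10000, not the namenode
#port, e.g. ('hiveserver-host' is a placeholder):
#conn = hive.Connection(host='hiveserver-host', port=10000, database='deltatest')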

#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open ("extraction.txt", "w")


#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)

#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)

#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
#    df = pd.read_parquet(f)


#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
#(see the buffered pyarrow workaround after this script)
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print(features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')
    print("saving data to file")
    file.write(features.to_string())
    print('end')

file.close()
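
The Arrow "seek" error in the last block happens because the WebHDFS reader is a forward-only stream, while Parquet readers need random access (they start from the file footer). A minimal workaround, assuming the file really is Parquet: buffer the whole file in memory first.

import io
import pyarrow.parquet as pq
from hdfs import InsecureClient

insec_client_tms2 = InsecureClient('http://some-adress:50070')

# Buffer the complete file in a seekable BytesIO object, then let
# pyarrow parse it; pq.read_table needs to seek to the Parquet footer.
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    buf = io.BytesIO(reader.read())

df = pq.read_table(buf).to_pandas()
print(df.head())
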
– Aquen
  • You're not meant to be able to read the file blocks as plaintext. Your files are owned by the Hive group, so are they part of a Hive table? – OneCricketeer Aug 16 '18 at 14:20
  • Our provider isn't very communicative about it all, but yeah, I think so. At least I can say (when browsing the directory) that it's like that: /user/hive/XYZ.db/complaint – Aquen Aug 16 '18 at 15:13
  • Can you connect to Hive and run `SHOW CREATE TABLE XYZ.complaint`? – OneCricketeer Aug 16 '18 at 15:51
  • We have no access to the node via putty or something like that. Only the web interface to lookup a little bit. Let's assume it is a hive table indeed (which would make the most sense in my opinion), what would be the way to achieve my problem? – Aquen Aug 16 '18 at 16:18
  • You can use `pyhive` or `pyspark` to connect to Hive. Or use other JDBC/ODBC tools. You don't need an SSH session. Anyway, my point here is that if the data is in Hive, it's not necessarily plaintext that Python can just read. It could be ORC/Parquet, or something else – OneCricketeer Aug 16 '18 at 18:21
  • OK, got a reply from the supplier: some tables (like those above) are stored as .txt and some as Parquet files – Aquen Aug 17 '18 at 06:33
  • If they are stored as text, then I think your current code should work. Otherwise, you would need a library capable of reading Parquet files. Therefore, as mentioned, Spark or Hive (or Pandas+SQL+PyHive) are reasonable options – OneCricketeer Aug 17 '18 at 13:55
  • Hi, ok thanks I'll give it a try and report :-) – Aquen Aug 18 '18 at 19:00
  • So I tried different approaches to receiving the data, but none of them really worked. I've edited the post above and made some remarks where the errors are (and what the error says). Maybe you can help here? I'm aware that I possibly make crucial mistakes somewhere, but I can't really figure out where :-/ – Aquen Aug 20 '18 at 08:21
  • For Spark, you're probably using YARN, not a standalone Spark master. You would need to use `setMaster('yarn')` (which requires additional setup for YARN, and Spark needs to know that Hive is being used and how to reach HDFS; you can't just run that after installing pyspark or extracting it from the Spark website). For the Hive connection: it doesn't use port 8020 or the namenode; you need the address of a HiveServer with port 10000 https://stackoverflow.com/a/45689705/2308683 PyArrow looks like it should work, but I've not used it – OneCricketeer Aug 20 '18 at 12:55
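
A minimal sketch of the PyHive route suggested in the last comment, assuming a reachable HiveServer2 ('hiveserver-host' and 'some-user' are placeholders, and the port 10000 default comes from the linked answer):

import pandas as pd
from pyhive import hive

# HiveServer2 usually listens on port 10000; the host is the Hive
# server itself, not the HDFS namenode, and takes no hdfs:// scheme.
conn = hive.Connection(host='hiveserver-host', port=10000,
                       username='some-user', database='deltatest')

df = pd.read_sql('SELECT * FROM complaint LIMIT 100', conn)
print(df.head())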

0 Answers