
I am trying to read data from HDFS on a 7-node AWS EC2 cluster using a Jupyter Notebook. I am running HDP 2.4, and my code is below. The table has millions of rows, but the code returns no rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server running Ambari (ambari-server).

# HiveContext lives in pyspark.sql (importing SQLContext alone is not enough)
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

# read the CSV from HDFS via the spark-csv package
demography = sqlContext.read.load("hdfs://ec2-xx-xx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv",
                                  format="com.databricks.spark.csv",
                                  header="true",
                                  inferSchema="true")
demography.printSchema()
demography.cache()
print demography.count()

But using `sc.textFile`, I get the correct number of rows:

data = sc.textFile("hdfs://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv")
schema = data.map(lambda x: x.split(",")).first()   # get schema
header = data.first()                               # extract header
data = data.filter(lambda x: x != header)           # filter out header
data = data.map(lambda x: x.split(","))
data.count()
3641865
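
For what it's worth, those split rows can also be turned into a DataFrame without spark-csv at all; a minimal sketch, assuming every row has the same number of fields as the header (all columns come back as strings unless cast explicitly):

# Build a DataFrame from the already-split rows, using the header fields
# as column names. This bypasses spark-csv entirely.
columns = [c.strip() for c in header.split(",")]
demography_df = sqlContext.createDataFrame(data, schema=columns)
print demography_df.count()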
  • I'm not sure this question needs the jupyter tag. You can run the same code in the spark shell. – OneCricketeer Aug 02 '16 at 19:08
  • Oh, and PS, you should 1) not give the actual address of the cluster and 2) You **really** should change the default login ;) – OneCricketeer Aug 02 '16 at 19:10
  • Thanks. Actually, that happened when I copied the code from Jupyter. – Fisseha Berhane Aug 02 '16 at 19:34
  • Are you able to just `sc.textFile` the file? Maybe you just aren't loading it into the sqlContext correctly – OneCricketeer Aug 02 '16 at 19:39
  • Thanks again. I checked it and it gives the correct number of rows (see the `sc.textFile` code added to the question above). – Fisseha Berhane Aug 02 '16 at 19:45
  • Does `demography.printSchema()` give you the expected schema? – OneCricketeer Aug 02 '16 at 19:58
  • It gives the column names, but all are strings. Some of them should actually be integers. – Fisseha Berhane Aug 02 '16 at 20:10
  • I think you need to use `sqlContext.read.csv(path, header="true", inferSchema="true")` – OneCricketeer Aug 02 '16 at 20:39
  • Well, you need `from pyspark.sql import HiveContext` (note it lives in `pyspark.sql`, not `pyspark`). Other than that, your code looks good; I have the exact same code written myself and it works. I would think this is a data issue: try creating a simple new csv file and `-put` it in hadoop, then try to read it using the same code (a sketch follows this comment thread). If it works, the data is the issue. Perhaps the newline delimiter is the issue? – user3124181 Aug 02 '16 at 21:27
  • SO keeps all edits. You need to change the default login immediately (and maybe add a few rules to the Security Group). – shuaiyuancn Aug 03 '16 at 14:21
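
A minimal sketch of that sanity check, assuming a hypothetical test file (the file name and contents are made up; the shell steps run on an edge node, not inside Python):

# Sanity check from the comments: write a tiny CSV, copy it into HDFS,
# then read it back with the same spark-csv code.
#
# On an edge node, outside the notebook:
#   echo -e "id,age\n1,30\n2,45" > /tmp/test.csv
#   hdfs dfs -put /tmp/test.csv /tmp/test.csv
test_df = sqlContext.read.load("hdfs://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8020/tmp/test.csv",
                               format="com.databricks.spark.csv",
                               header="true",
                               inferSchema="true")
print test_df.count()   # should print 2 if spark-csv reads the file correctly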

1 Answer


The answer by Indrajit given here solved my problem: the problem was with the spark-csv jar.
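
For reference, a common way this goes wrong in a Jupyter + Spark 1.6 setup is that the notebook kernel starts Spark without the spark-csv package on the classpath. A minimal sketch of one way to supply it, assuming the Scala 2.10 build matching HDP 2.4's Spark 1.6 (the exact version number here is an assumption):

import os

# Must run before the SparkContext is created (e.g. the first notebook cell).
# com.databricks:spark-csv_2.10:1.4.0 is an assumed version for Spark 1.6;
# the trailing "pyspark-shell" token is required by PySpark.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell")

The same `--packages` flag can instead be passed directly to `pyspark` or `spark-submit` when launching from a shell.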

– Fisseha Berhane