
I am trying to read data from HDFS on a 7-node AWS EC2 cluster using a Jupyter Notebook. I am running HDP 2.4, and my code is below. The table has millions of rows, but the code returns no rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server running Ambari (ambari-server).

# HiveContext lives in pyspark.sql (importing SQLContext alone is not enough)
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

# read the CSV from HDFS via the spark-csv package
demography = sqlContext.read.load("hdfs://ec2-xx-xx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv",
                                  format="com.databricks.spark.csv",
                                  header="true",
                                  inferSchema="true")
demography.printSchema()
demography.cache()
print demography.count()

But using `sc.textFile`, I get the correct number of rows:

data = sc.textFile("hdfs://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv")
schema = data.map(lambda x: x.split(",")).first()   # get schema
header = data.first()                               # extract header
data = data.filter(lambda x: x != header)           # filter out header
data = data.map(lambda x: x.split(","))
data.count()
3641865
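
For what it's worth, those split rows can also be turned into a DataFrame without spark-csv at all; a minimal sketch, assuming every row has the same number of fields as the header (all columns come back as strings unless cast explicitly):

# Build a DataFrame from the already-split rows, using the header fields
# as column names. This bypasses spark-csv entirely.
columns = [c.strip() for c in header.split(",")]
demography_df = sqlContext.createDataFrame(data, schema=columns)
print demography_df.count()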
  • I'm not sure this question needs the jupyter tag. You can run the same code in the spark shell. – OneCricketeer Aug 02 '16 at 19:08
  • Oh, and PS, you should 1) not give the actual address of the cluster and 2) You **really** should change the default login ;) – OneCricketeer Aug 02 '16 at 19:10
  • Thanks. Actually, that happened when I copied the code from Jupyter. – Fisseha Berhane Aug 02 '16 at 19:34
  • Are you able to just `sc.textFile` the file? Maybe you just aren't loading it into the sqlContext correctly – OneCricketeer Aug 02 '16 at 19:39
  • Thanks again. I checked it and it gives the correct number of rows (see the `sc.textFile` code added to the question above). – Fisseha Berhane Aug 02 '16 at 19:45
  • Does `demography.printSchema()` give you the expected schema? – OneCricketeer Aug 02 '16 at 19:58
  • It gives the column names, but all are strings. Some of them should actually be integers. – Fisseha Berhane Aug 02 '16 at 20:10
  • I think you need to use `sqlContext.read.csv(path, header="true", inferSchema="true")` – OneCricketeer Aug 02 '16 at 20:39
  • Well, you need `from pyspark.sql import HiveContext` (note it lives in `pyspark.sql`, not `pyspark`). Other than that, your code looks good; I have the exact same code written myself and it works. I would think this is a data issue: try creating a simple new csv file and `-put` it in hadoop, then try to read it using the same code (a sketch follows this comment thread). If it works, the data is the issue. Perhaps the newline delimiter is the issue? – user3124181 Aug 02 '16 at 21:27
  • SO keeps all edits. You need to change the default login immediately (and maybe add a few rules to the Security Group). – shuaiyuancn Aug 03 '16 at 14:21
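
A minimal sketch of that sanity check, assuming a hypothetical test file (the file name and contents are made up; the shell steps run on an edge node, not inside Python):

# Sanity check from the comments: write a tiny CSV, copy it into HDFS,
# then read it back with the same spark-csv code.
#
# On an edge node, outside the notebook:
#   echo -e "id,age\n1,30\n2,45" > /tmp/test.csv
#   hdfs dfs -put /tmp/test.csv /tmp/test.csv
test_df = sqlContext.read.load("hdfs://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8020/tmp/test.csv",
                               format="com.databricks.spark.csv",
                               header="true",
                               inferSchema="true")
print test_df.count()   # should print 2 if spark-csv reads the file correctly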

1 Answer


The answer by Indrajit given here solved my problem: the problem was with the spark-csv jar.
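
For reference, a common way this goes wrong in a Jupyter + Spark 1.6 setup is that the notebook kernel starts Spark without the spark-csv package on the classpath. A minimal sketch of one way to supply it, assuming the Scala 2.10 build matching HDP 2.4's Spark 1.6 (the exact version number here is an assumption):

import os

# Must run before the SparkContext is created (e.g. the first notebook cell).
# com.databricks:spark-csv_2.10:1.4.0 is an assumed version for Spark 1.6;
# the trailing "pyspark-shell" token is required by PySpark.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell")

The same `--packages` flag can instead be passed directly to `pyspark` or `spark-submit` when launching from a shell.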

– Fisseha Berhane