How to read csv file with field containing multiple lines in pyspark

Question

I want to read a CSV file in which one field contains multiple line entry so whenever I split it using lambda function as shown below in the code and try to use different columns, the code gives me error like list index out of bound. My code

from pyspark import SparkContext, SparkConf
import spark

print("What you want to analyze:")
print("1. No. of patients from same hospital")
print("2. No. of patients from a particular city")
choice=int(input())

conf = SparkConf().setMaster("local").setAppName("Word_count")
conf.set("spark.driver.memory", "3G")
rdd=spark.read.option("wholeFile", true).option("multiline",true).option("header", true).option("inferSchema", "true").csv("/home/karan/HospInfo3.csv")
if(choice==1):
    print("No. of patients from same hospital is:-")
    diss=rdd.map(lambda x:x.split(",")).\
            map(lambda x:(x[1],1)).\
            reduceByKey(lambda x,y:x+y).\
            map(lambda x:(x[1],x[0])).\
            sortByKey(ascending=False).\
            map(lambda x:(x[1],x[0]))
    rs=diss.collect()
elif(choice==2):
    print("No. of patients from a particular city is:-")
    diss=rdd.map(lambda x:x.split(",")).\
            map(lambda x:(x[3],1)).\
            reduceByKey(lambda x,y:x+y).\
            map(lambda x:(x[1],x[0])).\
            sortByKey(ascending=False).\
            map(lambda x:(x[1],x[0]))
    rs=diss.collect()
if os.path.exists("/home/karan/output2"):
    shutil.rmtree("/home/karan/output2")
for x in rs:
    print(str(x[0])+"\t\t"+str(x[1]))
sc.parallelize(rs).saveAsTextFile("output2")

Also, I am having following error:-

Traceback (most recent call last):
  File "hospitalsDemo2.py", line 11, in <module>
    rdd=spark.read.option("wholeFile", true).option("multiline",true).option("header", true).option("inferSchema", "true").csv("/home/karan/HospInfo3.csv")
AttributeError: 'module' object has no attribute 'read'

And with the SparkCOntext method of reading it as a CSV file, it gives the following error(where all this started):-

Traceback (most recent call last):
  File "hospitalsDemo2.py", line 17, in <module>
    reduceByKey(lambda x,y:x+y).\
  File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 1640, in reduceByKey
    return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
  File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 1868, in combineByKey
    numPartitions = self._defaultReducePartitions()
  File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 2278, in _defaultReducePartitions
    return self.getNumPartitions()
  File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 2474, in getNumPartitions
    return self._prev_jrdd.partitions().size()
  File "/home/karan/.local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/karan/.local/lib/python2.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: Illegal file pattern: \k is not followed by '<' for named capturing group near index 7
\home\karan\hospInfo\.csv
       ^
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71)
    at org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:192)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.regex.PatternSyntaxException: \k is not followed by '<' for named capturing group near index 7
\home\karan\hospInfo\.csv
       ^
    at java.util.regex.Pattern.error(Pattern.java:1957)
    at java.util.regex.Pattern.escape(Pattern.java:2413)
    at java.util.regex.Pattern.atom(Pattern.java:2200)
    at java.util.regex.Pattern.sequence(Pattern.java:2132)
    at java.util.regex.Pattern.expr(Pattern.java:1998)
    at java.util.regex.Pattern.compile(Pattern.java:1698)
    at java.util.regex.Pattern.<init>(Pattern.java:1351)
    at java.util.regex.Pattern.compile(Pattern.java:1028)
    at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
    at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
    ... 29 more

CSV sample

CSV opened as a text file

Sample easy to use

Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Hospital Type,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote,Location

10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,Acute Care Hospitals,Government - Hospital District or Authority,,Same as the national average,,Above the national average,,Below the national average,,"2505 US HIGHWAY 431 NORTH
 BOAZ, AL
 "

10012,DEKALB REGIONAL MEDICAL CENTER,200 MED CENTER DRIVE,FORT PAYNE,AL,35968,DE KALB,2568453150,Acute Care Hospitals,Proprietary,,Above the national average,,Same as the national average,,"200 MED CENTER DRIVE
 FORT PAYNE, AL
 "

10032,WEDOWEE HOSPITAL,209 NORTH MAIN STREET,WEDOWEE,AL,36278,RANDOLPH,2563572111,Acute Care Hospitals,Government - Hospital District or Authority,,Not Available,Results are not available for this reporting period,"209 NORTH MAIN STREET
 WEDOWEE, AL
 "

10131,CRESTWOOD MEDICAL CENTER,ONE HOSPITAL DR SE,HUNTSVILLE,AL,35801,MADISON,2568823100,Acute Care Hospitals,Proprietary,,Same as the national average,,Same as the national average,,"ONE HOSPITAL DR SE
 HUNTSVILLE, AL
 "

I copied your file to my system, its not tab delimited either. Pls follow this: https://meta.stackoverflow.com/questions/275458/how-do-i-paste-data-on-stack-overflow — steven, Sep 17 '18 at 06:41
@user3243478 I hope now it's clear. Sorry for posting it stupidly. — Karan Agarwal, Sep 17 '18 at 06:59
@KaranAgarwal: You can't just use `import spark`. You need to actually create a `SparkSession`. See for example this answer on how to do that+reading a csv: https://stackoverflow.com/a/41379096/7579547 — Shaido, Sep 17 '18 at 07:04
Ohh, Thank you @Shaido can you resolve this issue too.. csv() got an unexpected keyword argument 'multiline' — Karan Agarwal, Sep 17 '18 at 08:26
@KaranAgarwal: Try `.option("multiLine", True)`, note capital L and T. — Shaido, Sep 17 '18 at 08:29
@KaranAgarwal: Happy you solved it :) You can probably delete this question ^^ — Shaido, Sep 17 '18 at 09:00

How to read csv file with field containing multiple lines in pyspark

0 Answers0