The data is as below.
[Row(_c0='ACW00011604 17.1167 -61.7833 10.1 ST JOHNS COOLIDGE FLD '),
Row(_c0='ACW00011647 17.1333 -61.7833 19.2 ST JOHNS '),
Row(_c0='AE000041196 25.3330 55.5170 34.0 SHARJAH INTER. AIRP GSN 41196')]
I have defined schema_stn with the correct column widths etc. as per the documentation. My code for reading it into a DataFrame using PySpark is as follows:
df.select(
    df.value.substr(1, 11).alias('ID'),
    df.value.substr(13, 8).alias('LATITUDE'),    # substr takes (startPos, length), not (start, end)
    df.value.substr(22, 9).alias('LONGITUDE'),
    df.value.substr(32, 6).alias('ELEVATION'),
    df.value.substr(39, 2).alias('STATE'),
    df.value.substr(42, 30).alias('NAME'),
    df.value.substr(73, 3).alias('GSN_FLAG'),
    df.value.substr(77, 3).alias('HCN_CRN_FLAG'),
    df.value.substr(81, 5).alias('WMO_ID'))
df = (sqlContext.read.format("csv")
      .schema(schema_stn)
      .option("delimiter", " ")
      .load("hdfs:////data/stn"))
df.cache()
df.show(3)
I am getting the following output:
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
| ID|LATITUDE|LONGITUDE|ELEVATION| STATE| NAME|GSN FLAG|HCN/CRN FLAG|WMO ID|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
|ACW00011604| null| 17.1167| null|-61.7833| null| null| 10.1| null|
|ACW00011647| null| 17.1333| null|-61.7833| null| null| 19.2| null|
|AE000041196| null| 25.333| null| null|55.5170| null| null| 34.0|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
I am not able to get rid of these nulls (which come from the runs of whitespace in the fixed-width file). What am I missing here?
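For reference, the fixed-width offsets can be sanity-checked on a sample row without Spark. This is a minimal sketch, assuming the GHCND-style station layout implied by the substr positions above (1-based inclusive column ranges); note Python slices are 0-based, so each start position is shifted by one:

```python
# 1-based inclusive (start, end) column positions, as in the substr() calls above.
SPEC = {
    "ID": (1, 11),
    "LATITUDE": (13, 20),
    "LONGITUDE": (22, 30),
    "ELEVATION": (32, 37),
    "STATE": (39, 40),
    "NAME": (42, 71),
    "GSN_FLAG": (73, 75),
    "HCN_CRN_FLAG": (77, 79),
    "WMO_ID": (81, 85),
}

def parse_fixed_width(line):
    # Convert 1-based inclusive ranges to 0-based Python slices and trim padding.
    return {name: line[start - 1:end].strip() for name, (start, end) in SPEC.items()}

# First sample row, with its original fixed-width padding restored.
row = "ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD"
print(parse_fixed_width(row))
```

Slicing a text line this way (after reading with read.text rather than read.csv) avoids the delimiter problem entirely, since consecutive spaces are never treated as empty fields.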