I have a dataset in a raw text file (it's a log file). I read the file line by line into a Python list, and from that list I will create a DataFrame using PySpark. As the sample shows, some values are missing from their columns, and I want to fill them with "NA". This is a sample of the dataset; a missing value can be in any column, and columns are separated by whitespace.
==============================================
empcode Emnname Date DESC
12d sf 2018-02-06 dghsjf
asf2 asdfw2 2018-02-16 fsfsfg
dsf21 sdf2 2016-02-06 sdgfsgf
sdgg dsds dkfd-sffddfdf aaaa
dfd gfg dfsdffd aaaa
df dfdf efef
4fr freff
----------------------------------------------
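My code:
import re

path = "something/demo.txt"
EndStr = "----------------------------------------------"
FilterStr = "=============================================="
findStr = "empcode Emnname"

def PrepareList(findStr):
    # "with" closes the file automatically, so no explicit f.close() is needed
    with open(path) as f:
        out = []
        for line in f:
            # start collecting once the header line is found (must match exactly)
            if line.rstrip() == findStr:
                tmp = []
                # collapse each run of whitespace into a single comma
                tmp.append(re.sub(r"\s+", ",", line.strip()))
                for line in f:
                    # stop at the dashed end marker
                    if line.rstrip() == EndStr:
                        out.append(tmp)
                        break
                    tmp.append(re.sub(r"\s+", ",", line.strip()))
                return tmp

# the argument has to be the full header line as it appears in the file
LstEmp = PrepareList("empcode Emnname Date DESC")
print(LstEmp)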
My output is:
['empcode,Emnname,Date,DESC',
'12d,sf,2018-02-06,dghsjf',
'asf2,asdfw2,2018-02-16,fsfsfg',
'dsf21,sdf2,2016-02-06,sdgfsgf',
'sdgg,dsds,dkfd-sffddfdf,aaaa',
'dfd,gfg,dfsdffd,aaaa',
'df,dfdf,efef',
'4fr,freff']
Expected output:
['empcode,Emnname,Date,DESC',
'12d,sf,2018-02-06,dghsjf',
'asf2,asdfw2,2018-02-16,fsfsfg',
'dsf21,sdf2,2016-02-06,sdgfsgf',
'sdgg,dsds,dkfd-sffddfdf,aaaa',
'dfd,gfg,dfsdffd,aaaa',
'df,NA,dfdf,efef',
'4fr,NA,NA,freff']
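
The padding rule that the expected output seems to imply can be sketched as follows. This is only one reading of it, under an assumption: the first token always stays in the first column (empcode) and the remaining tokens are right-aligned, with "NA" filling the gap. With purely whitespace-separated data there is no reliable way to tell which column a value is really missing from, so this rule is a guess that happens to match the two short rows above. The names pad_row, n_cols and fill are hypothetical helpers, not part of my code.

```python
def pad_row(row, n_cols, fill="NA"):
    """Pad a comma-joined row so that it has n_cols fields.

    Assumed rule (inferred from the expected output, not known to be
    correct for every log line): keep the first token in column one and
    right-align the remaining tokens, inserting `fill` in between.
    """
    fields = row.split(",")
    missing = n_cols - len(fields)
    if missing <= 0:
        # row already has all of its columns
        return row
    return ",".join(fields[:1] + [fill] * missing + fields[1:])


rows = [
    'empcode,Emnname,Date,DESC',
    '12d,sf,2018-02-06,dghsjf',
    'df,dfdf,efef',
    '4fr,freff',
]

# the header row defines how many columns every row should have
n_cols = len(rows[0].split(","))
padded = [pad_row(r, n_cols) for r in rows]
print(padded)
```

Rows that already have four fields pass through unchanged; only the shorter rows get "NA" inserted after the first field.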