
I have a dataset in a raw text file (it is a log file). I am building a Python list by reading this file line by line, and from that list I will create a DataFrame using PySpark. As you can see in the dataset, some values are missing from their respective columns, and I want to fill them with "NA". Below is a sample of the dataset; a missing value can be in any column, and columns are separated by whitespace.

==============================================
empcode   Emnname   Date       DESC
12d      sf        2018-02-06      dghsjf  
asf2     asdfw2    2018-02-16      fsfsfg  
dsf21    sdf2      2016-02-06      sdgfsgf
sdgg     dsds      dkfd-sffddfdf   aaaa
dfd      gfg       dfsdffd         aaaa
df                 dfdf            efef
4fr                                freff         
----------------------------------------------

My code:

path="something/demo.txt"
EndStr="----------------------------------------------"
FilterStr="=============================================="
findStr="empcode   Emnname"

def PrepareList(findStr):
  with open(path) as f:
    out=[]
    for line in f:
        if line.rstrip()==Findstr:
            #print(line)
            tmp=[]
            tmp.append(re.sub("\s+",",",line.strip()))
            #print(tmp)
            for line in f:
                if line.rstrip()==EndStr:
                    out.append(tmp)
                    break

                tmp.append(re.sub("\s+",",",line.strip()))
            return (tmp)
  f.close()  
LstEmp=[]
LstEmp=prepareDataset("empcode   Emnname   Dept   DESC")
print(LstEmp)

My output is:

['empcode,Emnname,Date,DESC', 
 '12d,sf,2018-02-06,dghsjf',
 'asf2,asdfw2,2018-02-16,fsfsfg',
 'dsf21,sdf2,2016-02-06,sdgfsgf',
 'sdgg,dsds,dkfd-sffddfdf,aaaa',
 'dfd,gfg,dfsdffd,aaaa',
 'df,dfdf,efef',
 '4fr,freff']  

Expected output:

['empcode,Emnname,Date,DESC', 
 '12d,sf,2018-02-06,dghsjf',
 'asf2,asdfw2,2018-02-16,fsfsfg',
 'dsf21,sdf2,2016-02-06,sdgfsgf',
 'sdgg,dsds,dkfd-sffddfdf,aaaa',
 'dfd,gfg,dfsdffd,aaaa',
 'df,NA,dfdf,efef',
 '4fr,NA,NA,freff']
Rudrashis
  • It looks like you have a fixed-width format file. Take a look at [this question](https://stackoverflow.com/questions/41944689/pyspark-parse-text-file), which you may find helpful. Essentially you read the file as one big column and then split the text by string position (see the sketch below). – pault May 18 '18 at 20:00
  • @pault Thanks for your comment... no, it is different. In my case there is a space between two columns, but the columns don't have a fixed length. – Rudrashis May 18 '18 at 20:13
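
A minimal sketch of the approach pault describes, assuming fixed column positions and an existing SparkSession; the column boundaries below are illustrative guesses, not taken from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import trim

spark = SparkSession.builder.getOrCreate()

# Read the whole file as a single string column named "value" ...
df = spark.read.text("something/demo.txt")

# ... then slice each line by position and trim the padding.
# substr() is 1-based; the boundaries here are guesses for the sample data.
parsed = df.select(
    trim(df.value.substr(1, 10)).alias("empcode"),
    trim(df.value.substr(11, 10)).alias("Emnname"),
    trim(df.value.substr(21, 16)).alias("Date"),
    trim(df.value.substr(37, 64)).alias("DESC"),
)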

2 Answers


From the dataset it appears that while the text in each field varies in length, the fields themselves start and end at fixed positions. This usually happens with tab-separated fields.

==============================================
empcode   Emnname   Date       DESC
12d      sf        2018-02-06      dghsjf  
asf2     asdfw2    2018-02-16      fsfsfg  
dsf21    sdf2      2016-02-06      sdgfsgf

If this is the case, the following should work:

for line in f:
    if line.rstrip() == findStr:   # was Findstr
        tmp = []
        tmp.append(re.sub("\t", ",", line.strip()))
        for line in f:
            if line.rstrip() == EndStr:
                out.append(tmp)
                break
            tmp.append(re.sub("\t", ",", line.strip()))
        return tmp

I have replaced \s in your code with \t and removed the +. In a Python regex, + matches one or more occurrences of the preceding pattern; here \s+ swallows the whole whitespace run from the end of one field to the start of the next, so an empty field disappears entirely.
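
A minimal demonstration of the difference (the sample strings are illustrative):

import re

# "\s+" collapses the whole whitespace run into one comma,
# so an empty middle field vanishes:
re.sub(r"\s+", ",", "df                 dfdf         efef")
# -> 'df,dfdf,efef'  (3 fields instead of 4)

# "\t" replaces each tab individually, so an empty field between
# two consecutive tabs survives:
re.sub("\t", ",", "df\t\tdfdf\tefef")
# -> 'df,,dfdf,efef'  (4 fields, the second one empty)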

Alternatively, if the input file is not tab-separated, you can extract the field values by treating the fields as fixed-length and then calling strip() on each slice:

fields = [ (0,10),
           (10, 20),
           (20,36),
           (36,100) # Assuming last field will not cross this length
         ]
field_values = [ line[ x[0]:x[1] ].strip() for x in fields ]
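
For example, applied to a line padded to those boundaries (illustrative only, since the spans above are assumptions about the column layout):

fields = [(0, 10), (10, 20), (20, 36), (36, 100)]

line = "12d       sf        2018-02-06      dghsjf"
field_values = [line[x[0]:x[1]].strip() for x in fields]
# field_values == ['12d', 'sf', '2018-02-06', 'dghsjf']

# A blank column strips down to '', which can then be mapped to "NA":
line = "df                  dfdf            efef"
row = ",".join(v if v else "NA" for v in (line[x[0]:x[1]].strip() for x in fields))
# row == 'df,NA,dfdf,efef'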
adityaj
  • Thanks for your response!!! Yes, I am getting the values using your code, as below: ['empcode Emnname Date DESC', '12d sf 2018-02-06 dghsjf', 'asf2 asdfw2 2018-02-16 fsfsfg', 'dsf21 sdf2 2016-02-06 sdgfsgf', 'sdgg dsds dkfd-sffddfdf aaaa', 'dfd gfg dfsdffd aaaa', 'df dfdf efef', '4fr freff'], each individual line as a string. How can I create a dataframe from this, and how can I use individual columns for the next calculation? – Rudrashis May 19 '18 at 20:04

Here I have tried to follow a general approach where you won't have to pre-program the column spans in your code. For returning a dataframe you can use pd.read_csv with StringIO. Kindly modify the path as per your file location. This code is extended from your code to make it simple for you to understand; otherwise there are more elegant ways to write the same logic.

import re
import pandas as pd
from io import StringIO

path = "/home/clik/clik/demo.txt"
EndStr = "------------------------------"
FilterStr = "=================="
FindStr = "empcode   Emnname"


def match(sp1, sp2):
    # Score how well a token span sp1 lines up with a header column span sp2.
    # disjunct >= 0 means the spans do not overlap at all; in that case score
    # by how far apart their centres are (more negative = worse match).
    disjunct = max(sp1[0] - sp2[1], sp2[0] - sp1[1])
    if disjunct >= 0:
        return -abs((sp1[0] + sp1[1]) / 2.0 - (sp2[0] + sp2[1]) / 2.0)
    # Overlapping spans: return the overlap as a positive fraction (0..1].
    return float(disjunct) / min(sp1[0] - sp2[1], sp2[0] - sp1[1])


def PrepareList():
    with open(path) as f:
        out = []
        for i, line in enumerate(f):
            if line.rstrip().startswith(FindStr):
                tmp = []
                # Remember where each header column starts and ends.
                col_spans = [m.span() for m in re.finditer(r"[^\s][^\s]+", line)]
                tmp.append(re.sub(r"\s+", ",", line.strip()))
                for line in f:
                    if line.rstrip().startswith(EndStr):
                        out.append(tmp)
                        break
                    row = [None] * len(col_spans)
                    # Assign every token to the header column it aligns with best;
                    # a column that receives no token stays None and becomes "NA".
                    for m in re.finditer(r"[^\s][^\s]+", line):
                        colmatches = [match(m.span(), cspan) for cspan in col_spans]
                        max_index = max(enumerate(colmatches), key=lambda e: e[1])[0]
                        row[max_index] = m.group() if row[max_index] is None else (row[max_index] + ' ' + m.group())
                    tmp.append(','.join(['NA' if e is None else e for e in row]))
                # For a pandas dataframe:
                # return pd.read_csv(StringIO('\n'.join(tmp)))

                # For a list of tuples, split each CSV line back into fields
                # (tuple(s) alone would split the string into characters):
                return [tuple(s.split(',')) for s in tmp]
                # For a list of lists:
                # return tmp


LstEmp = PrepareList()

For converting a list of tuples to a PySpark dataframe, here is a tutorial: http://bigdataplaybook.blogspot.in/2017/01/create-dataframe-from-list-of-tuples.html
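
A minimal sketch of that conversion, assuming an existing SparkSession and that the first tuple returned is the header row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = PrepareList()                    # list of tuples; rows[0] is the header
header, data = list(rows[0]), rows[1:]

# Column names come from the parsed header line; all values stay strings.
df = spark.createDataFrame(data, schema=header)
df.show()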

Alok Nayak
  • Thanks for your response. I am getting an error when I call the function "PrepareList". Code: LstEmp = [] LstEmp = PrepareList("empcode Emnname Date DESC") gives the error: disjunct = max(sp1[0] - sp2[1], sp2[0] - sp1[1]) TypeError: _() takes exactly 1 argument (2 given) – Rudrashis May 21 '18 at 06:53
  • Thank you very much, it is working... I want to update this code a little bit, please help me with it... 1. How will it work, or what do I have to change, if a particular column value contains 2 words, e.g. the Emnname value in a particular row is "Name Surname" (example: Alok Nayak)? The current code only takes the last word; how do I handle this situation? 2. After the header there is a line "==============================================" ; how do I filter that line out? – Rudrashis May 21 '18 at 10:56
  • I solved the 2nd problem, so please ignore it... can you please help me with the 1st point? – Rudrashis May 21 '18 at 14:34
  • @Rudrashis can you provide an example demo.txt for me to write a workable solution for this problem? You can edit the question if it's not possible to comment it here. – Alok Nayak May 22 '18 at 04:47
  • Assuming that words within the same column field are always separated by 1 space while fields in different columns are separated by more than 1 space, I have updated the header and field regex to accommodate single-space-separated headers and fields. – Alok Nayak May 22 '18 at 05:01
  • Or you can keep the regex the same and update row[max_index] if it has already been assigned; previously I was replacing it. – Alok Nayak May 22 '18 at 05:35
  • Thanks for your response. I just kept the last program and changed the 3 lines you changed: colmatches = [match(m.span(), cspan) for cspan in col_spans] max_index = max(enumerate(colmatches), key=lambda e: e[1])[0] row[max_index] = m.group() if row[max_index] is None else (row[max_index] + ' ' + m.group())... – Rudrashis May 24 '18 at 06:48
  • How can I do this using PySpark? – Rudrashis May 24 '18 at 06:58
  • I have provided a link for converting a list of tuples to PySpark; I hope it helps. – Alok Nayak May 25 '18 at 08:58