-1

I have a dataframe with missing values. How can I write either a python or an R code to replace empty spaces with 0, a single string with 1, and multiple strings joined by "\t" with a number corresponding to how many "\t"s + 1.

my data frame:

        col1    col2    col3
row1    5blue   2green5 white
row2            white   green\twhite3\t3blue5
row3    blue3           white
row4    7blue   green2  
row5            3green  3white6
row6    6blue   green\t6white7  green   
row7    5blue5  6green  white
row8    blue6

Output expected:

        col1    col2    col3
row1    1   1   1
row2    0   1   3
row3    1   0   1
row4    1   1   0   
row5    0   1   1
row6    1   2   1   
row7    1   1   1
row8    1   0   0   

Any ideas? Thanks

x squared
  • 3,173
  • 1
  • 26
  • 41
user27976
  • 903
  • 3
  • 17
  • 28

3 Answers3

2

Parsing Tab Delimited

Read this post above. It covers using python csv module to parse tab delimited. I think it will help you.

Input File data_frame.txt

5blue   2green5 white
    white   green\twhite3\t3blue5
blue3       white
7blue   green2  
    3green  3white6
6blue   green\t6white7  green
5blue5  6green  white

The code below

import csv

data_frame = open('data_frame.txt','r')             ## create input file for dataframe
output_matrix = []                                  ## output matrix
reader = csv.reader(data_frame, dialect="excel-tab")  ## Setup tab delimter file

for line in reader:                                 ## Read each line in the data frame
    out_line = []                                   ## Setup temp out-line var
    for item in line:

        if item == '':                              ## If item in line is null then put zero
            out_line.append(0)
        elif r"""\t""" in item:                     ## if item in line contains a "\t" character then put count + 1
            out_line.append(item.count(r"""\t""")+1)
        else:                                       ## Else item is 1
            out_line.append(1)
    output_matrix.append(out_line)                  ## Append line into output matrix

for line in output_matrix:
     print line                     ## Print output matrix

This code should work... you just have to output the output_matrix to a csv file.

Output

[1, 1, 1]
[0, 1, 3]
[1, 0, 1]
[1, 1, 0]
[0, 1, 1]
[1, 2, 1]
[1, 1, 1]
Community
  • 1
  • 1
bud
  • 485
  • 6
  • 22
  • Just delete the row and column headers if they exist in your csv file. Otherwise, you can modify my code to incorporate them. – bud Aug 18 '15 at 21:48
  • Straightforward python code! Thanks a lot @budder. I have gained more python knowledge on how to deal with this kind of situation. – user27976 Aug 19 '15 at 22:32
2

I'm using a function that goes to each column element and checks if the element is a space (You can change that depending on what you have. It looks likes a space to me) and returns 0 if it is, otherwise it splits the string by "\t" and counts the strings that are produced.

# example dataset
dt = data.frame(col1 = c("green\twhite3\t3blue5","green"),
                col2 = c(" ", "green\twhite3"), stringsAsFactors = F)

dt

#                   col1         col2
# 1 green\twhite3\t3blue5             
# 2               green green\twhite3


ff = function(x) 
{
  res = vector()                                                             # create an empty vector to store counts for each element
  for (i in 1:length(x)){                                                    # iterate through each element
        res[i] = ifelse(x[i]==" ", 0, length(unlist(strsplit(x[i],"\t"))))   # if the element is space return 0, else split string by \t and count new strings
                        }
  return(res)                                                                # return the stored values
}


data.frame(sapply(dt, function(x) ff(x)))                                    # apply the function to all columns and save it as a data.frame

#     col1 col2
# 1    3    0
# 2    1    2
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
0

Use the yourstring.count("\t") function to get the number of tabs, add 1 to the value to get the number of words. If string is empty, output 0.

Sylver
  • 11
  • 1