How to clean data without pandas or numpy?

Question

I have to use Python to clean data for easier analysis. However, I cannot use pandas in this case. The requirement and expectation are as below:

actual = preprocess([
            ('Survived', 'Pclass', 'Name', 'Gender', 'Age', 'Fare'),
            ('no', '3', 'Braund Mr. Owen Harris', 'male', '22', '7.25'),
            ('Dead', '3', 'Braund Ms. Maria', 'Female', '21', ''),
            ('Yes', '1', 'Cumings Mrs. John Bradley (Florence Briggs Thayer)', 'F', '38', '71.28'),
            ('', '3', 'Vander Planke Miss. Augusta', 'female', '', ''),
            ('Dead', '4', 'Lennon Mr. Denis', 'male', '13', '15.5')])

expected = (
                ('Survived', 'Pclass', 'Name', 'Gender', 'Age', 'Fare'),
                [
                    (False, 3, 'Braund Mr. Owen Harris', 'male', 22.0, 7.25),
                    (False, 3, 'Braund Ms. Maria', 'female', 21.0, 25.0),
                    (True, 1, 'Cumings Mrs. John Bradley (Florence Briggs Thayer)', 'female', 38.0, 71.28),
                    ('', 3, 'Vander Planke Miss. Augusta', 'female', '', 25.0), 
                    (False, 4, 'Lennon Mr. Denis', 'male', 13.0, 15.5)
                ]
           )

Can you please give some advice in this case?

Not enough info. Why are the 4th and the 5th data lines not included in the expected output? What are your rules for converting the 1st column to a boolean value? Etc... — , Nov 02 '20 at 17:09
Sorry, I probably missed the last row. The rules are: (1) Survived - boolean, (2) Pclass - int, (3) Name - string, (4) Gender - string (male and female only), (5) Age - float, (6) Fare - float — Pham Phuong Anh, Nov 02 '20 at 17:11
I can figure out that the 1st column is to be converted into a boolean. What I was asking is your rules for changing the strings to True or False. BTW, you're missing two rows in your expected output, not one. — , Nov 02 '20 at 17:18
Yes for True and no/dead for false. ('', 3, 'Vander Planke Miss. Augusta', 'female', '', 25.0), (False, 4, 'Lennon Mr. Denis', 'male', 13.0, 15.5)]) — Pham Phuong Anh, Nov 03 '20 at 08:36
What are the rules for substituting missing values? In your example, a missing 'Fare' was replaced by '25.0'. Is that a general rule? What's with missing ages or missing 'Survived' status? — buddemat, Nov 03 '20 at 12:25
This is quite broad/vague. Please see [ask], [help/on-topic]. — AMC, Nov 04 '20 at 02:05

buddemat · Answer 1 · 2020-11-05T08:53:16.543

Some information is missing (e.g. how to handle missing values? or should incomplete entries be filtered out? do you need the result to be sorted or not?), but I'll attempt an answer nonetheless:

To achieve your desired outcome of converting the type and/or content of your tuple 'columns', you need to map and/or typecast the old values to new ones.

Mapping the strings

You can use different approaches to map different input values to a range of output values. I'll use two in my answer: (1) if ... else statements and (2) a Python way of doing a switch ... case statement.

In order to get a bool value from the different possible entries for 'Survived', I used approach (2). For this, you set up a dictionary with your mappings and then get the appropriate entry from it for each of your candidates (see Replacements for switch statement in Python?). You can combine this with the string lower() function so that you can disregard case (How do I lowercase a string in Python?). You can also add a default value that is used when the key cannot be found in the dict. In my example below, I use None.

Example:

entry = 'NO'

switcher_survived = {
    'no': False,
    'dead': False,
    'yes': True
}

result = switcher_survived.get(entry.lower(), None)

The same approach can be used to set the gender based on the different input possibilities.
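As a minimal sketch of that, the gender lookup could look like the following. The helper name `normalize_gender` and the extra `strip()` call are assumptions added for illustration, not part of the original requirement.

```python
# Lookup table: lower-cased input spelling -> normalized value.
switcher_gender = {
    'male': 'male',
    'm': 'male',
    'female': 'female',
    'f': 'female',
}

def normalize_gender(entry, default=None):
    """Return 'male' or 'female', or `default` for unknown spellings.

    Hypothetical helper; strip() is an added assumption so that
    surrounding whitespace does not break the lookup.
    """
    return switcher_gender.get(entry.strip().lower(), default)

print(normalize_gender('Female'))  # female
print(normalize_gender('F'))       # female
print(normalize_gender(''))        # None
```

Unknown spellings fall through to the default instead of raising, which makes it easy to spot them later.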

Typecasting

For the numbers, you can simply cast them to the desired type. However, this only works if the string contains a number that can successfully be cast. Note that in your example, you have entries with an empty string, which will lead to a ValueError when you try to cast them. So you need to check for that and may again want to default to some value. I use nan = float('NaN') as this is a nice way of maintaining the correct type without using additional packages (see Assigning a variable NaN in python without numpy).

Example:

nan = float('NaN') 

entry = '2.5'

result = (float(entry) if entry != "" else nan)

I'm using a one-line if-then-else statement here (see Putting a simple if-then-else statement on one line), because that is beneficial for the full example at the end.
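If the input may also contain non-numeric text rather than only empty strings, a try/except variant is one possible alternative to the explicit check. This is only a sketch; the helper name `to_float` is hypothetical.

```python
import math

nan = float('NaN')

def to_float(entry, default=nan):
    """Cast `entry` to float; return `default` if the cast fails.

    Hypothetical helper: float('') and float('abc') both raise
    ValueError, so both cases end up in the except branch.
    """
    try:
        return float(entry)
    except ValueError:
        return default

print(to_float('2.5'))              # 2.5
print(math.isnan(to_float('')))     # True
print(to_float('', 25.0))           # 25.0
```

The trade-off is that a typo such as 'abc' is silently replaced by the default, whereas the explicit empty-string check would still raise for it.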

Putting it together

actual = [
        ('Survived', 'Pclass', 'Name', 'Gender', 'Age', 'Fare'),
        ('no', '3', 'Braund Mr. Owen Harris', 'male', '22', '7.25'),
        ('Dead', '3', 'Braund Ms. Maria', 'Female', '21', ''),
        ('Yes', '1', 'Cumings Mrs. John Bradley (Florence Briggs Thayer)', 'F', '38', '71.28'),
        ('', '3', 'Vander Planke Miss. Augusta', 'female', '', ''),
        ('Dead', '4', 'Lennon Mr. Denis', 'male', '13', '15.5')]


nan = float('NaN')

switcher_survived = {
    'no': False,
    'dead': False,
    'yes': True
}

switcher_gender = {
    'male': 'male',
    'm': 'male',
    'female': 'female',
    'f': 'female'
}

def process(lst):
    result = []
    # Skip the header row (index 0) and convert each data row.
    for entry in lst[1:]:
        # Note: the name `tuple` shadows the built-in type inside this function.
        tuple = (switcher_survived.get(entry[0].lower(), ''),
                 int(entry[1]),
                 entry[2],
                 switcher_gender.get(entry[3].lower(), ''),
                 (float(entry[4]) if entry[4] != "" else ''),
                 (float(entry[5]) if entry[5] != "" else 25.0)
                )
        result.append(tuple)
    # Return the untouched header together with the converted rows.
    return [lst[0], result]

expected = process(actual)

print(expected)

Some remarks:

In this final example, I have changed the default value for the column 'Fare' to 25.0 to conform with your expected outcome.
For the same reason, I have also changed the default values for 'Survived', 'Gender' and 'Age' to the empty string '' instead of None and NaN, respectively. Please note that this violates your own requirements, as the empty string is obviously not of type bool or float. This may have implications when you work with the data later. In particular, the empty string in column 'Survived' may be silently evaluated to False.
To filter out incomplete data, you could change the default values back to None and NaN and only add complete rows to your final data set. For that, you could check if any of the tuples' fields are None (see What is the best way to check if a tuple has any empty/None values in Python?):
```
     if not any(map(lambda x: (x is None) or (x is nan), tuple)):
         result.append(tuple)
```
If you wanted to sort the list by an arbitrary column, you could use a lambda function as the sort key (see Syntax behind sorted(key=lambda: ...)) before you return the result. E.g. to sort by the name:
```
 result = sorted(result, key=lambda tuple: tuple[2])
```
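The filtering and sorting remarks can be combined into one self-contained sketch. The sample rows and the helper `is_missing` are made up for illustration. Note that `x is nan` only matches the very same NaN object, whereas `math.isnan` matches any NaN, so the sketch uses the latter.

```python
import math

nan = float('NaN')

# Hypothetical processed rows, using None / NaN as the "missing" defaults.
rows = [
    (False, 3, 'Braund Mr. Owen Harris', 'male', 22.0, 7.25),
    (None, 3, 'Vander Planke Miss. Augusta', 'female', nan, nan),
    (True, 1, 'Cumings Mrs. John Bradley (Florence Briggs Thayer)', 'female', 38.0, 71.28),
    (False, 3, 'Braund Ms. Maria', 'female', 21.0, nan),
]

def is_missing(value):
    """True for None and for any float NaN (not only the `nan` object itself)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# Keep only rows without missing fields, then sort by the 'Name' column.
complete = [row for row in rows if not any(is_missing(v) for v in row)]
complete.sort(key=lambda row: row[2])

for row in complete:
    print(row)
```

With these sample rows, only the two rows without any missing field remain, ordered by name.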
