0

I have a data set with numeric data column which contains some text.

For example let's say I have a column of numbers from 1 to 10 but instead of 10 I have it in text as "ten".

I was trying to use unique() method on the column to identify the inconsistent data and clean the same.

The current data set is small so I am able to get away with it. But if the dataset is large it would be really difficult with my approach.

Is there a better way to do this?

Evgeny Minkevich
  • 2,319
  • 3
  • 28
  • 42

1 Answers1

1

Programming languages have no inherent concept that certain strings correspond to certain numbers, so you would have to programmatically parse each string and determine the corresponding number.

The best answer I could find for this part was here: Is there a way to convert number words to Integers? — but if you have a small number of discrete possible values your data can take on, say 1 to 10 inclusive, it's pretty feasible to create a dict and manually create a one-to-one mapping of strings to numbers. It's not the prettiest way, but it will work. Note that this is not a scalable solution as the range of possible values increases.

Once you have a mapping dictionary, which we'll call str_to_int_dict, you can simply do:

df[column_name] = df[column_name].apply(lambda x: str_to_int_dict[x])

Now, this is assuming that your strings are consistently formatted: that is, 10 always corresponds to "ten", and not some mix of "ten", "Ten", "tEN", " tEn ", etc. If you have that sort of eclectic mix, it might be good to instead chain .apply(lambda x: str_to_int_dict[x.lower().strip()] (or upper, it's about consistency).

chang_trenton
  • 827
  • 6
  • 10