0

I have a 100k+ row spreadsheet with duplicates that aren't quite duplicates. For example, one cell will have the following content:

ASP.NET.Site.Performance.Secrets..Simple.and.Proven.Techniques.to.Quickly.Speed.up.Your.ASP>.NET.Web.Site

Sometimes the following cell (or 2, or 3) will have the same exact content except it will have a number at the end:

ASP.NET.Site.Performance.Secrets..Simple.and.Proven.Techniques.to.Quickly.Speed.up.Your.ASP>.NET.Web.Site.1

ASP.NET.Site.Performance.Secrets..Simple.and.Proven.Techniques.to.Quickly.Speed.up.Your.ASP>.NET.Web.Site.2

How can I efficiently eliminate these duplicates? Thanks for any help you can offer.

Community
  • 1
  • 1
  • Is the number always one digit or can it be more? – EndlessLoop Feb 25 '14 at 23:29
  • Is there a rule you can provide us for what the extensions could be? I understand ".1", ".2", but what if there are more than 9? Is it ".10" or "1.0", etc? – Enigmativity Feb 25 '14 at 23:45
  • Always one as far as I can tell. I haven't examined all of the rows obviously, but it's always followed the pattern above and I've never seen it surpass ".3" – user3304229 Feb 25 '14 at 23:59
  • I have no idea if 10 would be .10 or 1.0. Another potential issue is that there are valid instances of ".1" ".2" etc. as every term in each cell is separated by a dot, as seen in my examples above, and numbers will be in some these book titles. – user3304229 Feb 26 '14 at 00:04

1 Answers1

1

Assuming the data is in column A, Create a new calculated column as follows:

=IF(ISNUMBER(VALUE(RIGHT(A2,LEN(A2)-FIND("@",SUBSTITUTE(A2,".","@",(LEN(A2)-LEN(SUBSTITUTE(A2,".","")))/LEN(".")))))),LEFT(A2,FIND("@",SUBSTITUTE(A2,".","@",(LEN(A2)-LEN(SUBSTITUTE(A2,".","")))/LEN(".")))-1),A2)

It's ugly. The formula checks if the characters after the last period are a number. If so, it strips them and the period off. If not, it just returns the original field.

Credit to Excel: last character/string match in a string for how to get the last occurrence of a character in a string.

EDIT: Based on it always being a single digit, the function could be simplified.

=IF(AND(ISNUMBER(VALUE(RIGHT(A1,1))),MID(A1,LEN(A1)-1,1)="."),LEFT(A1,LEN(A1)-2),A1)

This checks that the last character is a number AND that the second last character is a "."

Community
  • 1
  • 1
Ken
  • 1,156
  • 1
  • 6
  • 8
  • This would be ideal as I only want to strip them if they are the last characters in the cell. I will try this tonight or tomorrow and let you know how it goes. Thanks! – user3304229 Feb 26 '14 at 00:05
  • Just be careful with this formula if any of your source strings contains a `@` as this throws it. You could try replacing the `@` with a non-standard character, like `ʩ` for example. – Enigmativity Feb 26 '14 at 00:58
  • Hey, I tried the second formula and it seems to work. It gives me a new column, C, and the duplicates are now exact duplicates. However, when I remove duplicates, it removes data from column B. I don't get why it's doing that?? Thanks though! – user3304229 Feb 26 '14 at 15:18
  • I'm not sure what you mean. I just tried it with the sample data, and it deleted the two duplicate rows. Did you tell it that only the calculated column contained duplicates, and you had selected the entire table? – Ken Feb 26 '14 at 15:23
  • I'm not sure what's happening. The calculated column, C, has no duplicates. But my original data has values ranging from .net-zoho in column B. After I remove the duplicates, column B ends with "r." I want the original column B values attached to the new content in column C. Sorry if I'm not explaining this very well. – user3304229 Feb 26 '14 at 15:41
  • Sorry, I am not following what you mean. – Ken Feb 26 '14 at 16:03
  • When I tried to remove duplicate rows, it was mismatching data between the columns. However, I went to the 'advanced' tab under sort and filter and checked 'unique records only' and now I have what I need. Thanks again! – user3304229 Feb 26 '14 at 18:38
  • Glad I could help. Can you accept the answer to close it off? – Ken Feb 26 '14 at 18:54