Find lines with same characters set

Question

I have a situation like this:

Car Driver
Cat Mouse 
Door House 
Driver Car

I need help with a regex to find all lines with the same set of characters or words, no matter how they are placed in the line:

Car Driver
Driver Car

Edited list:

A0JLS3 Q9NUA2 <
A0JLT2 Q9Y3C7
A0N0L5 P26441
A0N0Q1 O00626
A0N0Q1 P35626
A0PJF8 P27361
Q9NUA2 A0JLS3 <

Would `A0JLS3 Q9NUA2` be considered an "anagram" of `AJSQNA 0L39U2`, or is it just word re-ordering you're after? – ClickRick, Apr 13 '14 at 09:33
These lines can be considered anagrams, yes. And yes, there are always 2 words per line. Think of it as a representation of handshakes in a room full of people: you can shake hands with only one person, and at the same time that person is shaking hands with you. What matters to me is only whether there is a handshake or not. Sam <> John and John <> Sam are the same to me, and I only need one interaction (handshake). Here the 'words' represent proteins, and if they are on the same line, that means they interact. – Maximus, Apr 13 '14 at 10:49

score 0 · Answer 1 · edited May 23 '17 at 12:05

I'm not sure exactly what you are trying to achieve. If you're looking for all lines containing both of the words Car and Driver, you can mark all lines containing this regular expression:

Car Driver|Driver Car

Here's a guide on regular expressions in Notepad++: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions

And consider taking a look at the Stack Overflow Regular Expressions FAQ for some more useful information.

answered Apr 11 '14 at 02:38

aliteralmind

I have many lines, far apart in the text, in which words or letters are scrambled, like house and, a few lines later, useoh, or mouse_house and house_mouse. I want to find all those lines. – Maximus Apr 11 '14 at 02:45
Your `mouse_house` example fits my answer just fine. As for scrambled letters, it can also be done in the same way, as long as there is a *known* and *small* number of letters: `mouse|ousem|usemo|semou|emous|...`. There are `5!` (`5*4*3*2*1`), or `120`, possible orderings of five distinct letters. You could also do it with some *very* complicated regexes, if you really need to. – aliteralmind Apr 11 '14 at 02:56
Check out this question, which is what I mean: http://stackoverflow.com/questions/22152874/word-made-up-of-exactly-4-different-letters-using-regular-expressions – aliteralmind Apr 11 '14 at 02:59
Thanks for the help. How can I write Car Driver|Driver Car as a general regex? I have many of these examples; to be honest, I need to check more than 150 000 lines for this kind of duplicate. – Maximus Apr 12 '14 at 10:20
(?:([A-Za-z]+) ([A-Za-z]+)|\2 \1) – aliteralmind Apr 12 '14 at 11:24
Consider trying different stuff out yourself. I'm still not certain of your requirements. The bottom section in the [FAQ](http://stackoverflow.com/a/22944075/2736496) has a list of online regex testers. – aliteralmind Apr 12 '14 at 21:00
I have been trying for some time. I updated the list of examples and marked them with <. These are protein interactions, so it is the same whether a interacts with b or b with a; I need to find these lines to remove the redundant pairs. Thank you very much for your patience and time. – Maximus Apr 13 '14 at 09:24

Casimir et Hippolyte · Accepted Answer · 2014-04-15T03:28:09.363

EDIT: after taking a look at your file, it seems that there is one tab character after the first word and a variable number of tab characters after the second, so you must change the pattern to:

^(\w+)\h+(\w+)\h*$(?=(?>\R.*)*?\R(?:\1\h+\2|\2\h+\1)\h*$)

where \h stands for a horizontal whitespace character.

Since you seem to have huge files, and I don't see how to avoid a reluctant quantifier in the lookahead assertion, you can try this modified pattern, in which all quantifiers are possessive (where possible) and all groups are atomic. It seems to be a little faster:

^(\w++)\h++(\w++)\h*+$(?=(?>\R.*+)*?\R(?>\1\h++\2|\2\h++\1)\h*+$)

Previous answer:

You can use this pattern:

^(\w+) (\w+)$(?=(?>\R.*)*?\R(?:\1 \2|\2 \1)$)

This will find lines that have a "duplicate line" (a line with the same two words) later in the text. If you use it to remove duplicates, keep in mind that this will preserve the last occurrence and remove the earlier ones.
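To see which occurrence the pattern selects, here is a minimal sketch in Python. It is an assumption-laden port, not the Notepad++ pattern itself: Python's `re` module has no `\R`, so `\n` is used instead, and the atomic group is replaced by a plain non-capturing group. The function names are made up for illustration.

```python
import re

# Assumption: a port of the pattern to Python's re module, not the Notepad++
# original. \R becomes \n and the atomic group becomes a plain non-capturing
# group, since re has no \R.
LATER_DUPLICATE = re.compile(
    r"^(\w+) (\w+)$(?=(?:\n.*)*?\n(?:\1 \2|\2 \1)$)",
    re.MULTILINE,
)

def lines_with_later_duplicate(text):
    """Return every line whose two words reappear on a later line, in either order."""
    return [m.group(0) for m in LATER_DUPLICATE.finditer(text)]

def remove_earlier_duplicates(text):
    """Blank out the matched lines, then drop the empty lines left behind."""
    blanked = LATER_DUPLICATE.sub("", text)
    return [line for line in blanked.split("\n") if line]

sample = "Car Driver\nCat Mouse\nDoor House\nDriver Car"
print(lines_with_later_duplicate(sample))  # ['Car Driver']
print(remove_earlier_duplicates(sample))   # ['Cat Mouse', 'Door House', 'Driver Car']
```

Only the first of the two lines matches, so deleting the matches leaves `Driver Car`, the last occurrence. The lookahead rescans the rest of the text for every line, so this approach may be slow on very large inputs.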

pattern details:

^(\w+) (\w+)$ : this describes a whole line (note the anchors for the start ^ and end $ of the line) and puts each word in a capturing group (group 1 and group 2).

The second part of the pattern checks whether there is a "similar line" (a line with the same words) later in the text. Since it is embedded in a lookahead assertion ((?=...), i.e. "followed by"), this part isn't included in the match result.

(?>\R.*)*?: the lines up to the duplicate. \R stands for a line break (CRLF or LF), and .* matches all characters except newlines. The group is repeated with a lazy quantifier to stop before the first duplicate line. (Note that this works with a greedy quantifier too; the best choice depends on what your document looks like. For example, if duplicates are often at the end of the document, a greedy quantifier is a better choice.)

(?:\1 \2|\2 \1) describes the two possibilities using backreferences to groups 1 and 2.

$ is added to ensure that the last word is whole (otherwise something like A0N0L5 P26441 ... A0N0L5 P26441XXX would succeed).
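For files with tens of thousands of lines, the same idea (two lines are duplicates if they contain the same two words in either order) can also be checked outside the editor. The following is a sketch of a different technique, not the regex above: each line is reduced to an order-independent key and remembered in a set. The function name `drop_redundant_pairs` is hypothetical, and the sketch assumes the words are separated by spaces or tabs.

```python
def drop_redundant_pairs(lines):
    """Keep the first line seen for each unordered pair of words."""
    seen = set()
    kept = []
    for line in lines:
        words = line.split()  # splits on any run of spaces or tabs
        if len(words) != 2:
            kept.append(line)  # not a two-word line: leave it untouched
            continue
        key = frozenset(words)  # same key for "A B" and "B A"
        if key in seen:
            continue  # this pair was already kept earlier
        seen.add(key)
        kept.append(line)
    return kept

interactions = [
    "A0JLS3\tQ9NUA2",
    "A0JLT2\tQ9Y3C7",
    "A0N0L5\tP26441",
    "A0N0Q1\tO00626",
    "A0N0Q1\tP35626",
    "A0PJF8\tP27361",
    "Q9NUA2\tA0JLS3\t\t",
]
for line in drop_redundant_pairs(interactions):
    print(line)
```

Unlike the regex, this keeps the first occurrence of each pair rather than the last, and it makes a single pass over the lines instead of rescanning the rest of the text for each one.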

edited Apr 15 '14 at 03:28

answered Apr 13 '14 at 11:45

Thanks, I tested it in my text and at http://regexr.com/ but it doesn't work; it says there is a problem with the second ? sign ...(?>).... – Maximus Apr 13 '14 at 12:14
@Maximus: It's because regexr.com doesn't support all the features of notepad++. It works with notepad++ (tested with v6.5.5). If you want a more appropriate online tester, you can use regex101.com (with `mg` modifiers) – Casimir et Hippolyte Apr 13 '14 at 12:16
@Maximus: You can make it work with regexr.com (or javascript) with this modified pattern: `\b(\w+) (\w+)(?=(?:\r?\n.*)*?\r?\n(?:\1 \2|\2 \1)\b)` – Casimir et Hippolyte Apr 13 '14 at 12:23
I did, but for some reason it did not work. I tested a few of my files and made files with examples of the problem, but no go. I'm trying to figure out why. – Maximus Apr 13 '14 at 12:26
@Maximus: 1) The pattern is written for exactly one space between the two words and no leading or trailing whitespace; perhaps there are other or different whitespace characters (tabs) in your lines. 2) Check that your version of np++ is not too old. 3) Check that the regex radio button is chosen and that the dotall checkbox is unchecked. – Casimir et Hippolyte Apr 13 '14 at 12:36
Tried. I have the latest version of notepad++ (6.5.5). I deleted the 2 whitespace characters between the words and replaced them with one. I am not sure about the buttons, but I tried with ". matches newline" both checked and unchecked, and with all the other boxes in the find window, but no go. Maybe I am cursed :P. – Maximus Apr 13 '14 at 13:13
The main problem is that you didn't give a representative example of the data you are trying to deal with! Replace all the spaces in the pattern with `\h+` – Casimir et Hippolyte Apr 13 '14 at 13:20
I can send you the files via e-mail; mine is nevenusma@gmail.com. I have more than 10 of these files, and they contain tens of thousands of these 2-word lines. PS: I tried \h+, no go again. But thanks for the help, man; I feel sorry for bothering you this much. – Maximus Apr 13 '14 at 13:24
@Maximus: I have edited my answer, this should work now. – Casimir et Hippolyte Apr 15 '14 at 02:49
