
Below is the content of both files:

File 1:

257054
256986
257144

File 2:

257054|Guestroom|http://397_b.jpg|350|350||http://397/hotels/2000000/1330000/1321300/1321278/1321278_397_t.jpg|0
257057|Guestroom|http://398_b.jpg|350|350||http://398/hotels/2000000/1330000/1321300/1321278/1321278_398_t.jpg|0

I need a Bash command that compares the two files so that the output contains only

257054|Guestroom|http://397_b.jpg|350|350||http://397/hotels/2000000/1330000/1321300/1321278/1321278_397_t.jpg|0

I can do this with a normal for loop, but it is very slow. I need a faster solution using awk or sed.

Benjamin W.
Raghavan
  • Tried `grep -f file1 file2`? Although not a robust solution – P.... Mar 21 '17 at 13:21
  • I tried that, but it is not returning any output – Raghavan Mar 21 '17 at 13:23
  • You may have a version of grep that doesn't have the `-f` flag (although I thought that was pretty standard), but that solution works as @PS. described – Palpatim Mar 21 '17 at 13:30
  • partial or full match? on one field or several fields or across the whole line? regexp or string comparison? Be specific. – Ed Morton Mar 21 '17 at 14:20
  • Thanks PS, I found the reason: there are some stray whitespace characters in file1 that were preventing this grep command from working. Now it works, but my file contains close to 10 million entries, which makes this approach too slow. – Raghavan Mar 23 '17 at 11:45
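As a side note on the whitespace issue mentioned in the last comment: a common fix is to strip spaces, tabs, and carriage returns from the pattern file before feeding it to grep. A small sketch with made-up sample data (file names as in the question; assumes bash for the `<(...)` syntax):

```shell
# Simulate file1 with a stray trailing space, the kind of thing that
# silently breaks grep -f matching
printf '257054 \n256986\n' > file1
printf '257054|Guestroom|x.jpg|0\n257057|Guestroom|y.jpg|0\n' > file2

# Strip spaces, tabs, and CRs from file1 on the fly, then match literally
grep -F -f <(tr -d ' \t\r' < file1) file2
# 257054|Guestroom|x.jpg|0
```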

3 Answers


If the contents of file1 can only appear in the first position of file2, you can use fgrep:

$ cat file1
257054
256986
257144
$ cat file2
257054|Guestroom|http://397_b.jpg|350|350||http://397/hotels/2000000/1330000/1321300/1321278/1321278_397_t.jpg|0
257057|Guestroom|http://398_b.jpg|350|350||http://398/hotels/2000000/1330000/1321300/1321278/1321278_398_t.jpg|0
$ fgrep -f file1 file2
257054|Guestroom|http://397_b.jpg|350|350||http://397/hotels/2000000/1330000/1321300/1321278/1321278_397_t.jpg|0

Note that you can substitute fgrep with grep -F (the POSIX spelling). The -F mode treats the contents of file1 as a set of literal strings, one per line, rather than regular expressions. With purely numeric patterns, grep -f without -F happens to match the same lines, but -F is faster and stays correct should a pattern ever contain regex metacharacters.

In the event that the numbers from file1 could also occur elsewhere in file2 besides the beginning of a line, you can create a more explicit match by combining grep with, e.g., sed:

grep -f <(sed 's/.*/^&|/' file1) file2

This matches the numbers from file1 only when they appear at the beginning of a line followed by a pipe (|).
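To see what that process substitution feeds to grep, here is a small sketch (sample data shortened from the question; assumes bash for the `<(...)` syntax):

```shell
printf '257054\n256986\n257144\n' > file1
printf '257054|Guestroom|a.jpg|0\n257057|Guestroom|b.jpg|0\n' > file2

# sed turns each number N into the anchored pattern ^N|
sed 's/.*/^&|/' file1
# ^257054|
# ^256986|
# ^257144|

# grep then matches those numbers only at the start of a line,
# immediately before the first | delimiter (| is literal in BRE)
grep -f <(sed 's/.*/^&|/' file1) file2
# 257054|Guestroom|a.jpg|0
```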

bishop
  • but this will match anywhere in the line... given so many numbers present, potential for false match is there... – Sundeep Mar 21 '17 at 13:32
  • The grep solution is just plain wrong and wrt the grep+awk solution, you never need grep when you're using awk. – Ed Morton Mar 21 '17 at 14:21
  • @EdMorton If one has `awk`, one does not *need* `grep`, `join`, `paste`, `sed`, `nl`, `perl`, and much, much more. The presence of one god tool does not obligate one to use said god tool for all possible problems. Nevertheless, I made it grep+sed, which is a more "reasonable" combination. – bishop Mar 21 '17 at 14:44
  • `The presence of one god tool does not obligate one to use said god tool for all possible problems.` I didn't suggest that. All of the tools you listed are perfectly good tools for what they do. I said if you ARE using the "god tool" you don't need to add the "plumber tool" to help god out installing your kitchen sink. If you're using grep+sed you should be using awk instead. – Ed Morton Mar 21 '17 at 15:05
  • @EdMorton `If you're using grep+sed you should be using awk instead.` No. That's exactly my point. ["Robustness is the child of simplicity and transparency."](http://www.faqs.org/docs/artu/ch01s06.html) I'd rather have an easy-to-understand grep+sed than a hard-to-follow awk. – bishop Mar 21 '17 at 15:25
  • Yes but there's nothing forcing you to write hard-to-follow awk. Write easy-to-understand awk instead and reap the benefits of clarity, robustness, portability, efficiency, etc. that come with that approach. You're dreaming if you think your grep+sed+shell approach is quantitatively easier to understand than [@Inian's awk script](http://stackoverflow.com/a/42928521/1745001) which stores the file1 values in an array and then checks if the file2 values are in that array - could not be much simpler than that! Anyway, we won't get anywhere continuing this so - good luck with your programming. – Ed Morton Mar 21 '17 at 15:31
  • Someone who needs help solving this problem using grep likely will be unable to maintain an awk solution. Which tool to use is a function of both tool capability and user capability. – bishop Mar 21 '17 at 15:46
  • The `grep -w` (whole word) switch would simplify the code a little: `grep -w -f <(sed s/^/^/ file1) file2`. – agc Mar 22 '17 at 04:50

You can do this in Awk in one shot:

awk 'BEGIN{FS=OFS="|"}FNR==NR{file1[$0]; next}$1 in file1' file1 file2

While reading file1, store each line as a key of the array file1; then, while reading file2, print the lines whose first field ($1) is a key in that array.
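A quick way to try it end to end (sample files recreated from the question, with the long image URLs shortened for readability):

```shell
printf '257054\n256986\n257144\n' > file1
printf '257054|Guestroom|a.jpg|0\n257057|Guestroom|b.jpg|0\n' > file2

# FNR==NR is true only while reading the first file: store ids as keys.
# For the second file, print lines whose first |-field is a stored key.
awk 'BEGIN{FS=OFS="|"} FNR==NR{file1[$0]; next} $1 in file1' file1 file2
# 257054|Guestroom|a.jpg|0
```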

Inian

You could also use join:

$ join -t \| f1 f2
257054|Guestroom|http://397_b.jpg|350|350||http://397/hotels/2000000/1330000/1321300/1321278/1321278_397_t.jpg|0

man join educates us:

NAME
       join - join lines of two files on a common field

SYNOPSIS
       join [OPTION]... FILE1 FILE2

       -t CHAR
              use CHAR as input and output field separator
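One caveat not shown above: join expects both inputs sorted on the join field. For unsorted files you can sort on the fly (a sketch with shortened sample data; assumes bash for the `<(...)` syntax):

```shell
printf '257054\n256986\n257144\n' > f1      # note: not sorted
printf '257054|Guestroom|a.jpg|0\n257057|Guestroom|b.jpg|0\n' > f2

# Sort both sides on the join field first; LC_ALL=C forces plain byte
# order, which matches what join expects and is also faster
join -t'|' <(LC_ALL=C sort f1) <(LC_ALL=C sort -t'|' -k1,1 f2)
# 257054|Guestroom|a.jpg|0
```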
James Brown
  • Just to be noted: join requires sorted input files. Also, using the C locale has been shown to give a huge performance boost, up to 40%. Details: http://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash/42666456#42666456 – George Vasiliou Mar 21 '17 at 15:13