
OK, sedAwkPerl-fu-gurus. Here's one similar to these (Extract specific strings...) and (Using awk to...), except that I need to take the number in columns 4-10 of each line of File A (a PO number from a sales order line item), use it to locate all related lines in File B, and print those lines to a new file.

File A (purchase order details) lines look like this:

xxx01234560000000000000000000 yyy zzzz000000

File B (vendor codes associated with POs) lines look like this:

00xxxxx01234567890123456789001234567890

Columns 4-10 in File A have a 7-digit PO number, which is found in columns 7-13 of file B. What I need to do is parse File A to get a PO number, and then create a new sub-file from File B containing only those lines in File B which have the POs found in File A. The sub-file created is essentially the sub-set of vendors from File B who have orders found in File A.

I have tried a couple of things, but I'm really spinning my wheels on trying to make a one-liner for this. I could work it out in a script by defining variables, etc., but I'm curious whether someone knows a slick one-liner to do a task like this. The two referenced methods put together ought to do it, but I'm not quite getting it.

noogrub

3 Answers

sed 's_^...\([0-9]\{7\}\).*_/^.\\{6\\}\1/p_' FIRSTFILE > FILTERLIST
sed -n -f FILTERLIST SECONDFILE > FILTEREDFILE

The first line generates a sed script from FIRSTFILE; the second line then uses that script to filter SECONDFILE. The two can also be combined into one line.
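
To make the two steps concrete, here is a minimal sketch run against the sample File A line from the question. The scratch directory and the two File B lines are made up for illustration, and the sed expression uses `[0-9]` instead of `\d` and doubled backslashes in the replacement so that the generated script keeps its `\{6\}` interval; treat it as one possible spelling, not the only one.

```shell
# Scratch directory so nothing real is overwritten (illustrative setup).
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# File A layout from the question: PO number 0123456 in columns 4-10.
printf '%s\n' 'xxx01234560000000000000000000 yyy zzzz000000' > FIRSTFILE

# Hypothetical File B lines with the PO number in columns 7-13.
printf '%s\n' 'AAAAAA0123456-vendor-1' 'BBBBBB7654321-vendor-2' > SECONDFILE

# Step 1: turn each File A line into a sed command such as /^.\{6\}0123456/p.
sed 's_^...\([0-9]\{7\}\).*_/^.\\{6\\}\1/p_' FIRSTFILE > FILTERLIST
cat FILTERLIST

# Step 2: run the generated script against File B, printing only matching lines.
sed -n -f FILTERLIST SECONDFILE > FILTEREDFILE
cat FILTEREDFILE
```

Only the vendor-1 line should end up in FILTEREDFILE, since it is the only one carrying PO 0123456 in columns 7-13.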

If the files are not that big you can do something like

awk 'BEGIN { # read the PO numbers (columns 4-10) of FIRSTFILE into an array
             while ((getline line < "FIRSTFILE") > 0) array[substr(line, 4, 7)] }
     substr($0, 7, 7) in array { print $0 }' SECONDFILE > FILTERED

Alternatively, you can do the following (but it will find the PO numbers anywhere on a line, not only in columns 7-13):

fgrep -f <(cut -b 4-10 FIRSTFILE) SECONDFILE 
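
The "anywhere on a line" caveat can be seen with a small sketch. The File B lines here are hypothetical, and a temporary pattern file stands in for the process substitution so the snippet also runs in a plain POSIX shell.

```shell
# Scratch directory with hypothetical sample data (File B lines are made up).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 'xxx01234560000000000000000000 yyy zzzz000000' > FIRSTFILE
# First line has the PO in columns 7-13; the second has the same digits further right.
printf '%s\n' 'AAAAAA0123456-vendor-1' 'BBBBBB9999999-0123456-x' > SECONDFILE

# Fixed-string match: finds the PO digits anywhere on the line.
cut -c4-10 FIRSTFILE > POLIST
grep -F -f POLIST SECONDFILE > LOOSE

# Column-anchored match: only lines with the PO in columns 7-13.
sed 's/^/^.\\{6\\}/' POLIST > ANCHORED_PATTERNS
grep -f ANCHORED_PATTERNS SECONDFILE > STRICT

wc -l < LOOSE    # counts lines matched by the fixed-string search
wc -l < STRICT   # counts lines matched by the anchored search
```

The fixed-string search picks up both lines, while the anchored patterns keep only the line whose PO sits in the expected columns.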
Zsolt Botykai

Here's a one-liner:

egrep -f <(cut -c4-10 A | sed -e 's/^/^.{6}/') B

It looks like the POs in file B actually start at column 8, not 7, but I made my regex start at column 7 as you asked in the question.

And in case there's the possibility of duplicates in A, you could increase efficiency by weeding those out before scanning file B:

egrep -f <(cut -c4-10 A | sort -u | sed -e 's/^/^.{6}/') B
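
Since the column offset may need adjusting, here is a rough sketch that keeps it in a variable. It uses the two sample lines from the question (with the File A line duplicated for illustration); the `skip` variable and the PATTERNS and SUBSET file names are assumptions, and a pattern file replaces the process substitution so it runs in any POSIX shell.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
# Sample lines copied from the question; the duplicate File A line is added for illustration.
printf '%s\n' 'xxx01234560000000000000000000 yyy zzzz000000' \
              'xxx01234560000000000000000000 yyy zzzz000000' > A
printf '%s\n' '00xxxxx01234567890123456789001234567890' > B

skip=7   # characters before the PO in File B; 7 means the PO starts at column 8

# Unique PO numbers from columns 4-10 of A, each prefixed with ^.{skip}
cut -c4-10 A | sort -u | sed -e "s/^/^.{$skip}/" > PATTERNS
cat PATTERNS

# Extended-regex match of the anchored patterns against File B.
grep -E -f PATTERNS B > SUBSET
cat SUBSET
```

With `skip=7` the sample File B line matches; with `skip=6` it would not, which is the off-by-one noted above.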
dg99
  • Rock and roll! The second one works like a charm - you were right, my column count was off but that's easy to adjust. Thanks for the fast answer. Cut! Cut! I need to remember cut... – noogrub Jul 03 '14 at 19:35

Another way using only grep:

grep -f <(grep -Po '^.{3}\K.{7}' fileA) fileB

Explanation:

  1. -P for perl regex
  2. -o to select only the match
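  3. \K resets the start of the reported match, so the first three characters must match but are not printed (it behaves like a lookbehind)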
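
As a quick check of what the inner grep extracts, this sketch feeds it the sample File A line from the question. It assumes GNU grep built with PCRE support (needed for `-P`); the `po` variable is just for illustration.

```shell
# Sample File A line from the question; keep only what follows the first 3 characters,
# limited to 7 characters (columns 4-10).
po=$(printf '%s\n' 'xxx01234560000000000000000000 yyy zzzz000000' | grep -Po '^.{3}\K.{7}')
echo "$po"
```

Note that the outer `grep -f` then searches for these numbers anywhere on a File B line, so it is not restricted to particular columns.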
Tiago Lopo