
I have a CSV file with 11 columns containing content similar to this:

 SE Australia|PRM|2017-09-07T16:11:33|2641|-5537383165259899960|2017-09-07T16:12:17|"AU en2|networking-locator"|-|SC7_Electricians_Installer (only provides labor)|p-0715125|1

I am trying to use awk to separate the columns. The problem is that in some of the 10 million records, the separator (pipe) is part of a field's text. As you can see below, a pipe is included in the text "AU en2|networking-locator". Using the following command returns the wrong fields.

awk -F "|" '{print $4"_"$6"_"$7"_"$10}'

The result

2641_2017-09-07T16:12:17_"AU en2_p-0715125

The expected result:

2641_2017-09-07T16:12:17_"AU en2|networking-locator"_p-0715125

As you can see, "AU en2 is treated as a separate column, although it is part of "AU en2|networking-locator". How can I change the awk command to handle those columns?

pm1359

1 Answer


You need GNU awk for that. With gawk you can use the FPAT variable:

gawk '{print $4,$6,$7,$10}' OFS=_ FPAT='"[^"]+"|[^|]+' file

Using FPAT you can tell awk what a field looks like, instead of being limited to specifying a field delimiter.

In the above example we are saying that a field is either a " followed by one or more non-" characters and a closing ", or a sequence of non-| characters. Those rules are evaluated in order, which gives the first one higher precedence.

Output:

2641_2017-09-07T16:12:17_"AU en2|networking-locator"_p-0715125
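
If you want to check how FPAT segments the record, you can print every field together with its index (same FPAT as above; for the sample line, field 7 comes out as "AU en2|networking-locator" in one piece):

gawk '{for (i = 1; i <= NF; i++) print i": "$i}' FPAT='"[^"]+"|[^|]+' file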

PS: The above solution is slower than splitting by a fixed char. As your file is 10 million lines long, it might take very long to process.
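
If you want to measure the difference on your own data, one way (assuming a POSIX shell with time available) is to run both variants with the output discarded; note that the plain -F'|' split prints the wrong fields for quoted records and only serves as a speed baseline:

time gawk '{print $4,$6,$7,$10}' OFS=_ FPAT='"[^"]+"|[^|]+' file > /dev/null
time awk -F'|' '{print $4,$6,$7,$10}' OFS=_ file > /dev/null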

If the file contains quoted fields like "abc|xyz" only in position $7, and it is safe to assume that in this situation there is just a single | inside $7, then you can use this hack:

awk -F\| '$7~/"/{$7=$7"|"$8;$10=$11}{print $4,$6,$7,$10}' OFS=_ file

It should be much faster than the above solution, but it works only under the mentioned circumstances. You have been warned!
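
If quoted fields can appear in more than one position, here is a sketch in plain awk that generalizes the idea (my own variation, not limited to $7; it assumes quotes are balanced and that quoted fields contain no " inside them). It rejoins pieces that were split on an embedded | before printing:

awk -F'|' -v OFS=_ '{
    n = 0
    for (i = 1; i <= NF; i++) {
        f = $i
        # a piece that opens a quote without closing it was split
        # on an embedded pipe: glue the following pieces back on
        while (f ~ /^"[^"]*$/ && i < NF) { i++; f = f "|" $i }
        col[++n] = f
    }
    print col[4], col[6], col[7], col[10]
}' file

It still loops over every field, so measure it against the FPAT version before relying on it being faster.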

hek2mgl
  • It looks like awk using this parameter works very slowly with a huge amount of data. I started the command 20 minutes ago and it is still running. With plain awk, the same file is processed in 3 min. – pm1359 Apr 08 '18 at 12:22
  • Sure, it requires more effort to split by a complex regex than by a fixed char. You'll have to wait (independently of the programming language you use, and awk is already pretty fast). If you can change the process, make sure that you use a field delimiter which is not part of the data. – hek2mgl Apr 08 '18 at 12:30
  • @MaryamPashmi I added a faster solution. But please check the requirements for it. – hek2mgl Apr 08 '18 at 12:46
  • The first one might be a little faster if you set OFS and use commas between output fields rather than using string concatenation (a slow operation) with hard-coded underscores. – Ed Morton Apr 08 '18 at 13:12
  • Good point! Changed... – hek2mgl Apr 08 '18 at 21:06