
I have a CSV file full of lines such as this: 8;;

Grote schoudertas met gekleurde borduursels &nbsp"Twee Hanen"

De tas is gemaakt van een stijf vilt met een dikte van 4 mm waardoor deze goed zijn vorm houdt
Aan de achterkant heeft de tas een vak met ritssluiting
De voering van de tas is van stof
Binnenin is een afsluitbaar vak met een rits
Ook is er een vak voor de telefoon
De tas is ruim en praktisch

AFMETINGEN:

Hoogte met handvaten: 46 cm (verstelbaar 7 cm)
Hoogte: 34 cm
Breedte in het midden: 42 cm
Bodemmaat: 30 x 10 cm&nbsp
Schouderriem lengte instelbaar van 55 cm tot 130 cm
Gebruikte materialen: vilt en kunstleer
Productiemethode: handwerk

;56.95;Vilten tas met twee hanen in kleur http://staging.tassenmagazijn.nl/media/catalog/product/cache/0/image/a94bc919ee025799dd7ec1f1b7884918/1/0/10_vilten_tas_2b_2.jpg;http://staging.tassenmagazijn.nl/media/catalog/product/2/0/10_vilten_tas_2c_1.jpg;http://staging.tassenmagazijn.nl/media/catalog/product/2/0/10_vilten_tas_2d_1.jpg;;;;;;;Fixed;New;Send;True;

The included files are given with their full path, and I just want the filename.

So http://staging.tassenmagazijn.nl/media/catalog/product/cache/0/image/a94bc919ee025799dd7ec1f1b7884918/1/0/10_vilten_tas_2b_2.jpg; becomes 10_vilten_tas_2b_2.jpg;

As you can see, the structure of the path is not always the same. Is there some kind of Linux command I can use?

Peter
  • Are all paths/files (`http://path/files1.jpg;http://path/files2.jpg`) on one line, separated by `;`, or on separate lines? They look to be on the same line, but I'm not sure if that is an editing problem in the question or real. – David C. Rankin Jul 17 '15 at 08:44
  • Both: there are 4 paths/files in one line, and several of those lines. – Peter Jul 17 '15 at 08:45

3 Answers


Assuming that all of your paths are of the form http://<anything>/<filename>, then

sed 's~http://.*/~~' <file>

will transform e.g.:

http://staging.tassenmagazijn.nl/media/catalog/product/cache/0/image/a94bc919ee025799dd7ec1f1b7884918/1/0/10_vilten_tas_2b_2.jpg

to

10_vilten_tas_2b_2.jpg
FelixJN
  • Wow, great, this works like a charm and shows the simplicity and power of sed. – Peter Jul 17 '15 at 13:19
  • Sorry, it looks okay, but my 3 images are reduced to just one filename. Is it possible to use this function but retain all the filenames (in this case 3)? Now http://[@@@]/filename1.jpg;http://[@@@]/filename2.jpg;http://[@@@]/filename3.jpg; becomes just filename3.jpg, but I would like filename1.jpg;filename2.jpg;filename3.jpg; – Peter Jul 17 '15 at 13:45
  • Not as simple with `sed`, but `sed -e 's~http://~\n&~g' file | sed 's~http://.*/~~' | sed ':a;N;$!ba;s/;\n/;/g'` will a) prefix each http:// with a newline, b) remove the URL part, c) remove the intermediate newlines again. `sed`'s * is greedy, and I'd suggest alternatives with non-greedy pattern matching; see the answers to this problem: http://stackoverflow.com/questions/1103149/non-greedy-regex-matching-in-sed (a single-pass variant is also sketched after this comment thread). – FelixJN Jul 17 '15 at 14:09
  • Okay, this is much better: 1. the paths are all removed; 2. sadly, it ends up in one long line instead of several lines. – Peter Jul 17 '15 at 20:31
  • I don't understand "one long line" here. Is the command too ugly (where I agree - `sed` is not the best solution here, as said before) or did you want the filenames in separate lines (which can be fixed) or is the original text file changed? – FelixJN Jul 17 '15 at 20:33
  • Sorry, my mistake (my English is not that good). I meant that instead of several lines, all the lines are placed in one line (concatenated): instead of line1 (newline) line2 (newline) line3, it becomes line1line2line3 (newline). – Peter Jul 17 '15 at 20:44
  • I'm sorry, I still don't understand: A) is the whole output text written in a single line (i.e. "Grote schoudertas met ..." in a single line)? If so, please check that you did not forget any part of `sed ':a;N;$!ba;s/;\n/;/g'`. Or B) did you want "picture1.jpg; (newline) picture2.jpg; (newline) picture3.jpg"? – FelixJN Jul 19 '15 at 11:25
  • Fixman, I'm so sorry, but I think I did something wrong, because after another try of your code it all worked just fine. Many thanks and apologies. – Peter Jul 19 '15 at 17:38
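
For reference, the multi-step pipeline from the comments can be collapsed into a single substitution. This is only a sketch, assuming the URLs themselves never contain a semicolon: `[^;]*` keeps each match inside its own `;`-separated field, and the `g` flag applies the substitution to every URL on a line.

sed 's~http://[^;]*/~~g' <file>

On the sample line from the question, the three image columns then become 10_vilten_tas_2b_2.jpg;10_vilten_tas_2c_1.jpg;10_vilten_tas_2d_1.jpg; and the rest of each line is left untouched.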

This will largely depend on your shell and the tools it has available to read up to the point of a delimiter, ';' in this case. If you have BASH or some similar shell, then the solution is trivial with substring removal:

#!/bin/bash

# read each ';'-delimited field; if it is long enough and contains 'http',
# strip everything up to the last '/' and print the remaining filename
while read -d ';' -r line; do
    ((${#line} >= 12)) && [[ $line =~ http ]] &&
        printf "%s\n" "${line##*/}"
done < "$1"
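
As a quick illustration of the parameter expansion used above, `${line##*/}` deletes the longest prefix matching `*/`, i.e. everything up to and including the last slash (the URL is taken from the question):

line='http://staging.tassenmagazijn.nl/media/catalog/product/2/0/10_vilten_tas_2c_1.jpg'
printf '%s\n' "${line##*/}"    # prints 10_vilten_tas_2c_1.jpg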

If you are limited to a POSIX shell, then the solution takes quite a bit more work, and you will rely on sed to parse the URL once it is isolated. Isolating each URL from a line containing multiple URLs separated by semicolons is a bit tricky: in a POSIX shell you basically have to inchworm down each line, reading a character at a time (which is slow on large files). Note that the `expr length`, `expr index` and `expr substr` operators used below are GNU extensions, so strictly speaking this requires GNU expr. The following validates that each string considered as a URL begins with http:

#!/bin/sh

url=""
while read -r line; do 
    len=`expr length "$line"`
    urlstart=`expr index "$line" "h"`
    line=`expr substr "$line" "$urlstart" "$len"`
    while [ $len -gt 0 ]; do 

        c=`expr substr "$line" 1 1`

        if [ x$c = 'x;' ]; then
            if [ `expr length "$url"` -ge 12 ]; then
                # printf "url: %s\n" "$url"
                ulen=`expr length "$url"`
                urlstart=`expr index "$url" "h"`

                if [  $urlstart -gt 0 ]; then

                    if [ $urlstart -gt 1 ]; then
                        url=`expr substr "$url" "$urlstart" "$ulen"`
                    fi
                    urlflag=0

                    while [ `expr substr "$url" 1 4` != http ]; do
                        url=`expr substr "$url" 2 "$ulen"`
                        urlstart=`expr index "$url" "h"`
                        if [ "$urlstart" -eq 0 ]; then
                            urlflag=1
                            break
                        fi
                        url=`expr substr "$url" "$urlstart" "$ulen"`
                        ulen=`expr length "$url"`
                        if [ $ulen -le 12 ]; then
                            urlflag=1
                            break
                        fi
                    done

                    if [ $urlflag -ne 1 ]; then
                        if [ `expr substr "$url" 1 4` = http ]; then 
                            echo "$url" | sed -e 's/http.*\///'
                        fi
                    fi

                fi
            fi
            url=""

        else
            url="$url$c"
        fi
        line=`expr substr "$line" 2 "$len"`
        len=`expr length "$line"`
    done
done <"$1"

If you can ensure that only lines longer than some constant are the URLs, then you can dramatically improve the performance of the POSIX solution by not searching for and validating that each string contains http. To parse the URLs based on length alone, something similar to the following will work:

#!/bin/sh

while read -r line; do 

    printf "\n%s\n\n" "$line"
    len=`expr length "$line"`
    sidx=`expr index "$line" ";"`

    while [ $len -gt 0 ]; do 

        if [ $sidx -gt 0 ]; then 
            let end=sidx-1
            str=`expr substr "$line" 1 "$end"`
            slen=`expr length "$str"`
            if [ $slen -gt 12 ]; then
                echo "$str" | sed -e 's/^.*\///'
            fi
        else
            if [ $len -gt 12 ]; then
                echo "$line" | sed -e 's/^.*\///'
            fi
            break;
        fi

        let start=sidx+1
        line=`expr substr "$line" "$start" "$len"`

        len=`expr length "$line"`
        sidx=`expr index "$line" ";"`

    done

done <"$1"

The length of 12 was simply arrived at as the length of the shortest possible URL for a jpeg file (e.g. http://a.jpg).

In all cases, the results are the same for the example file you have given:

Input

$ cat dat/httppaths.txt
;56.95;Vilten tas met twee hanen in kleur http://staging.tassenmagazijn.nl/\
media/catalog/product/cache/0/image/a94bc919ee025799dd7ec1f1b7884918/1/0/10_vilte\
n_tas_2b_2.jpg;http://staging.tassenmagazijn.nl/media/catalog/product/2/0/10_vilte\
n_tas_2c_1.jpg;http://staging.tassenmagazijn.nl/media/catalog/product/2/0/10_vilte\
n_tas_2d_1.jpg;;;;;;;Fixed;New;Send;True;

Use/Output

$ sh parsehttppath.sh dat/httppaths.txt
10_vilten_tas_2b_2.jpg
10_vilten_tas_2c_1.jpg
10_vilten_tas_2d_1.jpg
David C. Rankin

Try something like this:

cat file.txt | grep jpg | grep http | grep "/" | awk -F "/" '{ for(i = 1; i <= NF; i++) if ($i ~ "jpg") {print $i} }' | awk -F ";" '{print $1}' | xargs
  • I made some assumptions about the jpg suffix, and the http...
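
Applied to the sample line from the question, this should print the three filenames on a single line (the final xargs joins its input into one space-separated line), roughly:

10_vilten_tas_2b_2.jpg 10_vilten_tas_2c_1.jpg 10_vilten_tas_2d_1.jpg
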
Federovsky