I have directories with xml files. Each XML file looks similar to this:
<songs>
<song>
<title>Some Title</title>
<artist>Artist</artist>
</song>
<song>
<title>Other Title</title>
<artist>Artist 1</artist>
</song>
<song>
<title>Some Title</title>
<artist>Artist/artist>
</song>
</songs>
My Postgresql table have three columns (title, artist, date of insert). I'm using RecursiveDirectoryIterator to iterate through all xml files.
QUESTIONS:
What will be the best approach to accomplish bulk Upsert (the upsert will be insert if not exists and update the date of insert if exist. The distinct is artist and title pair)?
Is there any way to pass the XML directly to the Postgresql by getting it with:
simplexml_load_file
Or i've to get it as string with:
file_get_contents
And then parse it back to XML within the Postgresql.
- What will be the preferred way to accomplish this task. Generate a huge XML in the php and then pass it to Postgresql, or send it file by file(file chunks method)?
I've tried Inserting by Rows from the XML but this will be extremely inefficient when having lets say 100XML files with 100 songs each. So I'm trying to wrap my head about the way to Bulk Upsert on File level, but I'm stuck..
ANY help will be highly appreciated...
EDIT:1 I want to accomplish this:
INSERT INTO Songs(Artist, Title, DateOfInsertion)
SELECT x.Artist,
,x.Title
FROM xpath('//song', xml('<songs><song><title>Some Title</title><artist>Artist</artist></song><song><title>Other Title</title><artist>Artist 1</artist></song><song><title>Some Title</title><artist>Artist/artist></song></songs>'))) x;
EXCEPTION WHEN unique_violation THEN
UPDATE Songs
SET DateAdded = (now() at time zone 'utc')
WHERE (?????)
EDIT:2 This is what i came up with:
CREATE OR REPLACE FUNCTION insertSongs(xml TEXT) RETURNS VOID AS
$$
BEGIN
LOOP
-- first try to update the key
UPDATE Songs
SET DateAdded = (now() at time zone 'utc')
FROM unnest(xpath('//song', xml(xml))) x
WHERE Artist = extract_value('//artist', x)
AND Title = extract_value('//title', x);
IF found THEN
RETURN;
END IF;
-- not there, so try to insert the key
-- if someone else inserts the same key concurrently,
-- we could get a unique-key failure
BEGIN
INSERT INTO Songs(Artist, Title)
SELECT extract_value('//artist', x)
,extract_value('//title', x)
FROM unnest(xpath('//song', xml(xml))) x;
RETURN;
EXCEPTION WHEN unique_violation THEN
-- Do nothing, and loop to try the UPDATE again.
END;
END LOOP;
END;
$$
LANGUAGE plpgsql;
select insertSongs('<songs><song><artist>A</artist><title>A</title></song><song><artist>B</artist><title>B</title></song></songs>')
And The Extract_Value function looks like this:
CREATE OR REPLACE FUNCTION extract_value(
VARCHAR,
XML
) RETURNS TEXT AS
$$
SELECT CASE WHEN $1 ~ '@[[:alnum:]_]+$'
THEN (xpath($1, $2))[1]
WHEN $1 ~* '/text()$'
THEN (xpath($1, $2))[1]
WHEN $1 LIKE '%/'
THEN (xpath($1 || 'text()', $2))[1]
ELSE (xpath($1 || '/text()', $2))[1]
END::text;
$$ LANGUAGE 'sql' IMMUTABLE;
But now only one of the statements is executing.(e.g) If i have Record for Song with Artist A and Title A already in the DB but not having Artist B and Title B the procedure updates only A A Date of insertion and skips the insert for B B. If i don't have both, it inserts them.