Discussion:
List of NOT duplicate files
Neil
2023-03-12 19:08:13 UTC
I've been doing this manually but I'm asking if there is a better way.

I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).

Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.

That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.

What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.

Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.

Is there an easier way to get that list of what file names are in
folder B that aren't yet in folder A when I move just those files over?
--
best regards,

Neil
Alan Browne
2023-03-12 19:23:34 UTC
Post by Neil
I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).
Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.
That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.
What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.
Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.
Is there an easier way to get that list of what file names are in folder
B that aren't yet in folder A when I move just those files over?
Being lazy, I just pasted everything above into ChatGPT. It immediately
output the following. Note that I'm not a Python programmer, so I can't
vouch for this. In particular, I'm not sure whether it distinguishes files
with the same name but different content.

ChatGPT:
==================================
Yes, there are several ways to obtain the list of files in folder B that
are not duplicates of what's already in folder A. One way is to use a
scripting language like Python to compare the file names in folder A and
folder B and identify the files in folder B that are not already in
folder A. Here's an example Python script that does this:

import os

folderA = 'path/to/folderA'
folderB = 'path/to/folderB'

filesA = set(os.listdir(folderA))
filesB = set(os.listdir(folderB))

newFiles = filesB - filesA

print(newFiles)
This script uses the os module to get a list of files in folder A and
folder B, and then creates sets of the file names. It then subtracts the
set of files in folder A from the set of files in folder B to get a set
of new files. The newFiles set contains the names of the files in folder
B that are not already in folder A.

You can modify the folderA and folderB variables to point to the actual
paths of your folders. You can save the script as a .py file and run it
in a Python environment, or you can run it from the command line using
python script.py.

This is just one example of how to solve the problem, and there are many
other ways to approach it depending on your specific requirements and
preferences.
==================================
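For reference, here is a slightly more defensive variant of the same set-difference idea, as a minimal sketch only. It assumes that only *.txt names matter and that names should be compared case-insensitively, as Windows normally does. The function name new_names is made up for illustration.

```python
import os


def new_names(folder_a, folder_b, ext=".txt"):
    """Return the sorted names of ext files that are in folder_b but not in folder_a."""
    # Assumption: names are compared case-insensitively, as on typical Windows volumes.
    names_a = {name.lower() for name in os.listdir(folder_a)}
    return sorted(
        name
        for name in os.listdir(folder_b)
        if name.lower().endswith(ext)
        and name.lower() not in names_a
        and os.path.isfile(os.path.join(folder_b, name))  # ignore subfolders
    )
```

Calling new_names with the paths of A and B before anything is copied gives the list of names that a no-overwrite copy would then add to A.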
--
“Donald Trump and his allies and supporters are a clear and present
danger to American democracy.”
- J Michael Luttig - 2022-06-16
- Former US appellate court judge (R) testifying to the January 6
committee
Alan Browne
2023-03-12 19:30:05 UTC
Post by Neil
I've been doing this manually but I'm asking if there is a better way.
I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).
Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.
That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.
What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.
Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.
Is there an easier way to get that list of what file names are in folder
B that aren't yet in folder A when I move just those files over?
...
So then I asked ChatGPT to make sure it accounted for the case of the
same name but different content:

ChatGPT:
==================================
No, the script I provided only checks for new files based on their
filename, so it doesn't take into account the contents of the files.

If you want to check for new files based on both their filename and
contents, you can modify the script to read the contents of each file
and compare them before deciding whether to add the file to folder A or
not. Here's an example of how you can do this in Python:


import os
import filecmp
import shutil

folderA = 'path/to/folderA'
folderB = 'path/to/folderB'

# Get a list of file names in folder A
filesA = os.listdir(folderA)

# Iterate over each file in folder B
for fileB in os.listdir(folderB):
    # Check if the file already exists in folder A
    if fileB in filesA:
        # If the file already exists, compare the contents
        if not filecmp.cmp(os.path.join(folderA, fileB),
                           os.path.join(folderB, fileB), shallow=False):
            # If the contents are different, copy the file to folder A
            # with a unique name
            shutil.copy2(os.path.join(folderB, fileB),
                         os.path.join(folderA, fileB + '_new'))
    else:
        # If the file doesn't exist in folder A, copy it over
        shutil.copy2(os.path.join(folderB, fileB),
                     os.path.join(folderA, fileB))
==================================

Which may be better. Please give it a go!
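Since the original question says that only the file names matter, a names-only version that also reports what it copied may be closer to what was asked. The following is a rough sketch under that assumption; copy_new_files and its log_path parameter are hypothetical names, not part of ChatGPT's output.

```python
import os
import shutil


def copy_new_files(folder_b, folder_a, log_path=None):
    """Copy files whose names are not yet in folder_a; return the names copied."""
    # Assumption: names are compared case-insensitively, as Windows does.
    existing = {name.lower() for name in os.listdir(folder_a)}
    copied = []
    for name in sorted(os.listdir(folder_b)):
        source = os.path.join(folder_b, name)
        if not os.path.isfile(source) or name.lower() in existing:
            continue  # skip subfolders and names already present in A
        # copy2 also tries to preserve the file's timestamps
        shutil.copy2(source, os.path.join(folder_a, name))
        copied.append(name)
    if log_path is not None:
        # Overwrite any earlier log so old names are not mixed in
        with open(log_path, "w", encoding="utf-8") as log:
            for name in copied:
                log.write(name + "\n")
    return copied
```

Files in A are never overwritten here, and the returned list (or the optional log file) is the list of names that were in B but not in A.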
--
“Donald Trump and his allies and supporters are a clear and present
danger to American democracy.”
- J Michael Luttig - 2022-06-16
- Former US appellate court judge (R) testifying to the January 6
committee
Herbert Kleebauer
2023-03-12 19:43:43 UTC
Post by Neil
I've been doing this manually but I'm asking if there is a better way.
I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).
Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.
That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.
What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.
Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.
Is there an easier way to get that list of what file names are in
folder B that aren't yet in folder A when I move just those files over?
When in directory b use this command (double the % if used in a batch program):

for %i in (*.txt) do if not exist ..\a\%i copy %i ..\a\

Maybe you want to use "move" instead of "copy"

But maybe the simplest way is to use file explorer and copy all
files of b/ to a/ and select don't replace existing files.
Neil
2023-03-13 14:14:15 UTC
Post by Herbert Kleebauer
Post by Neil
Is there an easier way to get that list of what file names are in
folder B that aren't yet in folder A when I move just those files over?
for %i in (*.txt) do if not exist ..\a\%i copy %i ..\a\
Maybe you want to use "move" instead of "copy"
But maybe the simplest way is to use file explorer and copy all
files of b/ to a/ and select don't replace existing files.
I should have been clear that I can easily copy (or move) the files
selectively because the native Windows command has an overwrite yes/no.

What I need is a separate _list_ of what was moved (or copied) over.
for %i in (*.txt) do if not exist ..\a\%i copy %i ..\a\
echo %i > list.txt

What I do now is compare time stamps of the resulting combined folder.
But sometimes the time stamps aren't what I would have thought they'd be.
--
best regards,

Neil
Herbert Kleebauer
2023-03-13 17:18:16 UTC
Post by Neil
Post by Herbert Kleebauer
Post by Neil
Is there an easier way to get that list of what file names are in
folder B that aren't yet in folder A when I move just those files over?
for %i in (*.txt) do if not exist ..\a\%i copy %i ..\a\
Maybe you want to use "move" instead of "copy"
But maybe the simplest way is to use file explorer and copy all
files of b/ to a/ and select don't replace existing files.
I should have been clear that I can easily copy (or move) the files
selectively because the native Windows command has an overwrite yes/no.
What I need is a separate _list_ of what was moved (or copied) over.
for %i in (*.txt) do if not exist ..\a\%i copy %i ..\a\
echo %i > list.txt
What I do now is compare time stamps of the resulting combined folder.
But sometimes the time stamps aren't what I would have thought they'd be.
You just answered your own question (but don't use the name
"list.txt" or it will also be listed as new text file):
While in directory b\ enter at the command prompt (but first
delete new.log if it already exists):

for %i in (*.txt) do if not exist ..\a\%i echo %i>>new.log

Or make a batch file which first changes the working
directory to b\, deletes new.log if it already exists
and then executes the above line (but double any %).
Then it doesn't matter where the batch is located and
you can start it with a double click with the mouse.
Neil
2023-03-14 01:00:20 UTC
Post by Herbert Kleebauer
Post by Neil
What I do now is compare time stamps of the resulting combined folder.
But sometimes the time stamps aren't what I would have thought they'd be.
You just answered your own question (but don't use the name
"list.txt" or it will also be listed as new text file):
While in directory b\ enter at the command prompt (but first
delete new.log if it already exists):
for %i in (*.txt) do if not exist ..\a\%i echo %i>>new.log
Or make a batch file which first changes the working
directory to b\, deletes new.log if it already exists
and then executes the above line (but double any %).
Then it doesn't matter where the batch is located and
you can start it with a double click with the mouse.
Thank you for a way to diff list file names that are in B but not A.
cd B
for %i in (*.txt) do if not exist C:\path\to\A\%i echo %i >> new.log

To which I added the copy I had been using with my manually created diff.
cd B
echo "No" | copy/-Y *.txt C:\path\to\A\

I like that your copy command also copied what's in B not already in A.
cd B
for %i in (*.txt) do if not exist C:\path\to\A\%i copy %i C:\path\to\A\

To explain what I was doing before you provided that nice diff listing
and diff batch copy commands, and using the names for that diff file that
are in this thread...

Before, what I was doing was _manually_ creating the diff file on my own.
And then I was editing that diff file to create a new.bat on my own.

vim new.log
:%s/.*/echo n | copy & C\:\\path\\to\\A\\&
:w! new.bat

The explanation of that is ":" starts a command inside of vim.
The "%" says to run the command on every line of the file.
The "s/.*" says search for any character & then zero or any number of them.
The "/" means to replace what that search found with what comes next.
Next is what to replace it with (the ampersand being what was found).
The extra backslashes are to escape characters which have other meanings.

Where the resulting new.bat file contained the respective copy commands:
echo n | copy file1.txt C:\path\to\A\file1.txt
echo n | copy file2.txt C:\path\to\A\file2.txt
echo n | copy file3.txt C:\path\to\A\file3.txt
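The same text transformation can be sketched in Python for anyone who does not use vim. This is only an illustration of the substitution, assuming one file name per line in new.log and no spaces in the names; make_copy_commands is a made-up name.

```python
import re


def make_copy_commands(log_text, target="C:\\path\\to\\A\\"):
    """Turn each file name in log_text into an 'echo n | copy' command line."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    # m.group(0) plays the role of vim's & (the whole match).
    # ".+" is used instead of ".*" so that empty matches are never replaced.
    # A function builds the replacement so the backslashes in target are
    # taken literally rather than as escape sequences.
    return [
        re.sub(r".+", lambda m: f"echo n | copy {m.group(0)} {target}{m.group(0)}", line)
        for line in lines
    ]
```

Writing the returned lines to new.bat, one per line, gives the same kind of file as the vim session above.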

But your method is more elegant than my manual editing of the diff file
to turn that diff file of what's in B not yet in A into a batch copy file.

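@echo off
rem /d also switches the drive if B is not on the current one
cd /d C:\path\to\B
rem start with a fresh log so names from earlier runs are not repeated
if exist new.log del new.log
for %%i in (*.txt) do if not exist "C:\path\to\A\%%i" echo %%i>>new.log
for %%i in (*.txt) do if not exist "C:\path\to\A\%%i" copy "%%i" "C:\path\to\A\"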

Thank you!
--
best regards,

Neil
😉 Good Guy 😉
2023-03-12 22:30:00 UTC
Post by Neil
Is there an easier way to get that list of what file names are in
folder B that aren't yet in folder A when I move just those files over?
Sure there is but it depends on whether you have excel installed on your
machine or whether you are prepared to use the Online version of Excel.
Assuming you have Excel on your machine or online version then read this
article:

<https://support.microsoft.com/en-us/office/find-and-remove-duplicates-00e35bea-b46a-4d5d-b28e-66a552dc138d>

One way to create a list of files is to run a command in the folder and
redirect the output to a text file.

Something like this: "dir /b > listA.txt"


Philip Herlihy
2023-03-13 11:38:02 UTC
Post by Neil
I've been doing this manually but I'm asking if there is a better way.
I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).
Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.
That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.
What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.
Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.
Is there an easier way to get that list of what file names are in
folder B that aren't yet in folder A when I move just those files over?
Robocopy (built-in to all recent versions of Windows) can do this. It takes
work to get to a basic competence with it, but it's so useful that it's well
worth it. There's a guide out there (Google Robocopy.doc) or type robocopy /?
into a command window. There are a LOT of command options!

Essentially robocopy <folder1> <folder2> will _selectively_ copy files from 1
to 2, only if they are newer or otherwise different. Exactly how it chooses
this is controlled by the options. Usefully, the /L option does all the
calculations but doesn't move anything. Dangerously, the /MIR option deletes
anything in the target folder that wasn't found in the source folder.

The way I often use it is to navigate in a command window to my desktop, so job
files (described below) and log files are readily seen there. I run:
robocopy <f1> <f2> <options, including /L> /SAVE:jobfile.rcj /QUIT
This creates a "job file" which can be re-run with this syntax:
robocopy /JOB:jobfile.rcj
You then edit the jobfile (double-colon comments out a line) to fool around
with the options, studying the output (you can usefully create a /LOG file and
/TEE the output to the command window). When you're confident you have it
right, you remove the /L and the files get copied next time.

In your situation, you should pay particular attention to the file "classes":
Newer, Different, Older, etc, etc. Robocopy will elegantly do exactly what you
want. It's a utility I only occasionally use, but when I do it's absolutely
invaluable.
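As a rough sketch of how the file classes map onto the question in this thread: excluding the Changed, Newer and Older classes should leave only files whose names do not yet exist in the target. The Python helper below just assembles such a command line; the function name and the *.txt filter are assumptions for illustration, and the options are worth checking against robocopy /? before relying on them.

```python
def robocopy_new_only(source, target, log_file, list_only=True):
    """Build a robocopy argument list that copies only names missing from target."""
    command = [
        "robocopy", source, target, "*.txt",
        "/XC", "/XN", "/XO",   # exclude Changed, Newer and Older files
        "/NJH", "/NJS",        # leave the job header and summary out of the report
        f"/LOG:{log_file}",    # write the report to a log file
        "/TEE",                # and show it in the command window as well
    ]
    if list_only:
        command.append("/L")   # list only: report what would be copied, copy nothing
    return command
```

On Windows the list can be passed to subprocess.run. Robocopy returns a nonzero exit code even when files were copied successfully, so the exit code should not be treated as a failure by default.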
--
Phil, London
Mr. Man-wai Chang
2023-03-13 16:06:51 UTC
List of NOT duplicate files
"NOT duplicate" = Unique? :)
Neil
2023-03-13 17:19:52 UTC
Post by Mr. Man-wai Chang
List of NOT duplicate files
"NOT duplicate" = Unique? :)
It doesn't really matter what the contents of the files are.

I only brought up the contents being different because a dumb move wouldn't
work because that would overwrite files of the same name in A with the new
contents of another file of the same name from B.

A is the master folder.

B is just the new files (some of which may have the same name and hence
would be discarded).

I should also be clear the copy (or move) isn't the problem because Windows
already won't copy over the file in A if it has the same name as in B.

What I'm after is the _list_ of files that were copied (or moved) over from
B to A (those are all going to be file names which were not in A prior).

Kind of like this pseudish code snip
cd B
for %i in (*.txt) do if not exist ..\a\%i copy %i ..\A\
echo %i >> list_of_file_names_in_b_not_in_a.txt

I don't even need to do the copy/move because that's the easy part.
The hard part for me is getting a list of file names in B not already in A.
--
best regards,

Neil
Zaidy036
2023-03-13 20:14:58 UTC
Post by Neil
Post by Mr. Man-wai Chang
List of NOT duplicate files
"NOT duplicate" = Unique? :)
It doesn't really matter what the contents of the files are.
I only brought up the contents being different because a dumb move wouldn't
work because that would overwrite files of the same name in A with the new
contents of another file of the same name from B.
A is the master folder.
B is just the new files (some of which may have the same name and hence
would be discarded).
I should also be clear the copy (or move) isn't the problem because Windows
already won't copy over the file in A if it has the same name as in B.
What I'm after is the _list_ of files that were copied (or moved) over from
B to A (those are all going to be file names which were not in A prior).
Kind of like this pseudish code snip
cd B
for %i in (*.txt) do if not exist ..\a\%i copy %i ..\A\
echo %i >> list_of_file_names_in_b_not_in_a.txt
I don't even need to do the copy/move because that's the easy part.
The hard part for me is getting a list of file names in B not already in A.
A list of file names in a folder:
DIR /B <full path folderA> > X:\FolderA.txt
where X: is a full path and not required if same folder as folderA

So make a batch:
1. Make FolderA.txt
2. Make FolderB.txt
3. Make a loop over the names in FolderB.txt (no line count is needed):
FOR /F "usebackq delims=" %%F IN ("FolderB.txt") DO (
IF EXIST "<full path folderA>\%%F" ECHO %%F>> X:\FolderC.txt
)
FolderC will contain file names from FolderB that are also in FolderA
(use IF NOT EXIST instead to list the names that are not yet in FolderA)
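If the two listings from steps 1 and 2 already exist, subtracting one from the other can also be sketched in a few lines of Python. This assumes one bare file name per line, as DIR /B writes them, and a case-insensitive comparison; names_missing_from_a is a hypothetical helper name.

```python
def names_missing_from_a(listing_a, listing_b):
    """Return names from listing_b that do not appear in listing_a."""
    # errors="replace" is a precaution: DIR output may not be UTF-8 encoded.
    with open(listing_a, encoding="utf-8", errors="replace") as handle:
        in_a = {line.strip().lower() for line in handle if line.strip()}
    with open(listing_b, encoding="utf-8", errors="replace") as handle:
        return [
            line.strip()
            for line in handle
            if line.strip() and line.strip().lower() not in in_a
        ]
```

For example, names_missing_from_a("FolderA.txt", "FolderB.txt") would return the names in B that are not yet in A, in the order they appear in FolderB.txt.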
Graham J
2023-03-13 20:46:33 UTC
Zaidy036 wrote:


Robocopy will do what you want.

Use its logging feature, and it will report the files it copies.
--
Graham J
knuttle
2023-03-13 20:47:00 UTC
Post by Zaidy036
Post by Neil
Post by Mr. Man-wai Chang
List of NOT duplicate files
"NOT duplicate" = Unique? :)
It doesn't really matter what the contents of the files are.
I only brought up the contents being different because a dumb move wouldn't
work because that would overwrite files of the same name in A with the new
contents of another file of the same name from B.
A is the master folder.
B is just the new files (some of which may have the same name and hence
would be discarded).
I should also be clear the copy (or move) isn't the problem because Windows
already won't copy over the file in A if it has the same name as in B.
What I'm after is the _list_ of files that were copied (or moved) over from
B to A (those are all going to be file names which were not in A prior).
Kind of like this pseudish code snip
cd B
for %i in (*.txt) do if not exist ..\a\%i copy %i ..\A\
echo %i >> list_of_file_names_in_b_not_in_a.txt
I don't even need to do the copy/move because that's the easy part.
The hard part for me is getting a list of file names in B not already in A.
DIR /B <full path folderA> > X:\FolderA.txt
where X: is a full path and not required if same folder as folderA
1. Make FolderA.txt
2. Make FolderB.txt
3. FOR /f %%A in ('DIR /A:D /B "FolderB.txt" ^| FIND /V /C ""') DO SET
cnt=%%A
   FOR /L %%f IN (1,1,%cnt%) DO (
   SET /P _Fname=<FolderB.txt
   IF EXIST %<full path folderA>%\!_Fname! echo !_Fname!>> X:\FolderC.txt
   MORE +1 FolderB.txt
   )
FolderC will contain file names from FolderB that are also in FolderA
There is another way that is not elegant.

Run the following Batch file in each directory

Dir *.* /s >text.txt

Open the first text file in column A of the spreadsheet and sort the column

open the second text file in Column C of the spreadsheet and sort that
column. You may need to open the second text file in another spreadsheet
and copy it to column C of the first spreadsheet.

Enter =IF(A1=C1,1,0) in B1 and fill it down Column B.

Start at the top and every time you see a 0 in Column B, make whatever
adjustments are necessary in Column A or C so that the cells line up
again and Column B shows 1. You may need to copy the cell in Column B
down to the bottom again, depending on how you make the adjustment.

When Column B is all 1s, you have identical data in the cells of
Columns A and C.

I find I can compare the two data sets using this method faster than I
can create and debug a script or command.
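For comparison, the column-alignment procedure above amounts to merging two sorted lists. A minimal Python sketch of that idea follows, assuming each list holds bare file names; align_sorted is a made-up name, and a blank string stands for an empty spreadsheet cell.

```python
def align_sorted(names_a, names_c):
    """Pair up two name lists the way the two sorted columns end up after adjusting."""
    a, c = sorted(names_a), sorted(names_c)
    rows, i, j = [], 0, 0
    while i < len(a) or j < len(c):
        if j >= len(c) or (i < len(a) and a[i] < c[j]):
            rows.append((a[i], ""))      # name appears only in column A
            i += 1
        elif i >= len(a) or c[j] < a[i]:
            rows.append(("", c[j]))      # name appears only in column C
            j += 1
        else:
            rows.append((a[i], c[j]))    # same name in both columns
            i += 1
            j += 1
    return rows
```

Rows with a blank on one side are the names that exist in only one of the two folders.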
Big Al
2023-03-14 12:41:25 UTC
Post by Neil
I've been doing this manually but I'm asking if there is a better way.
I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).
Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.
That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.
What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.
Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.
Is there an easier way to get that list of what file names are in folder B that aren't yet in folder A when I move just
those files over?
You might use meld.
meld <folder A> <folder B>
It will highlight all the files that are not in A. Also, it will highlight
all the files that are in both but whose text differs.
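For a text-only report of the same two categories, Python's standard filecmp module offers a similar folder comparison. This is a small sketch, not a description of how meld works internally; compare_folders is a hypothetical name.

```python
import filecmp


def compare_folders(folder_a, folder_b):
    """Return (names only in folder_b, names in both whose files differ)."""
    comparison = filecmp.dircmp(folder_a, folder_b)
    # right_only: names found only in the second folder (B here).
    # diff_files: names in both folders whose files compare as different.
    # dircmp's default comparison is shallow, so files with the same size and
    # modification time are treated as equal without being read.
    return sorted(comparison.right_only), sorted(comparison.diff_files)
```

The first list returned is the set of names in B that are not in A, which is the list asked for at the start of the thread.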
--
Al
wasbit
2023-03-15 09:37:49 UTC
Post by Big Al
Post by Neil
I've been doing this manually but I'm asking if there is a better way.
I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).
Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.
That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.
What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.
Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.
Is there an easier way to get that list of what file names are in
folder B that aren't yet in folder A when I move just those files over?
You might use meld.
meld <folder A> <folder B>
It will highlight all the files that are not in A.  Also it will
highlight all the files that are in both but the text in them is diff.
Presume you mean
- http://meldmerge.org/

-
Regards
wasbit
Big Al
2023-03-15 12:20:05 UTC
Post by wasbit
Post by Big Al
Post by Neil
I've been doing this manually but I'm asking if there is a better way.
I have two folders of text files, some of which have duplicate file names.
The _contents_ of those text files, even with the same name, are different.
But everything is dependent only on the file names (not on their contents).
Folder A is the master folder containing thousands of A\*.txt files.
Folder B is the folder whose new files (only!) are added to folder A.
That is, if a B\*.txt file name is already in folder A, it's not added.
Only if a B\*.txt file name is NOT already in folder A, is it added.
Folder A always grows but it never contains a duplicate file name from B.
What I need is a list of the files that were in folder B that were not
duplicates of what was already in folder A when they were copied/moved.
Is there an easy way to obtain that list which is better than how I'm
manually doing it today, which is to make a list of what file names are
originally in A and I make a list of the file names in B that are left
after I copy/move the files over without duplication and then I subtract
one list from the other.
Is there an easier way to get that list of what file names are in folder B that aren't yet in folder A when I move
just those files over?
You might use meld.
meld <folder A> <folder B>
It will highlight all the files that are not in A.  Also it will highlight all the files that are in both but the text
in them is diff.
Presume you mean
 - http://meldmerge.org/
  -
Regards
wasbit
Yes.
--
Al