The Digital Trojan Horse: Widespread Data Collection and Its Impact on Equity
- Alina Huang
- Nov 18
- 6 min read
Original Article by Christophe Perrenoud, Huong Perrenoud, and Pastor Perrenoud
The technological disruption of Artificial Intelligence (AI) is reshaping the world, society, and everyday life. Its promise is profound, but its fallible deployment casts a growing shadow, threatening to erode privacy and digital well-being. This article examines the danger posed by weak policy oversight of mass public data collection and by AI-driven cybersecurity threats, addressing the crossroads of AI and the unregulated data pipelines nourishing its development (IAPP). AI systems already process Open-Source Intelligence (OSINT), raising concerns about the ingestion of expansive, unregulated data streams (Pavlovic). Existing laws that classify sensitive personal information as “publicly available” compound these vulnerabilities, which AI can then exploit (ICCT). The result is a dilemma: individuals are left to endure the risk while their rights to privacy and security in the digital epoch are diminished. This discussion focuses on how the United States can prepare for AI’s impacts by exposing critical flaws in existing privacy laws and formulating contingencies to ensure equity, privacy, the preservation of identity, and security.
The Problem: Unregulated Data Collection and Its Societal Impact
AI models are trained and refined on data drawn from purchased, public, and semi-public sources; often the data is scraped without permission. Social media and OSINT sources are abundant, and many digital users show little restraint or consideration for fairness, consequences, or personal and societal impacts. These contemporary digital behaviors propel the availability of exploitable information, which fuels cyberattacks and lets AI generate convincing scripts for social engineering, scams, and manipulative content (RAND).
The widespread growth of collected data becomes dangerous when combined with stolen information from cybersecurity breaches and black-market data dumps. Scattered details from breaches (email addresses, passwords, names, financial information) can be matched with publicly collected data from brokers, allowing AI to construct full identity profiles and conduct digital reconnaissance on individuals. Consequently, a new kind of liability emerges, in which a person’s online history, property records, and personal connections are automatically compiled into a detailed dossier.
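To make the aggregation risk concrete, consider a minimal sketch in Python. Every record, field name, and value below is a hypothetical toy example; the point is only that joining two individually limited datasets on a single shared key, here an email address, produces a profile far richer than either source alone.

```python
# Hypothetical toy records (not real data) illustrating aggregation risk:
# two individually limited datasets, joined on one shared key, yield a
# far more complete profile than either holds alone.

breach_dump = [  # fragments leaked in a hypothetical breach
    {"email": "jane@example.com", "password_hash": "5f4dcc3b...", "name": "Jane Doe"},
]

broker_records = {  # "publicly available" data sold by a hypothetical broker
    "jane@example.com": {
        "home_address": "123 Main St",
        "property_value": 450_000,
        "relatives": ["John Doe"],
    },
}

def build_dossier(breach, broker):
    """Join breach fragments with broker records on the shared email key."""
    return [{**record, **broker.get(record["email"], {})} for record in breach]

print(build_dossier(breach_dump, broker_records))
# One leaked credential fragment now carries an address and family ties.
```

Even this toy join turns a single leaked credential into the seed of a dossier; real brokers hold hundreds of such fields per person.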
Two identifiable threats stand out:
• AI-Powered All-in-One Cyberattack Tools: AI can extract and extrapolate data to generate personal profiles, produce deepfake audio and visuals, and target individuals, families, or professional networks. AI can automate phishing campaigns by leveraging compromised accounts and generating convincing extortion emails, as demonstrated in a documented cybercrime spree (Collier).
• Unaccountable AI Models: Data brokers operate in a legal gray area, selling information that, while individually “public,” becomes far more dangerous when combined—illustrating Aristotle’s idea that the whole can be greater than the sum of its parts. This aggregated data trains AI models that automate targeted scams, unfair practices, and other forms of exploitation.
Currently, due to the complexity of technology and regulations, individuals cannot practically know what information has been collected about them, who purchased it, or how it is used. This lack of transparency and control is a central issue that must be addressed to ensure a fair and safe digital future.
Case Study: The “Public Records” Loophole
A clear example arises in the real estate sector. Websites such as Homes.com offer comprehensive information on properties, mortgages, and ownership records, acting as “goldmine” sources for reconnaissance. When individuals request to have sensitive information removed, the companies often reply that the data is “publicly published” and therefore cannot be deleted, suggesting instead that users “contact their local county recorder’s office.” That process is legally cumbersome and rarely leads to resolution.
This illustrates a loophole: a single record (a deed) may legitimately be public, yet a for-profit company can compile and aggregate such records at mass scale, for undisclosed uses and without regulation. Such collections can be purchased by anyone, including malicious actors, for purposes ranging from automated penetration attacks and phishing campaigns to identity theft. Critically, aggregated data lets attackers map stolen fragments from breaches and black markets to real identities, enabling precise targeting.
Policy Analysis: Strengths and Weaknesses of the CCPA
The California Consumer Privacy Act (CCPA) and its successor, the California Privacy Rights Act (CPRA), serve as early models for data privacy regulation in the United States.
Strengths:
• Know and Delete: Consumers have the right to know what personal information is collected and request deletion, enabling a basic level of transparency.
• Opt-Out of Sale and Sharing: The law requires businesses to provide a “Do Not Sell or Share My Personal Information” link (Thomson Reuters).
• Sensitive Information Protections: The CPRA protects “sensitive personal information,” such as social security numbers and biometrics, and allows consumers to restrict its use (Jackson Lewis).
Weaknesses:
• Publicly Available Information Exception: The CCPA excludes publicly available information. This loophole permits data brokers to compile and resell records, as illustrated in the Homes.com case (Digital Life Initiative).
• Enforcement Challenges: Enforcement rests with the state Attorney General, whose limited resources cannot keep pace with the high volume of potential violations (Street Fight).
• Opt-Out vs. Opt-In: The law relies on opt-out, which places the burden on consumers and effectively discriminates against older, disabled, or non-tech-savvy populations. A stronger opt-in standard, illustrated in the sketch after this list, would provide more equitable and inclusive protection in the AI age.
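The difference between the two defaults can be expressed in a few lines of code. The sketch below is a hypothetical illustration of the logic, not the text of any statute: under opt-out, a consumer who never sees or acts on the notice is sold by default, while under opt-in, that same silence protects them.

```python
from dataclasses import dataclass

# Illustrative sketch (hypothetical policy logic, not any statute's text)
# of why the default matters: under opt-out, inaction means data flows;
# under opt-in, inaction means it does not.

@dataclass
class Consumer:
    responded: bool = False   # many consumers never see or act on the notice
    gave_consent: bool = False

def may_sell_opt_out(c: Consumer) -> bool:
    # Opt-out (CCPA-style): sale is allowed unless the consumer objects.
    return not (c.responded and not c.gave_consent)

def may_sell_opt_in(c: Consumer) -> bool:
    # Opt-in: sale is allowed only after affirmative consent.
    return c.responded and c.gave_consent

silent_consumer = Consumer()  # took no action at all
print(may_sell_opt_out(silent_consumer))  # True  -> data is sold by default
print(may_sell_opt_in(silent_consumer))   # False -> data stays protected
```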
While the CCPA is commendable, its weaknesses expose gaps in the legal framework that an unregulated data supply chain continues to exploit.
Policy Recommendations for an Equitable AI Future
To prepare the country for the societal impacts of AI and cybersecurity threats, this article proposes a new policy framework built on the following actions:
• Regulate Public Records Aggregation: Establish a clear definition of, and distinction between, a single publicly available record and the mass-scale, for-profit collection of such records. Require companies that collect publicly available data to register as data brokers and to follow strict rules on data minimization, purpose limitation, and deletion. This would make it more difficult for criminals to purchase the data needed to enrich information from a breach.
• Establish a National Data Broker Registry and Audit Right: Mandate a national registry of data brokers with a public-facing website. Give individuals the right to be notified, to check what data brokers hold about them, and to have that data permanently deleted. This is a critical digital defense, allowing individuals to manage profiles that could otherwise be used to link stolen data to a complete personal identity.
• Establish Data Broker Know Your Customer (KYC): Require data brokers to adopt a Know Your Customer (KYC) policy before selling personal data, as sketched after this list. Just as financial institutions and crypto platforms verify the source and legitimacy of assets, data brokers must verify the origin, consent status, and intended use of personal data. Selling data without knowing who is purchasing it, whose data it is, how it was obtained, or whether it was lawfully collected is analogous to laundering identity.
• Strengthen Protections for Sensitive Data: Mandate a clear and concise definition of sensitive data that covers not only individual data points but also data that becomes sensitive when combined or when personal profiles are created. This would hold data brokers to stricter controls and prohibit the creation of profiles that enable AI-driven fraud.
• Implement a “Right to Be Forgotten” for AI: Where personal data is used for AI training, enact law enforcing the individual’s right to be informed and to have personal information removed from the model’s training dataset. This ensures that individuals have a path to discover, guard, and correct or delete inaccurate or sensitive data.
• Consent Requirement for AI Training: At the federal or state level, require comprehensive, informed permission, with full transparency, before any personal data can be used to train or fine-tune an AI model. This would prevent scraped, collected, or stolen data from serving as the foundational fuel for new systems.
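As a thought experiment, the proposed broker-side KYC gate might look like the sketch below. Every field, check, and permitted purpose is a hypothetical illustration, not a requirement drawn from any existing statute or broker system.

```python
from dataclasses import dataclass

# Hypothetical sketch of a broker-side KYC gate: the fields, checks, and
# permitted purposes below are illustrative assumptions, not any law's text.

PERMITTED_PURPOSES = {"fraud_prevention", "credit_underwriting", "academic_research"}

@dataclass
class Buyer:
    legal_name: str
    verified_identity: bool      # e.g., confirmed via government ID
    stated_purpose: str

@dataclass
class DataRecord:
    subject_id: str
    lawfully_collected: bool     # provenance documented at ingestion
    consent_on_file: bool        # subject's consent status

def approve_sale(buyer: Buyer, record: DataRecord) -> bool:
    """Allow a sale only when buyer, purpose, and provenance all check out."""
    return (buyer.verified_identity
            and buyer.stated_purpose in PERMITTED_PURPOSES
            and record.lawfully_collected
            and record.consent_on_file)

buyer = Buyer("Acme Analytics LLC", verified_identity=True,
              stated_purpose="fraud_prevention")
record = DataRecord("subject-001", lawfully_collected=True, consent_on_file=False)
print(approve_sale(buyer, record))  # False: no consent on file, sale blocked
```

The design mirrors financial KYC in one key respect: the sale fails closed, so a missing consent record or an unverified buyer blocks the transaction rather than letting it default through.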
Conclusion
As these technologies and their emerging issues shift into the mainstream, the CCPA and existing laws reveal loopholes and are no longer sufficient. As long as sensitive personal information can be collected without consent and without regulation, the combination of AI and that data poses the greatest risk. Building a safer and fairer digital future starts with regulating the source itself: the data supply chain. It is not enough to govern the final product; we must also guard the components it is built from. The path forward requires proactive legislation that puts people first by ensuring trust, privacy, and security as the foundation of the AI age.
Works Cited
Brennan Center for Justice. “States Take the Lead on Regulating Artificial Intelligence.” Brennan Center for Justice, 1 Nov. 2023, brennancenter.org/our-work/research-reports/states-take-lead-regulating-artificial-intelligence. Accessed 21 Aug. 2025.
Collier, Kevin. “Hacker Used AI to Automate ‘Unprecedented’ Cybercrime Spree, Anthropic Says.” NBC News, 27 Aug. 2025, nbcnews.com/tech/security/hacker-used-ai-automate-unprecedented-cybercrime-spree-anthropic-says-rcna227309. Accessed 31 Aug. 2025.
Digital Life Initiative. “The Promise and Pitfalls of the California Consumer Privacy Act.” Cornell Tech Digital Life Initiative, 11 Apr. 2020, dli.tech.cornell.edu/post/the-promise-and-pitfalls-of-the-california-consumer-privacy-act. Accessed 21 Aug. 2025.
Electronic Privacy Information Center. “The State of State AI Laws: 2023.” EPIC.org, 3 Aug. 2023, epic.org/the-state-of-state-ai-laws-2023/. Accessed 21 Aug. 2025.
IAPP. “A Regulatory Roadmap to AI and Privacy.” IAPP.org, 1 Nov. 2023, iapp.org/news/a/a-regulatory-roadmap-to-ai-and-privacy. Accessed 21 Aug. 2025.
International Centre for Counter-Terrorism (ICCT). “The Exploitation of Generative AI by Terrorist Groups.” ICCT.nl, 10 June 2024, icct.nl/publication/exploitation-generative-ai-terrorist-groups. Accessed 29 Aug. 2025.
Jackson Lewis. “California Consumer Privacy Act, California Privacy Rights Act FAQs for Covered Businesses.” JacksonLewis.com, 19 Jan. 2022, jacksonlewis.com/insights/california-consumer-privacy-act-california-privacy-rights-act-faqs-covered-businesses. Accessed 21 Aug. 2025.
Pavlovic, Uros. “Staying GDPR Compliant When Using OSINT for Fraud Prevention.” Trustfull, 20 Mar. 2025, trustfull.com/articles/staying-gdpr-compliant-when-using-osint-for-fraud-prevention. Accessed 31 Aug. 2025.
PwC. “How State Privacy Laws Regulate AI: 6 Steps to Compliance.” PwC.com, pwc.com/us/en/services/consulting/cybersecurity-risk-regulatory/library/tech-regulatory-policy-developments/privacy-laws.html. Accessed 21 Aug. 2025.
RAND. “Artificial Intelligence Impacts on Privacy Law.” RAND.org, 8 Aug. 2024, rand.org/pubs/research_reports/RRA3243-2.html. Accessed 21 Aug. 2025.
Street Fight. “The California Consumer Privacy Act’s Promise and Limitations.” Street Fight Magazine, 6 Jan. 2020, streetfightmag.com/2020/01/06/the-california-consumer-privacy-acts-promise-and-limitations. Accessed 21 Aug. 2025.
Thomson Reuters. “The California Consumer Privacy Act (CCPA) — Legal Glossary.” Legal.thomsonreuters.com, 4 Mar. 2025, legal.thomsonreuters.com/blog/the-california-consumer-privacy-act/. Accessed 21 Aug. 2025.